<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<?covid-19-tdm?>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">675882</article-id>
<article-id pub-id-type="doi">10.3389/fdata.2021.675882</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Development of An Individualized Risk Prediction Model for COVID-19 Using Electronic Health Record Data</article-title>
<alt-title alt-title-type="left-running-head">Mamidi et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">COVID-19 risk prediction with EHR</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Mamidi</surname>
<given-names>Tarun Karthik Kumar</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="fn" rid="fn1">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1171790/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tran-Nguyen</surname>
<given-names>Thi K.</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="fn" rid="fn1">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1251860/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Melvin</surname>
<given-names>Ryan L.</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/402448/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Worthey</surname>
<given-names>Elizabeth A.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>
<sup>1</sup>
</label>Center for Computational Genomics and Data Science, Departments of Pediatrics and Pathology, University of Alabama at Birmingham School of Medicine, <addr-line>Birmingham</addr-line>, <addr-line>AL</addr-line>, <country>United&#x20;States</country>
</aff>
<aff id="aff2">
<label>
<sup>2</sup>
</label>Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, <addr-line>Birmingham</addr-line>, <addr-line>AL</addr-line>, <country>United&#x20;States</country>
</aff>
<aff id="aff3">
<label>
<sup>3</sup>
</label>Department of Anesthesiology and Perioperative Medicine, University of Alabama at Birmingham, <addr-line>Birmingham</addr-line>, <addr-line>AL</addr-line>, <country>United&#x20;States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/928677/overview">Hong Qin</ext-link>, University of Tennessee at Chattanooga, United&#x20;States</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/689924/overview">Yaojiang Huang</ext-link>, Minzu University of China, China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1311911/overview">Minh Pham</ext-link>, University of South Florida, United&#x20;States</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Elizabeth A. Worthey, <email>eworthey@peds.uab.edu</email>
</corresp>
<fn fn-type="equal" id="fn1">
<label>
<sup>&#x2020;</sup>
</label>
<p>These authors have contributed equally to this work and share first authorship.</p>
</fn>
<fn fn-type="other">
<p>This article was submitted to Medicine and Public Health, a section of the journal Frontiers in Big&#x20;Data</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>04</day>
<month>06</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>4</volume>
<elocation-id>675882</elocation-id>
<history>
<date date-type="received">
<day>04</day>
<month>03</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>19</day>
<month>05</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Mamidi, Tran-Nguyen, Melvin and Worthey.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Mamidi, Tran-Nguyen, Melvin and Worthey</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>Developing an accurate and interpretable model to predict an individual&#x2019;s risk for Coronavirus Disease 2019 (COVID-19) is a critical step to efficiently triage testing and other scarce preventative resources. To aid in this effort, we have developed an interpretable risk calculator that utilized de-identified electronic health records (EHR) from the University of Alabama at Birmingham Informatics for Integrating Biology and the Bedside (UAB-i2b2) COVID-19 repository under the U-BRITE framework. The generated risk scores are analogous to commonly used credit scores where higher scores indicate higher risks for COVID-19 infection. By design, these risk scores can easily be calculated in spreadsheets or even with pen and paper. To predict risk, we implemented a Credit Scorecard modeling approach on longitudinal EHR data from 7,262 patients enrolled in the UAB Health System who were evaluated and/or tested for COVID-19 between January and June 2020. In this cohort, 912 patients were positive for COVID-19. Our workflow considered the timing of symptoms and medical conditions and tested the effects by applying different variable selection techniques such as LASSO and Elastic-Net. Within the two weeks before a COVID-19 diagnosis, the most predictive features were respiratory symptoms such as cough, abnormalities of breathing, pain in the throat and chest as well as other chronic conditions including nicotine dependence and major depressive disorder. When extending the timeframe to include all medical conditions across all time, our models also uncovered several chronic conditions impacting the respiratory, cardiovascular, central nervous and urinary organ systems. The whole pipeline of data processing, risk modeling and web-based risk calculator can be applied to any EHR data following the OMOP common data format. 
The results can be employed to generate questionnaires to estimate COVID-19 risk for screening in building entries or to optimize hospital resources.</p>
</abstract>
<kwd-group>
<kwd>COVID-19</kwd>
<kwd>electronic health record</kwd>
<kwd>risk prediction</kwd>
<kwd>ICD-10</kwd>
<kwd>credit scorecard model</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>Despite recent progress in Coronavirus Disease 2019 (COVID-19) vaccine approval and distribution, the pandemic continues to pose a tremendous burden to our healthcare system. Global resources to manage this current crisis continue to be in short supply. It remains critical to quickly and efficiently identify, screen and monitor individuals with the highest risks for COVID-19 so that distribution of therapeutics can be based on individual risks. Many factors including pre-existing chronic conditions (<xref ref-type="bibr" rid="B17">Liu et&#x20;al., 2020</xref>), age, sex, ethnicity and racial background, access to health care, and other socioeconomic components (<xref ref-type="bibr" rid="B26">Rashedi et&#x20;al., 2020</xref>) have been shown to affect an individual&#x2019;s risk for this disease.</p>
<p>Accordingly, several predictive models that seek to optimize hospital resource management and clinical decisions have been developed (<xref ref-type="bibr" rid="B13">Jehi et&#x20;al., 2020a</xref>; <xref ref-type="bibr" rid="B14">Jehi et&#x20;al., 2020b</xref>; <xref ref-type="bibr" rid="B8">Gong et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B16">Liang et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B33">Wynants et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B35">Zhao et&#x20;al., 2020</xref>). To a large degree, these informatic tools leverage the vast and rich health information available from Electronic Health Record (EHR) data (<xref ref-type="bibr" rid="B14">Jehi et&#x20;al., 2020b</xref>; <xref ref-type="bibr" rid="B22">Oetjens et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B23">Osborne et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B30">Vaid et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B31">Wang et&#x20;al., 2021a</xref>; <xref ref-type="bibr" rid="B32">Wang et&#x20;al., 2021b</xref>; <xref ref-type="bibr" rid="B5">Estiri et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B9">Halalau et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B27">Schwab et&#x20;al., 2021</xref>). EHR systems contain longitudinal data about patients&#x2019; demographics, health history, current and past medications, hospital admissions, procedures, current and past symptoms and conditions. Although the primary purpose of EHRs is clinical, over the last decade researchers have used them to conduct clinical and epidemiological research. This has been notable especially during the COVID-19 pandemic where such research that generated invaluable data about COVID-19 risks, comorbidities, transmission and outcomes was quickly adapted for clinical decision making (<xref ref-type="bibr" rid="B37">Daglia et&#x20;al., 2021</xref>). 
To ensure interoperability across multiple hospital systems, EHR data incorporate standard reference terminology and standard classification systems such as the International Classification of Diseases (ICD) that organize and classify diseases and procedures for facile information retrieval (<xref ref-type="bibr" rid="B38">Bowman, 2005</xref>). Incorporated into the Observational Medical Outcomes Partnership (OMOP) common data model (<xref ref-type="bibr" rid="B39">Blacketer, 2021</xref>), these ICD9/ICD10 codes facilitate systematic analyses of disparate EHR datasets across different healthcare organizations.</p>
<p>Many of these insights were generated using machine learning methods, based on multi-dimensional data (<xref ref-type="bibr" rid="B40">Mitchell, 1997</xref>). Studies have employed a variety of classification and/or regression methods including Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, AdaBoost, K-nearest-neighbor, Gradient-boosted DT, Logistic Regression, Artificial Neural Network, and Extremely Randomized Trees (<xref ref-type="bibr" rid="B40">Alballa and Al-Turaiki, 2021</xref>). Among these, the most popular methods applied to COVID-19 have been linear regression, XGBoost, and Support Vector Machine (<xref ref-type="bibr" rid="B40">Alballa and Al-Turaiki, 2021</xref>).</p>
<p>To develop a COVID-19 risk model, we chose a Logistic Regression based Credit Scorecard modeling approach to estimate the probability of COVID-19 diagnosis given an individual&#x2019;s ICD9/ICD10 encoded symptoms and conditions. Credit Scorecard is a powerful predictive modeling technique widely adopted by the financial industry to manage risks and control losses when lending to individuals or businesses by predicting the probability of default (<xref ref-type="bibr" rid="B2">Bailey, 2006</xref>). The Credit Scorecard model is most frequently used by scorecard developers not only due to its high prediction accuracy, but also due to its interpretability, transparency and ease of implementation. This method has been implemented previously for EHR data based COVID-19 risk prediction (<xref ref-type="bibr" rid="B13">Jehi et&#x20;al., 2020a</xref>; <xref ref-type="bibr" rid="B14">Jehi et&#x20;al., 2020b</xref>).</p>
<p>Feature selection methods, which attempt to retain the subset of features most applicable for classification, have been applied to increase interpretability, enhance speed, reduce data dimensionality and prevent overfitting (<xref ref-type="bibr" rid="B40">Alballa and Al-Turaiki, 2021</xref>). While there are many feature selection methods, sparse feature selection methods such as LASSO (Least Absolute Shrinkage and Selection Operator) (<xref ref-type="bibr" rid="B29">Tibshirani, 1996</xref>) and Elastic-Net (<xref ref-type="bibr" rid="B36">Zou and Hastie, 2005</xref>) provide advantages. LASSO places an upper bound on the sum of the absolute values of the model parameters, penalizing the regression coefficients according to their size. This forces some coefficients to zero, so that the corresponding features are excluded and only the most useful features are retained (<xref ref-type="bibr" rid="B29">Tibshirani, 1996</xref>). Expanded from LASSO, Elastic-Net adds a quadratic penalty term to the calculation of coefficients to prevent the &#x201c;saturation&#x201d; problem encountered when a limited number of variables are selected (<xref ref-type="bibr" rid="B36">Zou and Hastie, 2005</xref>). Several COVID-19 risk prediction models employed LASSO (<xref ref-type="bibr" rid="B8">Gong et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B16">Liang et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B42">Feng et&#x20;al., 2021</xref>) and Elastic-Net (<xref ref-type="bibr" rid="B43">Heldt et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B44">Hu et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B45">Huang et&#x20;al., 2021</xref>).</p>
<p>The major goals for this analysis were to determine whether we could: 1) leverage the existing hierarchical structure of the ICD-9/ICD-10 classification system, in an unbiased approach, to capture patients&#x2019; symptoms and conditions and estimate their probability of having a COVID-19 diagnosis, 2) examine the temporal aspect of EHR data (i.e.,&#x20;within a timeframe, for example, symptoms within 2&#xa0;weeks of infection/diagnosis) to evaluate which current symptoms and/or pre-existing conditions affect COVID-19 risk, 3) apply a Credit Scorecard modeling approach to develop and validate a predictive model for COVID-19 risk from retrospective EHR data, and 4) develop a pipeline requiring minimal manual curation capable of generating COVID-19 risk models from any EHR data using the OMOP common data model (<xref ref-type="bibr" rid="B39">Blacketer, 2021</xref>). To demonstrate the latter goal, a web application was created that takes answers from individuals and produces a COVID-19 risk score. We have made the code freely available for anyone wishing to reproduce and deploy such a model at <ext-link ext-link-type="uri" xlink:href="https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/public/covid-19_risk_predictor">gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/public/covid-19_risk_predictor</ext-link>.</p>
</sec>
<sec sec-type="materials|methods" id="s2">
<title>Materials and Methods</title>
<sec id="s2-1">
<title>Dataset</title>
<p>The UAB Informatics Institute Integrating Biology and the Bedside (i2b2) COVID-19 Limited Datasets (LDS) contain de-identified EHR data that are also part of the NIH COVID-19 Data Warehouse (<xref ref-type="bibr" rid="B19">NCATS, 2020</xref>). Data were made available through the UAB Biomedical Research Information Technology Enhancement (U-BRITE) framework. Access to the level-2 i2b2 data was granted upon self-service pursuant to an IRB exemption. Our dataset contains longitudinal data of patients in the UAB Health System who had COVID-19 testing and/or diagnosis from January to June 2020. Aggregated from six different databases, our dataset was transformed to adhere to the OMOP Common Data Model Version 5.3.1 (<xref ref-type="bibr" rid="B39">Blacketer, 2021</xref>) to enable systematic analyses of EHR data from disparate sources.</p>
<p>The UAB i2b2&#x20;COVID-19 LDS is comprised of 14 tables corresponding to different domains: PERSON, OBSERVATION_PERIOD, SPECIMEN, DEATH, VISIT_OCCURRENCE, PROCEDURE_OCCURRENCE, DRUG_EXPOSURE, DEVICE_EXPOSURE, CONDITION_OCCURRENCE, MEASUREMENT, OBSERVATION, LOCATION, CARE_SITE and PROVIDER. For the purpose of this study, we limited assessment to previously reported conditions (from CONDITION_OCCURRENCE) and lifestyle/habits (from OBSERVATION).</p>
</sec>
<sec id="s2-2">
<title>Data Processing</title>
<p>Data wrangling was performed using Python 3.8.5 with the Pandas package 1.2.1 and Numpy package 1.19.5. Code for recreating our process is freely available (see code availability statement below). The following subsections detail the information retrieved from the database tables mentioned&#x20;above.</p>
</sec>
<sec id="s2-3">
<title>Person Table</title>
<p>Demographic information (i.e.,&#x20;age, gender, race, and ethnicity) for each de-identified individual was extracted from the PERSON table. Ages were extracted using the &#x201c;year of birth&#x201d; values.</p>
</sec>
<sec id="s2-4">
<title>Measurement Table</title>
<p>Information about COVID-19 testing was stored in the Measurement table. We extracted the date, test type and test result for each person.</p>
<p>COVID-19 positivity was determined by the presence of any one of three criteria: a positive COVID-19 antibody test, a positive COVID-19 Polymerase Chain Reaction (PCR) test, or the presence of the ICD-10 U07.1 code in the EHR record. COVID-19 negativity was assigned if the person was tested for COVID-19 but never had a positive test or an ICD-10 U07.1&#x20;code.</p>
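<p>As a minimal sketch of this labeling rule (not the code used in the study), the following Python function assigns an outcome label from a simplified per-patient record. The argument names and the result string <monospace>"positive"</monospace> are hypothetical and serve only to illustrate the logic.</p>
<preformat><![CDATA[
```python
from typing import List, Optional

COVID_DIAGNOSIS_CODE = "U07.1"


def covid_label(antibody_results: List[str],
                pcr_results: List[str],
                icd10_codes: List[str]) -> Optional[int]:
    """Return 1 (positive), 0 (negative) or None (neither tested nor diagnosed).

    The three arguments are hypothetical, simplified inputs: result strings
    for each test type and the ICD-10 codes recorded for the patient.
    """
    all_results = list(antibody_results) + list(pcr_results)
    has_positive_test = any(result == "positive" for result in all_results)
    has_diagnosis_code = COVID_DIAGNOSIS_CODE in icd10_codes
    if has_positive_test or has_diagnosis_code:
        return 1
    # Negative only if the patient was actually tested at least once.
    return 0 if all_results else None
```
]]></preformat>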
</sec>
<sec id="s2-5">
<title>Condition_Occurrence Table</title>
<p>We extracted medical conditions (such as signs and symptoms, injury, abnormal findings and diagnosis) for each patient from this table by leveraging the inherent hierarchical structure of the ICD-10 classification system.</p>
</sec>
<sec id="s2-6">
<title>Observation Table</title>
<p>Lifestyle and habits (i.e.,&#x20;BMI, smoking, alcohol and substance use) were extracted from this table. This table also includes the current status (i.e.,&#x20;current, former, never or unknown) of habits for each patient.</p>
</sec>
<sec id="s2-7">
<title>Feature Filtering and Extraction</title>
<p>Demographics, lifestyle/habits and conditions (encoded by ICD-9/ICD-10) were used as features in our model. To use the current version of ICD codes as features, we converted all ICD-9 codes to ICD-10 codes using a publicly available converter script (<xref ref-type="bibr" rid="B10">Hanratty, 2019</xref>). We used these converted codes along with the original ICD-10 codes to map and extract conditions reported in the EHR for each patient.</p>
<p>Before feature extraction, we filtered out all COVID-19 related ICD-10 codes such as U07.1 (COVID-19, virus identified), Z86.16 (personal history of COVID-19), J12.82 (pneumonia due to coronavirus disease 2019), B94.8 (sequelae of COVID-19), B34.2 (Coronavirus infection, unspecified), and B97.2 (Coronavirus as the cause of diseases classified elsewhere). Discarding COVID-19-related codes is imperative to prevent data leakage in our predictive model. Data leakage refers to the inclusion of information about the target of the prediction in the features used for making the prediction that should not be (legitimately) available at the time a prediction is made (<xref ref-type="bibr" rid="B46">Huang et&#x20;al., 2000</xref>; <xref ref-type="bibr" rid="B47">Nisbet et&#x20;al., 2009</xref>; <xref ref-type="bibr" rid="B48">Kaufman et&#x20;al., 2012</xref>; <xref ref-type="bibr" rid="B49">Filho et&#x20;al., 2021</xref>).</p>
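<p>The filtering step can be sketched as follows. This is an illustration only: the function name is hypothetical, and matching by code prefix (so that, for example, a subcode of B97.2 is also removed) is an assumption rather than a documented detail of the pipeline.</p>
<preformat><![CDATA[
```python
# COVID-19-related ICD-10 codes named in the text above.
COVID_RELATED_PREFIXES = ("U07.1", "Z86.16", "J12.82", "B94.8", "B34.2", "B97.2")


def drop_covid_codes(icd10_codes):
    """Return only the codes that do not start with a COVID-19-related prefix.

    Prefix matching is an assumption made for this sketch.
    """
    return [
        code for code in icd10_codes
        if not code.upper().startswith(COVID_RELATED_PREFIXES)
    ]


example = ["R05", "U07.1", "B97.29", "F17.210"]
filtered = drop_covid_codes(example)
```
]]></preformat>
<p>In this example only R05 and F17.210 remain as candidate features, so no code that directly encodes the outcome reaches the model.</p>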
</sec>
<sec id="s3">
<title>Temporal Filter for Medical Condition data</title>
<p>For the positive cohort, we used the date of patients&#x2019; first COVID-19 testing or their first assignment of the COVID-19-related ICD-10 codes (U07.1, U07.2, Z86.16, J12.82, B94.8, B34.2, or B97.29) as the timestamp to apply a temporal filter for feature selection. For the negative cohort, we also used the date of their first COVID-19 testing as the timestamp. We define the temporal filter as a restricted timeframe used to study the effect of conditions on infection (i.e.,&#x20;to assess risk using medical conditions that occurred within 2&#xa0;weeks before an infection). This temporal filter is crucial to once again avoid data leakage by excluding features that may emerge as a result of a COVID-19 infection or diagnosis.</p>
<p>To investigate how the timing of medical events and conditions may affect the risk for COVID-19, we extracted the condition data over two distinct time intervals. The first timeframe only considers the conditions within the 2-week window prior to the date of diagnosis whereas the second timeframe retains all condition data before a given patient&#x2019;s first COVID-19 test or diagnosis.</p>
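<p>A minimal sketch of the two timeframes is given below. The function and variable names are hypothetical, and excluding conditions recorded on the index date itself is an assumption of this sketch, since the text does not state how same-day records were handled.</p>
<preformat><![CDATA[
```python
from datetime import date, timedelta


def conditions_in_window(conditions, index_date, window_days=None):
    """Keep (code, condition_date) pairs recorded strictly before index_date.

    If window_days is given, the condition must also fall within that many
    days before index_date; with window_days=None all earlier history is kept.
    """
    kept = []
    for code, condition_date in conditions:
        if condition_date >= index_date:
            continue  # recorded on or after first test/diagnosis: possible leakage
        if window_days is not None:
            if index_date - condition_date > timedelta(days=window_days):
                continue  # too old for the restricted timeframe
        kept.append((code, condition_date))
    return kept


index_date = date(2020, 4, 15)  # hypothetical first test date
records = [
    ("R05", date(2020, 4, 10)),
    ("I10", date(2019, 1, 5)),
    ("R06.02", date(2020, 4, 20)),
]
two_week_features = conditions_in_window(records, index_date, window_days=14)
all_time_features = conditions_in_window(records, index_date)
```
]]></preformat>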
</sec>
<sec id="s4">
<title>Credit Scorecard Model</title>
<sec id="s4-1">
<title>Variable (Feature) Selections</title>
<p>After extracting patients&#x2019; demographic information, lifestyle, habits and ICD-10 condition codes, we converted them to features using one-hot encoding. Features with more than 95% missing data or 95% identical values across all observations were removed. The remaining variables underwent weight-of-evidence (WoE) transformation, which standardizes the scale of features and establishes a monotonic relationship with the outcome variable (<xref ref-type="bibr" rid="B50">Zdravevski et&#x20;al., 2011</xref>). WoE transformation also handles missing values and extreme outliers while supporting interpretability by enforcing strict linear relationships (<xref ref-type="bibr" rid="B50">Zdravevski et&#x20;al., 2011</xref>). WoE transformations require all continuous or discrete variables to be binned. This binning process is carried out programmatically based on conditional inference trees (<xref ref-type="bibr" rid="B51">Hothorn et&#x20;al., 2006</xref>). Missing values for each feature are placed in their own bin and eventually assigned their own WoE values. Each level <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> of the binned values for each feature is then assigned a WoE value via <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>E</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>ln</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mtext>&#x7c;</mml:mtext>
<mml:mi>y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mtext>&#x7c;</mml:mtext>
<mml:mi>y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> where <inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mtext>&#x7c;</mml:mtext>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the conditional probability of <inline-formula id="inf4">
<mml:math id="m4">
<mml:mi>x</mml:mi>
</mml:math>
</inline-formula> given <inline-formula id="inf5">
<mml:math id="m5">
<mml:mi>y</mml:mi>
</mml:math>
</inline-formula>, and <inline-formula id="inf6">
<mml:math id="m6">
<mml:mi>y</mml:mi>
</mml:math>
</inline-formula> is the binary response variable. All values of the independent variables, including missing values, are then replaced with their corresponding WoE value (<xref ref-type="bibr" rid="B50">Zdravevski et&#x20;al., 2011</xref>; <xref ref-type="bibr" rid="B52">Szepannek, 2020</xref>). These transformed variables were then used in logistic regression to assign weights for the Scorecard.</p>
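<p>The WoE formula above can be illustrated with a short, self-contained Python sketch. It is not the implementation used in the study. The function name and the optional <monospace>smoothing</monospace> argument (an additive count that avoids taking the logarithm of zero for empty bins) are assumptions introduced for illustration.</p>
<preformat><![CDATA[
```python
import math
from collections import Counter


def woe_table(bins, labels, smoothing=0.0):
    """Compute WoE(x) = ln(P(x | y=1) / P(x | y=0)) for every bin level x.

    bins: bin level of each observation (e.g. "cough", "no_cough", "missing").
    labels: binary outcome of each observation (1 = positive, 0 = negative).
    smoothing: hypothetical additive count per bin; with 0.0 an empty bin
    in either class raises an error because ln(0) is undefined.
    """
    positives = Counter(b for b, y in zip(bins, labels) if y == 1)
    negatives = Counter(b for b, y in zip(bins, labels) if y == 0)
    levels = sorted(set(bins))
    n_pos = sum(positives.values()) + smoothing * len(levels)
    n_neg = sum(negatives.values()) + smoothing * len(levels)
    return {
        x: math.log(((positives[x] + smoothing) / n_pos)
                    / ((negatives[x] + smoothing) / n_neg))
        for x in levels
    }


# Toy data: 3 patients with cough (2 positive), 5 without (1 positive).
bins = ["cough"] * 3 + ["no_cough"] * 5
labels = [1, 1, 0, 1, 0, 0, 0, 0]
woe = woe_table(bins, labels)
```
]]></preformat>
<p>In this toy example the level "cough" receives a positive WoE, ln(10/3), because it is more frequent among positive patients, whereas "no_cough" receives a negative WoE, ln(5/12).</p>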
<p>For feature selection and regression on these transformed variables, we tested two regularization approaches, LASSO (<xref ref-type="bibr" rid="B29">Tibshirani, 1996</xref>) and Elastic-Net (<xref ref-type="bibr" rid="B36">Zou and Hastie, 2005</xref>), using a cross-validation-based logistic regression method from the Python package <italic>Scikit-Learn</italic> (version 0.23.2). This method incorporates the use of stratified cross-validation to determine optimal parameters for LASSO and Elastic-Net. LASSO is a modification to typical generalized linear modeling techniques such as logistic regression. Under the constraint that the sum of the absolute values of the model coefficients is less than a constant, the residual sum of squared errors is minimized (<xref ref-type="bibr" rid="B29">Tibshirani, 1996</xref>). The application of this constraint results in some coefficients being 0, making LASSO a simultaneous variable selection and model fitting technique. Building on LASSO, Elastic-Net adds a quadratic penalty term to the calculation of coefficients. Practically, this additional term prevents the &#x201c;saturation&#x201d; (<xref ref-type="bibr" rid="B36">Zou and Hastie, 2005</xref>) problem sometimes experienced with LASSO where an artificially limited number of variables are selected. Both techniques employ penalty terms to shrink variable coefficients to eliminate uninformative features and avoid collinearity.</p>
<p>Collinearity is a major problem in extracting features from ICD codes since some codes are frequently reported together, or different providers may use inconsistent and incomplete codes. Between the two approaches, LASSO is a more stringent variable selector. For example, in the case of two highly similar features, LASSO tends to eliminate one of them while Elastic-Net will shrink the corresponding coefficients and keep both features (<xref ref-type="bibr" rid="B11">Hastie et&#x20;al., 2001</xref>).</p>
<p>The regularization strength parameter (for both LASSO and Elastic-Net) and the mixing parameter (for Elastic-Net) were selected using 10-fold stratified cross-validation (CV). This method creates 10 versions of the model using a fixed set of parameters, each trained on 90% of the training data with 10% held out in each &#x201c;fold&#x201d; for scoring that particular instance of the model. The stratified variant of CV ensures that the distribution of classes (here COVID-positive patients and COVID-negative patients) is identical across the 90%/10% split of each fold. This process enables the model developer to assess the predictive capability of the model given the specific set of parameters being tested. The scores over all folds are averaged to assign an overall score for the given set of parameters. This process is repeated for all candidate sets of parameters being tested. Cross-validation aids in preventing overfitting, i.e.,&#x20;failing to generalize the pattern from the data, because the model is judged based on its predictions on hold-out data, which are not used for training the&#x20;model.</p>
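<p>To make the idea of stratification concrete, the sketch below assigns observations to folds so that each class is spread evenly. It is a simplified, deterministic stand-in written for illustration, not the Scikit-Learn routine used in the analysis, and the function name is hypothetical.</p>
<preformat><![CDATA[
```python
from collections import defaultdict


def stratified_folds(labels, n_folds=10):
    """Return a fold index for every observation, balancing classes per fold.

    Observations of each class are dealt out to the folds in turn, so every
    fold receives (almost) the same number of observations from each class.
    """
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    fold_of = [None] * len(labels)
    for indices in indices_by_class.values():
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of


# Toy training set: 20 positive and 80 negative patients.
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, n_folds=10)
positives_per_fold = [
    sum(1 for f, y in zip(folds, labels) if f == k and y == 1) for k in range(10)
]
```
]]></preformat>
<p>Each of the 10 folds in this toy example holds 2 positive and 8 negative patients, so every held-out fold has the same class ratio as the full training set.</p>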
<p>For scoring candidate sets of parameters, we chose negative log loss, a probability-based scoring metric, because a Scorecard model is based on probabilities rather than strict binary predictions. In particular, negative log loss penalizes predictions based on how far their probability is from the correct response (<xref ref-type="bibr" rid="B53">Bishop, 2016</xref>). For example, consider a patient who is in truth COVID-negative. A forecast that a COVID-positive diagnosis is 51% likely will be penalized less harshly than a forecast that COVID-positive is 99% likely. Conversely, a forecast that a positive diagnosis is 49% likely will be rewarded less than one that such a diagnosis is 1% likely.</p>
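<p>The worked example in the preceding paragraph can be reproduced with a few lines of Python. This is a sketch of the per-observation log loss only. The function name is hypothetical, and the sign convention (loss, where lower is better) is the negation of a negative log loss score.</p>
<preformat><![CDATA[
```python
import math


def log_loss_single(y_true, p_positive):
    """Log loss of one prediction.

    y_true: 1 for a COVID-positive patient, 0 for a COVID-negative patient.
    p_positive: predicted probability of a positive diagnosis, strictly
    between 0 and 1.
    """
    p_correct = p_positive if y_true == 1 else 1.0 - p_positive
    return -math.log(p_correct)


# A patient who is in truth COVID-negative, scored under four forecasts.
losses = {p: log_loss_single(0, p) for p in (0.01, 0.49, 0.51, 0.99)}
for p, loss in losses.items():
    print(f"P(positive) = {p:.2f} -> loss = {loss:.3f}")
```
]]></preformat>
<p>For this COVID-negative patient, the 51% forecast incurs a loss of about 0.71, while the 99% forecast incurs a loss of about 4.61. The 49% and 1% forecasts incur losses of about 0.67 and 0.01, respectively.</p>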
<p>The hyperparameter evaluated for candidate LASSO models was regularization strength, or the inverse of lambda referred to in (<xref ref-type="bibr" rid="B29">Tibshirani, 1996</xref>). One hundred candidate values on a log scale between 10<sup>&#x2212;4</sup> and 10<sup>4</sup> were considered. The model with the best score from the technique described above was considered to have the optimal hyperparameters. For Elastic-Net, the same set of regularization strength parameters was considered. Additionally, Elastic-Net has a mixing parameter that controls the relative strength of the LASSO-like penalty and the additional Elastic-Net penalty term. Ten evenly spaced values between 0 and 1 were considered for this hyperparameter.</p>
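<p>The candidate grids described above can be written out explicitly. The sketch below uses only the Python standard library. The helper names are hypothetical, and including both endpoints of each range is an assumption.</p>
<preformat><![CDATA[
```python
def log_spaced(low_exponent, high_exponent, count):
    """Return `count` values spaced evenly on a log10 scale, endpoints included."""
    step = (high_exponent - low_exponent) / (count - 1)
    return [10 ** (low_exponent + i * step) for i in range(count)]


def evenly_spaced(low, high, count):
    """Return `count` values spaced evenly on a linear scale, endpoints included."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


# 100 regularization strengths between 1e-4 and 1e4 (LASSO and Elastic-Net).
regularization_strengths = log_spaced(-4, 4, 100)
# 10 mixing values between 0 and 1 (Elastic-Net only).
mixing_values = evenly_spaced(0.0, 1.0, 10)

n_lasso_candidates = len(regularization_strengths)
n_elastic_net_candidates = len(regularization_strengths) * len(mixing_values)
```
]]></preformat>
<p>Under these assumptions, 100 candidate settings are scored for LASSO and 1,000 combinations for Elastic-Net, each with 10-fold cross-validation.</p>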
<p>To address the class imbalance between the COVID-19 positive and negative groups in the training data, we weighted each observation inversely proportional to the size of its class. Likewise, the use of a stratified cross-validation method reduces the risk of inflating some scoring metrics by the model preferring to simply predict the dominant class. Using the above methods, we compared four models to predict the risk of infection:<list list-type="simple">
<list-item>
<label>1.</label>
<p>LASSO with all conditions/features reported before the infection/diagnosis</p>
</list-item>
<list-item>
<label>2.</label>
<p>Elastic-Net with all conditions/features reported before the infection/diagnosis</p>
</list-item>
<list-item>
<label>3.</label>
<p>LASSO with only conditions/features reported within 2&#xa0;weeks of infection/diagnosis</p>
</list-item>
<list-item>
<label>4.</label>
<p>Elastic-Net with only conditions/features reported within 2&#xa0;weeks of infection/diagnosis</p>
</list-item>
</list>
</p>
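<p>The inverse class-size weighting mentioned above can be sketched as follows. The normalization used here, n_samples / (n_classes &#xd7; class_count), is one common convention and is an assumption of this sketch. The class counts are those of the full cohort and are used only for illustration, since the weights in the study were computed on the training data.</p>
<preformat><![CDATA[
```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its size.

    Uses n_samples / (n_classes * class_count), so the total weight of each
    class is the same regardless of how many observations it contains.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative counts: 912 positive and 6,350 negative patients.
labels = [1] * 912 + [0] * 6350
weights = balanced_class_weights(labels)
```
]]></preformat>
<p>With these counts each positive observation receives a weight of about 3.98 and each negative observation about 0.57, so both classes contribute equally to the fitted loss.</p>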
</sec>
<sec id="s4-2">
<title>Model Evaluations</title>
<p>Data were randomly split into 80% for the train set and 20% for the test set. The quality of the four models built from two different time-filtered datasets and two different regularization techniques was evaluated by plotting the Receiver Operating Characteristic (ROC) curve and measuring the corresponding Area Under the ROC Curve (AUC). We also considered other model quality metrics such as Accuracy (ACC), the percentage of correct responses, and F-score, the harmonic mean of precision and recall. We also used the confusion matrices to judge the quality of our candidate models. Considering that these models are built to recommend COVID-19 testing, we sought to avoid False Negative predictions while being more lenient towards False Positive errors.</p>
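<p>As an illustration of the threshold-based metrics named above, the following sketch computes the confusion-matrix counts, accuracy and F-score for a small set of made-up predictions. The function names and the toy data are hypothetical.</p>
<preformat><![CDATA[
```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 if either is undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predictions for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
f1 = f_score(tp, fp, fn)
```
]]></preformat>
<p>Here one positive patient is missed (a False Negative) and one negative patient is flagged (a False Positive), giving an accuracy of 0.75 and an F-score of 2/3. For a testing recommendation tool, the missed patient is the more costly of the two errors.</p>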
</sec>
<sec id="s4-3">
<title>Risk Score Scaling Using the Scorecard Method</title>
<p>Coefficients from the resulting logistic regression models were then combined with the WoE-transformed variables to establish scores for each feature in the Scorecard. This scorecard generation was performed using the Scorecard method implemented in the <italic>scorecardpy</italic> Python package (version 0.1.9.2). As opposed to pure logistic regression models, scorecard models allow a strictly linear combination of scores that can be calculated even on a piece of paper, without the aid of any technology. Calculating the probabilities from a logistic regression model would require inverse transformations of log odds. We chose the scorecard model for the strict linear interpretation and corresponding ease of deployment anywhere.</p>
<p>This method requires users to select target odds and target points (a baseline number of points corresponding to a baseline score) along with the points required to double the odds. As these choices are arbitrary, we used the package defaults, which set the target odds to 1/19, the corresponding target points to 600, and the default points required to double the odds to 50. <xref ref-type="sec" rid="s12">Supplemental Figure S1</xref> shows an example of a Scorecard distribution calculated in this manner. Since the final Scorecard model is a linear function of the predictors (i.e.,&#x20;higher scores indicate higher COVID-19 risks), using scorecards has many benefits such as transparency, interpretability and facile implementation.</p>
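<p>The scaling described above can be illustrated with the standard scorecard formulas. The sketch below is not the <italic>scorecardpy</italic> implementation; it assumes the common convention score = offset + factor times ln(odds), with factor = PDO / ln 2 and the offset chosen so that the target odds map to the target points. Whether the score rises or falls with the odds is a sign convention that a given package may fix differently; here it is oriented so that higher scores mean higher risk, as stated above. The function names are hypothetical.</p>
<preformat>
```python
import math


def scaling_constants(target_points=600.0, target_odds=1 / 19, pdo=50.0):
    """Return (offset, factor) so that target_odds maps to target_points
    and each doubling of the odds adds pdo points."""
    factor = pdo / math.log(2)
    offset = target_points - factor * math.log(target_odds)
    return offset, factor


def score_from_log_odds(log_odds, offset, factor):
    """Linear map from the model's log-odds to scorecard points."""
    return offset + factor * log_odds


offset, factor = scaling_constants()
at_target = score_from_log_odds(math.log(1 / 19), offset, factor)
at_double = score_from_log_odds(math.log(2 / 19), offset, factor)
print(round(at_target), round(at_double))  # 600 650
```
</preformat>
<p>Because the map is linear in the log-odds, each term of the logistic regression (a coefficient multiplied by a WoE value) contributes an additive number of points, which is what allows the final score to be summed by hand.</p>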
</sec>
</sec>
<sec id="s5">
<title>Building a Web Application to Predict COVID-19 Risks</title>
<p>Based on the final Scorecard model results, we used the <italic>streamlit</italic> package (version 0.77.0) in Python to build an interface and an interactive indicator plot from <italic>plotly</italic> to visualize the risk score. The Python code to build this application can be found in our GitLab repository at <ext-link ext-link-type="uri" xlink:href="http://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/public/covid-19_risk_predictor">gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/public/covid-19_risk_predictor</ext-link>.</p>
</sec>
</sec>
<sec sec-type="results" id="s6">
<title>Results</title>
<p>Our dataset comprised 7,262 patients from within the UAB Health System who received COVID-19 testing or diagnosis from January to June 2020. The demographic information of this study population is shown in <xref ref-type="table" rid="T1">Table&#x20;1</xref>. Among them, 912 patients were diagnosed with COVID-19 and the remaining 6,350 patients were not. On average, patients in the positive group received about 22% more COVID-19 tests (1.46 vs. 1.20 tests/person). While there was no statistically significant difference in age and gender between the two groups, African American (46 vs. 39%), Asian (3 vs. 1%) and Others (11 vs. 3%) patients were overrepresented in the positive group, a finding that is concordant with other reports about racial disparities in COVID-19 (<xref ref-type="bibr" rid="B15">Kullar et&#x20;al., 2020</xref>). In this UAB Health System dataset, a greater proportion of patients in the negative group reported substance abuse (14 vs. 3%) and current smoking (25 vs. 9%). There was no difference in Body Mass Index (BMI) between the two groups. Although the COVID-19 negative group had more reported medical conditions (178 vs. 142 medical conditions/person), they had fewer unique medical conditions (4 vs. 10 unique conditions/person).</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Demographics and Clinical Characteristics of the UAB LDS N3C Cohort.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="3" align="center">UAB LDS N3C cohort (<italic>n</italic>&#x20;&#x3d; 7,262)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="3" align="left">COVID-19 testing:</td>
</tr>
<tr>
<td align="left">&#x2003;COVID-19 results</td>
<td align="center">Positive (<italic>n</italic>&#x20;&#x3d; 912)</td>
<td align="center">Negative (<italic>n</italic>&#x20;&#x3d; 6,350)</td>
</tr>
<tr>
<td align="left">&#x2003;Total COVID tests</td>
<td align="center">1,328</td>
<td align="center">7,596</td>
</tr>
<tr>
<td align="left">&#x2003;COVID Tests/Person</td>
<td align="center">1.46</td>
<td align="center">1.20</td>
</tr>
<tr>
<td colspan="3" align="left">All medical tests:</td>
</tr>
<tr>
<td align="left">&#x2003;All tests</td>
<td align="center">1,951,404</td>
<td align="center">17,395,613</td>
</tr>
<tr>
<td align="left">&#x2003;All tests/person</td>
<td align="center">2,139</td>
<td align="center">2,739</td>
</tr>
<tr>
<td align="left">&#xa0;&#xa0;Age</td>
<td align="center">mean &#x3d; 52 (10&#x2013;119)</td>
<td align="center">mean &#x3d; 52 (&#x3c;1&#x2013;119)</td>
</tr>
<tr>
<td colspan="3" align="left">Gender:</td>
</tr>
<tr>
<td align="left">&#x2003;Male (%)</td>
<td align="center">394 (43%)</td>
<td align="center">3,035 (48%)</td>
</tr>
<tr>
<td align="left">&#x2003;Female (%)</td>
<td align="center">516 (57%)</td>
<td align="center">3,314 (52%)</td>
</tr>
<tr>
<td align="left">&#x2003;Unknown (%)</td>
<td align="center">2 (0%)</td>
<td align="center">1 (0%)</td>
</tr>
<tr>
<td colspan="3" align="left">Race:</td>
</tr>
<tr>
<td align="left">&#x2003;White (%)</td>
<td align="center">337 (37%)</td>
<td align="center">3,441 (54%)</td>
</tr>
<tr>
<td align="left">&#x2003;Black (%)</td>
<td align="center">416 (46%)</td>
<td align="center">2,497 (39%)</td>
</tr>
<tr>
<td align="left">&#x2003;Asian (%)</td>
<td align="center">27 (3%)</td>
<td align="center">70 (1%)</td>
</tr>
<tr>
<td align="left">&#x2003;Hispanic (%)</td>
<td align="center">28 (3%)</td>
<td align="center">174 (3%)</td>
</tr>
<tr>
<td align="left">&#x2003;Others (%)</td>
<td align="center">104 (11%)</td>
<td align="center">168 (3%)</td>
</tr>
<tr>
<td colspan="3" align="left">Conditions:</td>
</tr>
<tr>
<td align="left">&#x2003;Total conditions</td>
<td align="center">129,091</td>
<td align="center">1,133,396</td>
</tr>
<tr>
<td align="left">&#x2003;Unique conditions</td>
<td align="center">9,224</td>
<td align="center">24,101</td>
</tr>
<tr>
<td align="left">&#x2003;&#x23;Conditions/Person</td>
<td align="center">142</td>
<td align="center">178</td>
</tr>
<tr>
<td align="left">&#x2003;&#x23;Unique conditions/Person</td>
<td align="center">10</td>
<td align="center">4</td>
</tr>
<tr>
<td colspan="3" align="left">Smoking:</td>
</tr>
<tr>
<td align="left">&#x2003;Current smoker</td>
<td align="center">81 (9%)</td>
<td align="center">1,602 (25%)</td>
</tr>
<tr>
<td align="left">&#x2003;Former smoker</td>
<td align="center">196 (21.5%)</td>
<td align="center">1,625 (26%)</td>
</tr>
<tr>
<td align="left">&#x2003;Never smoker</td>
<td align="center">368 (40%)</td>
<td align="center">2,589 (41%)</td>
</tr>
<tr>
<td align="left">&#x2003;Unknown</td>
<td align="center">13 (1%)</td>
<td align="center">64 (1%)</td>
</tr>
<tr>
<td colspan="3" align="left">Substance use:</td>
</tr>
<tr>
<td align="left">&#x2003;Current substance abuse</td>
<td align="center">27 (3%)</td>
<td align="center">895 (14%)</td>
</tr>
<tr>
<td align="left">&#x2003;No substance abuse</td>
<td align="center">632 (69%)</td>
<td align="center">4,716 (74%)</td>
</tr>
<tr>
<td align="left">&#x2003;Former substance abuse</td>
<td align="center">32 (3.5%)</td>
<td align="center">402 (6%)</td>
</tr>
<tr>
<td align="left">&#x2003;Unknown</td>
<td align="center">15 (1.6%)</td>
<td align="center">74 (1%)</td>
</tr>
<tr>
<td colspan="3" align="left">Alcohol use:</td>
</tr>
<tr>
<td align="left">&#x2003;Current alcohol</td>
<td align="center">273 (30%)</td>
<td align="center">1,954 (31%)</td>
</tr>
<tr>
<td align="left">&#x2003;Former alcohol</td>
<td align="center">58 (6%)</td>
<td align="center">652 (10%)</td>
</tr>
<tr>
<td align="left">&#x2003;No alcohol</td>
<td align="center">379 (41.5%)</td>
<td align="center">3,459 (54.5%)</td>
</tr>
<tr>
<td align="left">&#x2003;Unknown</td>
<td align="center">12 (1.3%)</td>
<td align="center">80 (1%)</td>
</tr>
<tr>
<td colspan="3" align="left">Weight:</td>
</tr>
<tr>
<td align="left">&#x2003;Underweight (BMI &#x3c; 19)</td>
<td align="center">20 (2%)</td>
<td align="center">271 (4%)</td>
</tr>
<tr>
<td align="left">&#x2003;Normal weight (BMI &#x3d; 20&#x2013;25)</td>
<td align="center">49 (5%)</td>
<td align="center">563 (9%)</td>
</tr>
<tr>
<td align="left">&#x2003;Overweight (BMI &#x3d; 25&#x2013;40)</td>
<td align="center">320 (35%)</td>
<td align="center">2,439 (38%)</td>
</tr>
<tr>
<td align="left">&#x2003;Obese (BMI &#x3e; 40)</td>
<td align="center">120 (13%)</td>
<td align="center">773 (12%)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The workflow to build the predictive model for COVID-19 diagnosis based on EHR data is summarized in <xref ref-type="fig" rid="F1">Figure&#x20;1</xref>. We used condition data extracted from ICD-9/ICD-10 codes from two different timeframes to assess how the timing of medical symptoms and conditions may affect our COVID-19 risk predictions. The first timeframe considers the data reported within a 2-week window of testing/diagnosis, while the second timeframe retains all condition data prior to a COVID-19 test or diagnosis. Such condition data suffer from collinearity issues in that a group of medical conditions tends to be reported together, and different providers may use inconsistent codes for the same conditions. To address these collinearity issues, we utilized two different regularized regression techniques, LASSO and Elastic-Net. Applying these two methods to the two data timeframes yielded four different models with reasonable discriminatory power, as judged by performance metrics on testing data. With LASSO, we achieved 0.75 accuracy and 0.84 [CI: 0.81&#x2013;0.87] AUC for the 2-week data and 0.74 accuracy and 0.80 [CI: 0.76&#x2013;0.83] AUC for all-time data (<xref ref-type="fig" rid="F2">Figure&#x20;2</xref>; <xref ref-type="table" rid="T2">Table&#x20;2</xref>). Elastic-Net models performed similarly, with an accuracy of 0.76 and AUC of 0.84 [CI: 0.81&#x2013;0.87] for the 2-week data and an accuracy of 0.74 and AUC of 0.79 [CI: 0.76&#x2013;0.83] for the all-time data (<xref ref-type="fig" rid="F2">Figure&#x20;2</xref>; <xref ref-type="table" rid="T2">Table&#x20;2</xref>).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Overview of workflow.</p>
</caption>
<graphic xlink:href="fdata-04-675882-g001.tif"/>
</fig>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>LASSO vs. Elastic-Net model performance on two sets of data. Receiver operating characteristic (ROC) curves are shown for the final model for each of the four assessed techniques <bold>(A,B)</bold>, and the corresponding areas under the curve (AUC) are presented in the figure legend. By AUC on holdout data (0.815), the models built on data filtered to the two weeks before COVID-19 (non)diagnosis perform best (B).</p>
</caption>
<graphic xlink:href="fdata-04-675882-g002.tif"/>
</fig>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Model metrics. Evaluation of four models (LASSO and Elastic-Net with patients&#x2019; condition information from two timeframes) on the training and testing (i.e.,&#x20;holdout) data sets. For each model, the accuracy, F-score, and AUC with 95% CI using DeLong&#x2019;s method (<xref ref-type="bibr" rid="B4">DeLong et&#x20;al., 1988</xref>) are shown. The accuracy metric indicates the proportion of correct predictions. F-score is the harmonic mean of precision and recall. Area under the receiver operating characteristic curve (AUC) is the area under the curve resulting from plotting the true positive rate against the false positive&#x20;rate.</p>
</caption>
<table>
<thead>
<tr>
<th colspan="4" align="center">Training metrics</th>
</tr>
<tr>
<td colspan="2" align="center">All-Time &#x2b; LASSO</td>
<td colspan="2" align="left">All-Time &#x2b; Elastic-Net</td>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Accuracy</td>
<td align="center">0.746</td>
<td align="center">Accuracy</td>
<td align="center">0.755</td>
</tr>
<tr>
<td align="left">F-Score</td>
<td align="center">0.834</td>
<td align="center">F-Score</td>
<td align="center">0.840</td>
</tr>
<tr>
<td align="left">AUC</td>
<td align="center">0.838</td>
<td align="center">AUC</td>
<td align="center">0.840</td>
</tr>
<tr>
<td align="left">95% AUC CI</td>
<td align="center">[0.82 0.86]</td>
<td align="center">95% AUC CI</td>
<td align="center">[0.82 0.86]</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<td colspan="2" align="center">2-Week &#x2b; LASSO</td>
<td colspan="2" align="center">2-Week &#x2b; Elastic-Net</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Accuracy</td>
<td align="center">0.774</td>
<td align="center">Accuracy</td>
<td align="center">0.775</td>
</tr>
<tr>
<td align="left">F-Score</td>
<td align="center">0.847</td>
<td align="center">F-Score</td>
<td align="center">0.848</td>
</tr>
<tr>
<td align="left">AUC</td>
<td align="center">0.848</td>
<td align="center">AUC</td>
<td align="center">0.848</td>
</tr>
<tr>
<td align="left">95% AUC CI</td>
<td align="center">[0.83 0.87]</td>
<td align="center">95% AUC CI</td>
<td align="center">[0.83 0.87]</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<td colspan="4" align="center">Testing Metrics</td>
</tr>
<tr>
<td colspan="2" align="center">All-time &#x2b; LASSO</td>
<td colspan="2" align="center">All-time &#x2b; Elastic-Net</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Accuracy</td>
<td align="center">0.741</td>
<td align="center">Accuracy</td>
<td align="center">0.744</td>
</tr>
<tr>
<td align="left">F-Score</td>
<td align="center">0.832</td>
<td align="center">F-Score</td>
<td align="center">0.834</td>
</tr>
<tr>
<td align="left">AUC</td>
<td align="center">0.796</td>
<td align="center">AUC</td>
<td align="center">0.794</td>
</tr>
<tr>
<td align="left">95% AUC CI</td>
<td align="center">[0.76 0.83]</td>
<td align="center">95% AUC CI</td>
<td align="center">[0.76 0.83]</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<td colspan="2" align="center">2-Week &#x2b; LASSO</td>
<td colspan="2" align="center">2-Week &#x2b; Elastic-Net</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Accuracy</td>
<td align="center">0.753</td>
<td align="center">Accuracy</td>
<td align="center">0.755</td>
</tr>
<tr>
<td align="left">F-Score</td>
<td align="center">0.833</td>
<td align="center">F-Score</td>
<td align="center">0.835</td>
</tr>
<tr>
<td align="left">AUC</td>
<td align="center">0.837</td>
<td align="center">AUC</td>
<td align="center">0.837</td>
</tr>
<tr>
<td align="left">95% AUC CI</td>
<td align="center">[0.81 0.87]</td>
<td align="center">95% AUC CI</td>
<td align="center">[0.81 0.87]</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>After filtering, LASSO, a more stringent regularization method in which many variables are eliminated through shrinkage, retained 30 of the 58 features in the 2-week data (<xref ref-type="sec" rid="s12">Supplemental Table S1</xref>) and 93 of the 212 features in the all-time data (<xref ref-type="sec" rid="s12">Supplemental Table S2</xref>). Within the two weeks before a COVID-19 diagnosis, features that predicted higher risk for this disease were cough (R05), abnormalities of breathing (R06), pain in throat and chest (R07), abnormal findings on diagnostic imaging of lung (R91), respiratory disorder (J98), disorders of fluid, electrolyte and acid-base balance (E87), nicotine dependence (F17), major depressive disorder (F32) and overweight and obesity (E66) (<xref ref-type="sec" rid="s12">Supplemental Table S1</xref>). The LASSO model on all-time data identified features similar to those from the 2-week data, such as cough (R05), but it also delineated other important features related to acute respiratory infections such as fever (R50), pain (R52), acute upper respiratory infections (J06), respiratory failure (J96), respiratory disorder (J98), pneumonia (J18), vasomotor and allergic rhinitis (J30), and other disorders of nose and nasal sinuses (J34). Most notably, the all-time model uncovered several chronic conditions in organ systems other than the respiratory system, including neurological disorders, e.g., postviral fatigue syndrome (G93, R41), kidney diseases (I12, I13, N17), diseases of the heart and circulation including hypertension and kidney failure (I49, I51, J95) and fibrosis/cirrhosis of the liver (K74), suggesting that long-term chronic conditions in other organ systems may increase the risk of contracting an acute respiratory illness such as COVID-19.</p>
<p>Even though LASSO is an effective method for handling collinearity, it may not work well with multicollinearity, where several features are correlated with one another, as observed in our condition data. Considering that LASSO may eliminate important features through its stringent shrinkage process, we also implemented the Elastic-Net regularization method as a less stringent variable selector. This approach retained more features than LASSO, with 43 features retained for the 2-week data and 179 features for the all-time data. All features selected by the LASSO method were also retained by the Elastic-Net method. Several new predictive features emerged from the 2-week data, including primary hypertension (I10) and gastro-esophageal reflux disease (K21). In the all-time data, many conditions distinct from yet similar to those in the LASSO model also appeared, such as acute myocardial infarction (I21), cardiomyopathy (I42), other cardiac arrhythmias (I49), cerebral infarction (I63), complications and ill-defined descriptions of heart disease (I51), peripheral vascular diseases (I73), and other cerebrovascular diseases (I67), pointing to vascular disorders. Other medical conditions also emerged, including viral hepatitis (B19), bacterial infection (B96), thrombocytopenia (D69), and epilepsy and recurrent seizures (G40), although the predictive power of these variables was&#x20;low.</p>
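<p>To illustrate why Elastic-Net tends to retain more features than LASSO at the same overall penalty strength, the sketch below applies the textbook single-coordinate update for a standardized predictor: the unpenalized estimate z is soft-thresholded by the L1 part of the penalty and then shrunk by the L2 part. This is a simplified illustration, not the fitting procedure used in this study; the function names, the penalty strength lam, the mixing parameter alpha, and the coefficient values are assumptions chosen for the example.</p>
<preformat>
```python
def soft_threshold(z: float, gamma: float) -> float:
    """Shrink z toward zero by gamma; values within gamma of zero become 0."""
    if z > gamma:
        return z - gamma
    if -z > gamma:
        return z + gamma
    return 0.0


def penalized_update(z: float, lam: float, alpha: float) -> float:
    """One coordinate update under the penalty
    lam * (alpha * |b| + (1 - alpha) / 2 * b**2).
    alpha = 1 gives LASSO; alpha strictly between 0 and 1 gives Elastic-Net."""
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# Hypothetical unpenalized coefficients for four standardized features
unpenalized = [0.90, 0.25, -0.15, 0.05]
lam = 0.2

lasso = [penalized_update(z, lam, alpha=1.0) for z in unpenalized]
enet = [penalized_update(z, lam, alpha=0.5) for z in unpenalized]

# Number of features kept (non-zero coefficients) by each penalty
n_lasso = sum(1 for b in lasso if b != 0.0)
n_enet = sum(1 for b in enet if b != 0.0)
```
</preformat>
<p>With these illustrative values, the L1 threshold is 0.2 for LASSO and 0.1 for Elastic-Net, so the third feature is set to zero by LASSO but kept, with a small coefficient, by Elastic-Net.</p>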
<p>Among the four candidate models we generated based on the UAB-i2b2 data, the LASSO method on the 2-week filtered data retained the fewest variables while achieving performance similar to that of the more complex models (<xref ref-type="fig" rid="F2">Figures 2</xref>, <xref ref-type="fig" rid="F3">3</xref>; <xref ref-type="table" rid="T2">Table&#x20;2</xref>; <xref ref-type="sec" rid="s12">Supplemental Tables S1&#x2013;S4</xref>). For this reason, we considered it the superior model and selected it for our web application. This interactive web application (<xref ref-type="fig" rid="F4">Figure&#x20;4</xref>) gathers participants&#x2019; questionnaire inputs and generates a risk prediction score for having COVID-19. The Scorecard distribution based on the logistic regression model can be found in <xref ref-type="sec" rid="s12">Supplemental Figure S1</xref>. This tool can be used by individuals to check their risk based on their symptoms or conditions, or by organizations to build questionnaires for COVID-19 screening at building entry. An example questionnaire from our final model is provided in <xref ref-type="table" rid="T3">Table&#x20;3</xref>.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Confusion matrices. Confusion matrices using training <bold>(A&#x2013;D)</bold> and holdout <bold>(E&#x2013;H)</bold> data are shown for the final model for each of the four assessed techniques. Considering that these models are built to recommend COVID-19 testing, we sought to avoid False Negative predictions while being more lenient towards False Positive errors.</p>
</caption>
<graphic xlink:href="fdata-04-675882-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Web application demonstration. Four representative snapshots with different scores from the COVID-19 risk predictor web application are shown. Scores were calculated from participant answers to questions about their symptoms and conditions using the Credit Scorecard method.</p>
</caption>
<graphic xlink:href="fdata-04-675882-g004.tif"/>
</fig>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Example questionnaire. Example questionnaire built from our selected model, the LASSO method on the 2-week filtered UAB-i2b2 data. The base score is 320, and the score increases or decreases based on the answers in the questionnaire. Any score between 450 and 696 is considered high risk for infection. Disclaimer: This questionnaire is intended only as an example output from a model built using our pipeline. It is not itself a diagnostic&#x20;tool.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Questions</th>
<th align="center">Yes</th>
<th align="center">No</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Do you have chronic kidney disease?</td>
<td align="char" char=".">36</td>
<td align="char" char=".">&#x2212;6</td>
</tr>
<tr>
<td align="left">Do you have cough?</td>
<td align="char" char=".">36</td>
<td align="char" char=".">&#x2212;44</td>
</tr>
<tr>
<td align="left">Have you delivered a baby?</td>
<td align="char" char=".">35</td>
<td align="char" char=".">&#x2212;2</td>
</tr>
<tr>
<td align="left">Are you having acute upper respiratory infections?</td>
<td align="char" char=".">30</td>
<td align="char" char=".">&#x2212;6</td>
</tr>
<tr>
<td align="left">Do you have fever?</td>
<td align="char" char=".">24</td>
<td align="char" char=".">&#x2212;5</td>
</tr>
<tr>
<td align="left">Are you having depression, anxiety, problems with cognitive functions or other brain disorders?</td>
<td align="char" char=".">17</td>
<td align="char" char=".">&#x2212;4</td>
</tr>
<tr>
<td align="left">Are you having pneumonia?</td>
<td align="char" char=".">17</td>
<td align="char" char=".">&#x2212;3</td>
</tr>
<tr>
<td align="left">Are you having respiratory failure?</td>
<td align="char" char=".">16</td>
<td align="char" char=".">&#x2212;3</td>
</tr>
<tr>
<td align="left">Are you dependent on nicotine?</td>
<td align="char" char=".">14</td>
<td align="char" char=".">&#x2212;4</td>
</tr>
<tr>
<td align="left">Do you have allergic rhinitis?</td>
<td align="char" char=".">14</td>
<td align="char" char=".">&#x2212;2</td>
</tr>
<tr>
<td align="left">Do you have retention of urine?</td>
<td align="char" char=".">14</td>
<td align="char" char=".">&#x2212;1</td>
</tr>
<tr>
<td align="left">Do you have pain?</td>
<td align="char" char=".">14</td>
<td align="char" char=".">&#x2212;1</td>
</tr>
<tr>
<td align="left">Do you have hernia?</td>
<td align="char" char=".">13</td>
<td align="char" char=".">&#x2212;1</td>
</tr>
<tr>
<td align="left">Do you have liver fibrosis/cirrhosis?</td>
<td align="char" char=".">13</td>
<td align="char" char=".">&#x2212;1</td>
</tr>
<tr>
<td align="left">Do you have disturbances of skin sensation?</td>
<td align="char" char=".">12</td>
<td align="char" char=".">&#x2212;2</td>
</tr>
<tr>
<td align="left">Are you having anemia?</td>
<td align="char" char=".">10</td>
<td align="char" char=".">&#x2212;1</td>
</tr>
<tr>
<td align="left">Are you having bacterial infection?</td>
<td align="char" char=".">9</td>
<td align="char" char=".">&#x2212;1</td>
</tr>
<tr>
<td align="left">Do you have complications from heart disease?</td>
<td align="char" char=".">8</td>
<td align="char" char=".">&#x2212;2</td>
</tr>
<tr>
<td align="left">Do you have hypotension?</td>
<td align="char" char=".">8</td>
<td align="char" char=".">&#x2212;1</td>
</tr>
<tr>
<td align="left">Do you have complications of cardiac and vascular prosthetic devices, implants and grafts?</td>
<td align="char" char=".">6</td>
<td align="char" char=".">0</td>
</tr>
<tr>
<td align="left">Are you vitamin D deficient?</td>
<td align="char" char=".">2</td>
<td align="char" char=".">0</td>
</tr>
<tr>
<td align="left">Do you have cardiac arrhythmias?</td>
<td align="char" char=".">2</td>
<td align="char" char=".">0</td>
</tr>
</tbody>
</table>
</table-wrap>
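<p>The questionnaire in Table 3 can be scored by simple addition. The following sketch, which is not the code of the web application, starts from the base score of 320 given in the caption and adds the Yes or No points for each answered question. Only a subset of the rows of Table 3 is included, and the dictionary keys, the function names, and the decision to let unanswered questions contribute no points are illustrative assumptions.</p>
<preformat>
```python
BASE_SCORE = 320
HIGH_RISK_MIN = 450  # lower bound of the high-risk range quoted in the caption

# (points if "yes", points if "no") for a subset of the rows of Table 3
POINTS = {
    "chronic_kidney_disease": (36, -6),
    "cough": (36, -44),
    "acute_upper_respiratory_infection": (30, -6),
    "fever": (24, -5),
    "pneumonia": (17, -3),
    "nicotine_dependence": (14, -4),
}


def risk_score(answers: dict) -> int:
    """Add the yes/no points of each answered question to the base score."""
    score = BASE_SCORE
    for question, said_yes in answers.items():
        yes_points, no_points = POINTS[question]
        score += yes_points if said_yes else no_points
    return score


def is_high_risk(score: int) -> bool:
    """Compare a score with the lower bound of the high-risk range."""
    return score >= HIGH_RISK_MIN


all_yes = risk_score({q: True for q in POINTS})
all_no = risk_score({q: False for q in POINTS})
print(all_yes, all_no)  # 477 252
```
</preformat>
<p>As the disclaimer in the caption states for the questionnaire itself, such a calculation is only an example of model output and is not a diagnostic tool.</p>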
</sec>
<sec sec-type="discussion" id="s7">
<title>Discussion</title>
<p>In this project, we built a data processing and predictive analytics workflow to predict the risks for COVID-19 diagnosis using patients&#x2019; longitudinal medical conditions encoded by the ICD-9/ICD-10 classification system. We tested the implications of applying different time windows and alternative variable regularization methods to extract the most predictive features from the condition&#x20;data.</p>
<p>Although the all-time data model selected more features, with implications about pre-existing chronic medical conditions increasing the risk of contracting COVID-19, we determined that it was prone to capturing spurious correlations with distant historical data and had weaker performance than the 2-week models (<xref ref-type="fig" rid="F2">Figures 2</xref>, <xref ref-type="fig" rid="F3">3</xref>; <xref ref-type="table" rid="T2">Table&#x20;2</xref>; <xref ref-type="sec" rid="s12">Supplemental Tables S1&#x2013;S4</xref>). With regard to modeling techniques, we found that a more stringent regularized regression approach such as LASSO resulted in simpler models that still achieved performance comparable to that of the more complex models built with the Elastic-Net method (<xref ref-type="fig" rid="F2">Figures 2</xref>, <xref ref-type="fig" rid="F3">3</xref>; <xref ref-type="table" rid="T2">Table&#x20;2</xref>; <xref ref-type="sec" rid="s12">Supplemental Tables S1&#x2013;S4</xref>). As simpler models tend to be more generalizable, more interpretable, and less likely to overfit, we consider the LASSO model using the 2-week data filter the superior model for its parsimony without sacrificing performance. Many COVID-19 risk prediction studies also employed LASSO (<xref ref-type="bibr" rid="B40">Alballa and Al-Turaiki, 2021</xref>), while a few others used Elastic-Net (<xref ref-type="bibr" rid="B43">Heldt et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B44">Hu et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B45">Huang et&#x20;al., 2021</xref>) as feature selection methods. A COVID-19 diagnostic prediction study (<xref ref-type="bibr" rid="B42">Feng et&#x20;al., 2021</xref>) that compared the performance of four feature selection methods, LASSO, Ridge, Decision Tree and AdaBoost, also found that LASSO produced the best performance in both the testing and the validation&#x20;set.</p>
<p>While our workflow focuses on automatically extracting predictive features from ICD-9/ICD-10 codes, the majority of COVID-19 prediction studies selected features from a wide range of additional clinical data components such as chest computed tomography (CT) scan results and laboratory blood tests, which include complete blood count (e.g., leukocyte, erythrocyte, platelet count, and hematocrit), metabolic factors (e.g., glucose, sodium, potassium, creatinine, urea, albumin, and bilirubin), clotting factors (e.g., prothrombin and fibrinogen), and inflammation markers such as C-reactive protein and interleukin 6 (IL-6) (<xref ref-type="bibr" rid="B40">Alballa and Al-Turaiki, 2021</xref>). Furthermore, whereas some studies selected the initial sets of features from EHR data based on expert opinions (<xref ref-type="bibr" rid="B5">Estiri et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B42">Feng et&#x20;al., 2021</xref>; <xref ref-type="bibr" rid="B27">Schwab et&#x20;al., 2021</xref>) and/or literature review (<xref ref-type="bibr" rid="B53">Joshi et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B27">Schwab et&#x20;al., 2021</xref>), we took an unbiased approach, using ICD-9/ICD-10 codes along with demographic information as the initial set of features. Our data wrangling workflow is limited to the data available in the OMOP common data model, which facilitates scaling up the analyses when more data in the same format become available.</p>
<p>Our results showed several COVID-19 predictive features that overlapped with existing published findings. For example, several respiratory symptoms prioritized by our models, such as cough, abnormalities of breathing, and chest pain, particularly within the 2-week timeframe, are well-known symptoms of COVID-19 (<xref ref-type="bibr" rid="B7">Fu et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B12">Huang et&#x20;al., 2020</xref>). Other chronic conditions selected by our models have also been reported to increase COVID-19 risk, such as obesity (<xref ref-type="bibr" rid="B25">Popkin et&#x20;al., 2020</xref>), allergic rhinitis (<xref ref-type="bibr" rid="B34">Yang et&#x20;al., 2020</xref>), cardiovascular diseases (<xref ref-type="bibr" rid="B20">Nishiga et&#x20;al., 2020</xref>) and kidney diseases (<xref ref-type="bibr" rid="B1">Adapa et&#x20;al., 2020</xref>), while there are still ongoing debates about the role of nicotine and smoking in COVID-19 risk (<xref ref-type="bibr" rid="B24">Polosa and Caci, 2020</xref>). Similar to other studies, we found that major depressive disorder is associated with COVID-19 diagnoses. However, it is unclear whether severe mental health problems are the cause, the effect, or the confounding factors with other features associated with COVID-19 (<xref ref-type="bibr" rid="B6">Ettman et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B18">Nami et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B28">Skoda et&#x20;al., 2020</xref>).</p>
<p>A major limitation of our predictive modeling pipeline is that our model is based entirely on correlations between medical conditions and COVID-19 testing/diagnosis. Therefore, by design, this workflow cannot establish causal relationships. For example, there are several medical conditions associated with lower risks for COVID-19 (<xref ref-type="sec" rid="s12">Supplemental Tables S1&#x2013;S4</xref>) which may highlight distinct features in our negative cohort but may not directly affect COVID-19 risks. This problem, however, is inevitable in predictive analytic workflows that derive inferences from retrospective data. Similar to all studies that apply machine learning methods to model COVID-19 diagnosis, our classifier is prone to imbalanced class distribution, where the positive COVID-19 instances are underrepresented in the training data (<xref ref-type="bibr" rid="B40">Alballa and Al-Turaiki, 2021</xref>). However, we addressed this class imbalance issue by weighting each observation inversely proportional to the size of its class (see the Methods <italic>Variable (Feature) Selections</italic>). Finally, we chose a generalized linear model approach in which we assume linear relationships on a logistic scale between medical conditions and COVID-19 risks; consequently, potential non-linear relationships are not considered.</p>
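<p>Weighting observations inversely to class size can be sketched as follows. The formula n_samples / (n_classes times class count) is one common choice, often called balanced weighting, and is an assumption here because the exact formula is not restated in this section. The counts are those of the full cohort in Table 1 and are used only for illustration; in practice the weights would be derived from the training split. The function name is hypothetical.</p>
<preformat>
```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# 1 = COVID-19 positive, 0 = negative (cohort sizes from Table 1)
labels = [1] * 912 + [0] * 6350
weights = balanced_class_weights(labels)

# After weighting, both classes carry the same total weight
total_pos = weights[1] * 912
total_neg = weights[0] * 6350
```
</preformat>
<p>Under this scheme each positive observation counts roughly seven times as much as a negative one, so that errors on the minority class are not drowned out during fitting.</p>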
<p>Although our workflow is straightforward to implement, there are substantial trade-offs in using the ICD-9/ICD-10 standard vocabulary system as opposed to alternative text mining approaches to extract medical conditions from EHR data. ICD code accuracy is a major problem, with classification error rates as high as 80% in some cases (<xref ref-type="bibr" rid="B21">O&#x27;Malley et&#x20;al., 2005</xref>). The sources of these errors are wide-ranging, including poor communication between patients and providers, clinicians&#x2019; mistakes or biases, transcription/scanning errors, coders&#x2019; experience, and intentional or unintentional biases (e.g., upcoding and unbundling for higher billing/reimbursement value) (<xref ref-type="bibr" rid="B21">O&#x27;Malley et&#x20;al., 2005</xref>). Inconsistent, incomplete, systemic and random errors in ICD coding (<xref ref-type="bibr" rid="B3">Cox et&#x20;al., 2009</xref>) introduce noise in the dataset, which is another limitation of our workflow.</p>
<p>Despite these inherent limitations, our study shows the promising utility of incorporating the ICD-10 system in an unbiased manner for novel inferences from EHR data, particularly to study medical symptoms and conditions that influence the risk of COVID-19. Future studies can consider incorporating other standard vocabularies available in EHR data, such as Systematized Nomenclature of Medicine (SNOMED), Current Procedural Terminology (CPT), and Logical Observation Identifiers Names and Codes (LOINC), as well as adding datasets such as patients&#x2019; medication use, to further understand the risks and the long-term consequences of COVID-19.</p>
</sec>
</body>
<back>
<sec id="s8">
<title>Data Availability Statement</title>
<p>The data analyzed in this study is subject to the following licenses/restrictions: All restrictions of the Limited Data Set (LDS) from the UAB i2b2 system apply to this dataset. Requests to access these datasets should be directed to <ext-link ext-link-type="uri" xlink:href="https://www.uab.edu/ccts/research-commons/berd/55-research-commons/informatics/325-i2b2">https://www.uab.edu/ccts/research-commons/berd/55-research-commons/informatics/325-i2b2</ext-link>.</p>
</sec>
<sec id="s9">
<title>Ethics Statement</title>
<p>Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.</p>
</sec>
<sec id="s10">
<title>Author Contributions</title>
<p>All authors listed have made direct and substantial contribution to the article and approved the submission of this article.</p>
</sec>
<sec sec-type="COI-statement" id="s11">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<ack>
<p>This COVID-19 risk prediction project was initiated and partially executed during a 3-day Hackathon event at UAB. We thank the UAB Informatics Institute for providing data management support with U-BRITE (<ext-link ext-link-type="uri" xlink:href="http://www.ubrite.org/">http://www.ubrite.org/</ext-link>), which made this work possible. We also thank UAB IT Research Computing, which maintains the Cheaha Supercomputer resources, supported in part by the National Science Foundation under Grant No. OAC-1541310, the University of Alabama at Birmingham, and the Alabama Innovation Fund. The authors also sincerely thank Jelai Wang and Matt Wyatt for providing access to the UAB N3C dataset and Ryan C. Godwin for his encouragement regarding the viability of our pipeline and models.</p>
</ack>
<sec id="s12">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fdata.2021.675882/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fdata.2021.675882/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="datasheet1.docx" id="SM1" mimetype="application/docx" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Adapa</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chenna</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Balla</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Merugu</surname>
<given-names>G. P.</given-names>
</name>
<name>
<surname>Koduri</surname>
<given-names>N. M.</given-names>
</name>
<name>
<surname>Daggubati</surname>
<given-names>S. R.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>COVID-19 Pandemic Causing Acute Kidney Injury and Impact on Patients with Chronic Kidney Disease and Renal Transplantation</article-title>. <source>J.&#x20;Clin. Med. Res.</source> <volume>12</volume> (<issue>6</issue>), <fpage>352</fpage>&#x2013;<lpage>361</lpage>. <pub-id pub-id-type="doi">10.14740/jocmr4200</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alballa</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Al-Turaiki</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Machine Learning Approaches in COVID-19 Diagnosis, Mortality, and Severity Risk Prediction: A Review</article-title>. <source>Inform. Med. Unlocked</source> <volume>24</volume>, <fpage>100564</fpage>. <pub-id pub-id-type="doi">10.1016/j.imu.2021.100564</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bailey</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2006</year>). <source>Practical Credit Scoring: Issues and Techniques</source>. <publisher-loc>Bristol, United Kingdom</publisher-loc>: <publisher-name>White Box Publishing</publisher-name>.</citation>
</ref>
<ref id="B53">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bishop</surname>
<given-names>C. M.</given-names>
</name>
</person-group> (<year>2016</year>). <source>Pattern Recognition and Machine Learning</source>.
<publisher-name>Springer</publisher-name>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blacketer</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Chapter 4. The Common Data Model [Online]</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://ohdsi.github.io/TheBookOfOhdsi/CommonDataModel.html">https://ohdsi.github.io/TheBookOfOhdsi/CommonDataModel.html</ext-link>
</comment>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bowman</surname>
<given-names>S. E.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Coordination of SNOMED-CT and ICD-10: Getting the Most out of Electronic Health Record Systems</article-title>. <source>Perspectives in Health Information Management</source> [Online]. <comment>
<ext-link ext-link-type="uri" xlink:href="http://library.ahima.org/doc?oid=106578#.YDXOMGNMEXx">http://library.ahima.org/doc?oid&#x003D;106578#.YDXOMGNMEXx</ext-link>
</comment>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cox</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Martin</surname>
<given-names>B. C.</given-names>
</name>
<name>
<surname>Van Staa</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Garbe</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Siebert</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>M. L.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Good Research Practices for Comparative Effectiveness Research: Approaches to Mitigate Bias and Confounding in the Design of Nonrandomized Studies of Treatment Effects Using Secondary Data Sources: The International Society for Pharmacoeconomics and Outcomes Research Good Research Practices for Retrospective Database Analysis Task Force Report-Part II</article-title>. <source>Value in Health</source> <volume>12</volume> (<issue>8</issue>), <fpage>1053</fpage>&#x2013;<lpage>1061</lpage>. <pub-id pub-id-type="doi">10.1111/j.1524-4733.2009.00601.x</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dagliati</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Malovini</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Tibollo</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Bellazzi</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Health Informatics and EHR to Support Clinical Research in the COVID-19 Pandemic: An Overview</article-title>. <source>Brief. Bioinform.</source> <volume>22</volume> (<issue>2</issue>), <fpage>812</fpage>&#x2013;<lpage>822</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbaa418</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>DeLong</surname>
<given-names>E. R.</given-names>
</name>
<name>
<surname>DeLong</surname>
<given-names>D. M.</given-names>
</name>
<name>
<surname>Clarke-Pearson</surname>
<given-names>D. L.</given-names>
</name>
</person-group> (<year>1988</year>). <article-title>Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: a Nonparametric Approach</article-title>. <source>Biometrics</source> <volume>44</volume> (<issue>3</issue>), <fpage>837</fpage>&#x2013;<lpage>845</lpage>. <pub-id pub-id-type="doi">10.2307/2531595</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Estiri</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Strasser</surname>
<given-names>Z. H.</given-names>
</name>
<name>
<surname>Klann</surname>
<given-names>J.&#x20;G.</given-names>
</name>
<name>
<surname>Naseri</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wagholikar</surname>
<given-names>K. B.</given-names>
</name>
<name>
<surname>Murphy</surname>
<given-names>S. N.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Predicting COVID-19 Mortality with Electronic Medical Records</article-title>. <source>Npj&#x20;Digit. Med.</source> <volume>4</volume> (<issue>1</issue>), <fpage>15</fpage>. <pub-id pub-id-type="doi">10.1038/s41746-021-00383-x</pub-id> </citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Filho</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>de Moraes Batista</surname>
<given-names>A. F.</given-names>
</name>
<name>
<surname>dos Santos</surname>
<given-names>H. G.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Data Leakage in Health Outcomes Prediction With Machine Learning. Comment on &#x201C;Prediction of Incident Hypertension Within the Next Year: Prospective Study Using Statewide Electronic Health Records and Machine Learning&#x201D;</article-title>. <source>J. Med. Internet Res.</source> <volume>23</volume>, <fpage>1</fpage>&#x2013;<lpage>3</lpage>. <pub-id pub-id-type="doi">10.2196/10969</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ettman</surname>
<given-names>C. K.</given-names>
</name>
<name>
<surname>Abdalla</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Cohen</surname>
<given-names>G. H.</given-names>
</name>
<name>
<surname>Sampson</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Vivier</surname>
<given-names>P. M.</given-names>
</name>
<name>
<surname>Galea</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Prevalence of Depression Symptoms in US Adults before and during the COVID-19 Pandemic</article-title>. <source>JAMA Netw. Open</source> <volume>3</volume> (<issue>9</issue>), <fpage>e2019686</fpage>. <pub-id pub-id-type="doi">10.1001/jamanetworkopen.2020.19686</pub-id> </citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Feng</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>A Novel Artificial Intelligence-Assisted Triage Tool to Aid in the Diagnosis of Suspected COVID-19 Pneumonia Cases in Fever Clinics</article-title>. <source>Ann. Transl. Med.</source> <volume>9</volume> (<issue>3</issue>), <fpage>201</fpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Fitzpatrick</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Clinical Characteristics of Coronavirus Disease 2019 (COVID-19) in China: A Systematic Review and Meta-Analysis</article-title>. <source>J.&#x20;Infect.</source> <volume>80</volume> (<issue>6</issue>), <fpage>656</fpage>&#x2013;<lpage>665</lpage>. <pub-id pub-id-type="doi">10.1016/j.jinf.2020.03.041</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gong</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ou</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Qiu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Jie</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>A Tool for Early Prediction of Severe Coronavirus Disease 2019 (COVID-19): A Multicenter Study Using the Risk Nomogram in Wuhan and Guangdong, China</article-title>. <source>Clin. Infect. Dis.</source> <volume>71</volume> (<issue>15</issue>), <fpage>833</fpage>&#x2013;<lpage>840</lpage>. <pub-id pub-id-type="doi">10.1093/cid/ciaa443</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Halalau</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Imam</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Karabon</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Mankuzhy</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Shaheen</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Tu</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>External Validation of a Clinical Risk Score to Predict Hospital Admission and In-Hospital Mortality in COVID-19 Patients</article-title>. <source>Ann. Med.</source> <volume>53</volume> (<issue>1</issue>), <fpage>78</fpage>&#x2013;<lpage>86</lpage>. <pub-id pub-id-type="doi">10.1080/07853890.2020.1828616</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hanratty</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2019</year>). <source>ICD9CMtoICD10CM [Online]</source> (<comment>Accessed March 2, 2021</comment>).</citation>
</ref>
<ref id="B11">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hastie</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tibshirani</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2001</year>). <source>The Elements of Statistical Learning</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name>.
</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heldt</surname>
<given-names>F. S.</given-names>
</name>
<name>
<surname>Vizcaychipi</surname>
<given-names>M. P.</given-names>
</name>
<name>
<surname>Peacock</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Cinelli</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>McLachlan</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Andreotti</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Early Risk Assessment for COVID-19 Patients From Emergency Department Data Using Machine Learning</article-title>. <source>Sci. Rep.</source> <volume>11</volume> (<issue>1</issue>), <fpage>4200</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-021-83784-y</pub-id>
</citation>
</ref>
<ref id="B51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hothorn</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Hornik</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zeileis</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Unbiased Recursive Partitioning: A Conditional Inference Framework</article-title>. <source>J. Comput. Graphic. Stat.</source> <volume>15</volume> (<issue>3</issue>), <fpage>651</fpage>&#x2013;<lpage>674</lpage>. <pub-id pub-id-type="doi">10.1198/106186006X133933</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Early Prediction of Mortality Risk Among Patients With Severe COVID-19, Using Machine Learning</article-title>. <source>Int. J. Epidemiol.</source> <volume>49</volume> (<issue>6</issue>), <fpage>1918</fpage>&#x2013;<lpage>1929</lpage>. <pub-id pub-id-type="doi">10.1093/ije/dyaa171</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Radenkovic</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Perez</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Nadeau</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Verdin</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Furman</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Modeling Predictive Age-Dependent and Age-Independent Symptoms and Comorbidities of Patients Seeking Treatment for COVID-19: Model Development and Validation Study</article-title>. <source>J. Med. Internet Res.</source> <volume>23</volume> (<issue>3</issue>), <fpage>e25696</fpage>. <pub-id pub-id-type="doi">10.2196/25696</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Clinical Features of Patients Infected with 2019 Novel Coronavirus in Wuhan, China</article-title>. <source>Lancet</source> <volume>395</volume> (<issue>10223</issue>), <fpage>497</fpage>&#x2013;<lpage>506</lpage>. <pub-id pub-id-type="doi">10.1016/S0140-6736(20)30183-5</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jehi</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Milinovich</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Erzurum</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Merlino</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gordon</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2020a</year>). <article-title>Development and Validation of a Model for Individualized Prediction of Hospitalization Risk in 4,536 Patients with COVID-19</article-title>. <source>PLoS One</source> <volume>15</volume> (<issue>8</issue>), <fpage>e0237419</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0237419</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jehi</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Milinovich</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Erzurum</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Rubin</surname>
<given-names>B. P.</given-names>
</name>
<name>
<surname>Gordon</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2020b</year>). <article-title>Individualizing Risk Prediction for Positive Coronavirus Disease 2019 Testing</article-title>. <source>Chest</source> <volume>158</volume> (<issue>4</issue>), <fpage>1364</fpage>&#x2013;<lpage>1375</lpage>. <pub-id pub-id-type="doi">10.1016/j.chest.2020.05.580</pub-id> </citation>
</ref>
<ref id="B54">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Joshi</surname>
<given-names>R. P.</given-names>
</name>
<name>
<surname>Pejaver</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Hammarlund</surname>
<given-names>N. E.</given-names>
</name>
<name>
<surname>Sung</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>S. K.</given-names>
</name>
<name>
<surname>Furmanchuk</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>A Predictive Tool for Identification of SARS-CoV-2 PCR-Negative Emergency Department Patients Using Routine Test Results</article-title>. <source>J. Clin. Virol.</source> <volume>129</volume>, <fpage>104502</fpage>. <pub-id pub-id-type="doi">10.1016/j.jcv.2020.104502</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kohavi</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Brodley</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Frasca</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>KDD-Cup 2000 Organizers&#x2019; Report: Peeling the Onion</article-title>. <source>ACM SIGKDD Explorations Newsletter</source> <volume>2</volume> (<issue>2</issue>), <fpage>86</fpage>&#x2013;<lpage>98</lpage>. <pub-id pub-id-type="doi">10.1145/380995.381033</pub-id>
</citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kaufman</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Rosset</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Perlich</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Stitelman</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Leakage in Data Mining: Formulation, Detection, and Avoidance</article-title>. <source>ACM Trans. Knowl. Discov. Data</source> <volume>6</volume>, <fpage>556</fpage>&#x2013;<lpage>563</lpage>. <pub-id pub-id-type="doi">10.1145/2382577.2382579</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kullar</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Marcelin</surname>
<given-names>J.&#x20;R.</given-names>
</name>
<name>
<surname>Swartz</surname>
<given-names>T. H.</given-names>
</name>
<name>
<surname>Piggott</surname>
<given-names>D. A.</given-names>
</name>
<name>
<surname>Macias Gil</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Mathew</surname>
<given-names>T. A.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Racial Disparity of Coronavirus Disease 2019 in African American Communities</article-title>. <source>J.&#x20;Infect. Dis.</source> <volume>222</volume> (<issue>6</issue>), <fpage>890</fpage>&#x2013;<lpage>893</lpage>. <pub-id pub-id-type="doi">10.1093/infdis/jiaa372</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ou</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients with COVID-19</article-title>. <source>JAMA Intern. Med.</source> <volume>180</volume> (<issue>8</issue>), <fpage>1081</fpage>&#x2013;<lpage>1089</lpage>. <pub-id pub-id-type="doi">10.1001/jamainternmed.2020.2033</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Nie</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Comorbid Chronic Diseases Are Strongly Correlated with Disease Severity Among COVID-19 Patients: A Systematic Review and Meta-Analysis</article-title>. <source>Aging Dis.</source> <volume>11</volume> (<issue>3</issue>), <fpage>668</fpage>&#x2013;<lpage>678</lpage>. <pub-id pub-id-type="doi">10.14336/AD.2020.0502</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mitchell</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>1997</year>). <source>Machine Learning</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>McGraw Hill</publisher-name>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nami</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gadad</surname>
<given-names>B. S.</given-names>
</name>
<name>
<surname>Chong</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ghumman</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Misra</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gadad</surname>
<given-names>S. S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>The Interrelation of Neurological and Psychological Symptoms of COVID-19: Risks and Remedies</article-title>. <source>J.&#x20;Clin. Med.</source> <volume>9</volume> (<issue>8</issue>), <fpage>2624</fpage>. <pub-id pub-id-type="doi">10.3390/jcm9082624</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="book">
<collab>NCATS</collab> (<year>2020</year>). <source>COVID-19 Clinical Data Warehouse Data Dictionary Based on OMOP Common Data Model Specifications</source>. <comment>Version 5.3.1</comment>
</citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nisbet</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Elder</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Miner</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2009</year>). <source>Handbook of Statistical Analysis and Data Mining Applications</source>.
<publisher-name>Academic Press</publisher-name>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nishiga</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>D. W.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>D. B.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>J.&#x20;C.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>COVID-19 and Cardiovascular Disease: from Basic Mechanisms to Clinical Perspectives</article-title>. <source>Nat. Rev. Cardiol.</source> <volume>17</volume> (<issue>9</issue>), <fpage>543</fpage>&#x2013;<lpage>558</lpage>. <pub-id pub-id-type="doi">10.1038/s41569-020-0413-9</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>O&#x27;Malley</surname>
<given-names>K. J.</given-names>
</name>
<name>
<surname>Cook</surname>
<given-names>K. F.</given-names>
</name>
<name>
<surname>Price</surname>
<given-names>M. D.</given-names>
</name>
<name>
<surname>Wildes</surname>
<given-names>K. R.</given-names>
</name>
<name>
<surname>Hurdle</surname>
<given-names>J.&#x20;F.</given-names>
</name>
<name>
<surname>Ashton</surname>
<given-names>C. M.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Measuring Diagnoses: ICD Code Accuracy</article-title>. <source>Health Serv. Res.</source> <volume>40</volume> (<issue>5 Pt 2</issue>), <fpage>1620</fpage>&#x2013;<lpage>1639</lpage>. <pub-id pub-id-type="doi">10.1111/j.1475-6773.2005.00444.x</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oetjens</surname>
<given-names>M. T.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>J.&#x20;Z.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Leader</surname>
<given-names>J.&#x20;B.</given-names>
</name>
<name>
<surname>Hartzel</surname>
<given-names>D. N.</given-names>
</name>
<name>
<surname>Moore</surname>
<given-names>B. S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Electronic Health Record Analysis Identifies Kidney Disease as the Leading Risk Factor for Hospitalization in Confirmed COVID-19 Patients</article-title>. <source>PLoS One</source> <volume>15</volume> (<issue>11</issue>), <fpage>e0242182</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0242182</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Osborne</surname>
<given-names>T. F.</given-names>
</name>
<name>
<surname>Veigulis</surname>
<given-names>Z. P.</given-names>
</name>
<name>
<surname>Arreola</surname>
<given-names>D. M.</given-names>
</name>
<name>
<surname>R&#xf6;&#xf6;sli</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Curtin</surname>
<given-names>C. M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Automated EHR Score to Predict COVID-19 Outcomes at US Department of Veterans Affairs</article-title>. <source>PLoS One</source> <volume>15</volume> (<issue>7</issue>), <fpage>e0236554</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0236554</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Polosa</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Caci</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>COVID-19: Counter-intuitive Data on Smoking Prevalence and Therapeutic Implications for Nicotine</article-title>. <source>Intern. Emerg. Med.</source> <volume>15</volume> (<issue>5</issue>), <fpage>853</fpage>&#x2013;<lpage>856</lpage>. <pub-id pub-id-type="doi">10.1007/s11739-020-02361-9</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Popkin</surname>
<given-names>B. M.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>W. D.</given-names>
</name>
<name>
<surname>Beck</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Algaith</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Herbst</surname>
<given-names>C. H.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Individuals with Obesity and COVID&#x2010;19: A Global Perspective on the Epidemiology and Biological Relationships</article-title>. <source>Obes. Rev.</source> <volume>21</volume> (<issue>11</issue>), <fpage>e13128</fpage>. <pub-id pub-id-type="doi">10.1111/obr.13128</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rashedi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Mahdavi Poor</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Asgharzadeh</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Pourostadi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Samadi Kafil</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Vegari</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Risk Factors for COVID-19</article-title>. <source>Infez Med.</source> <volume>28</volume> (<issue>4</issue>), <fpage>469</fpage>&#x2013;<lpage>474</lpage>. </citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schwab</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Mehrjou</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Parbhoo</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Celi</surname>
<given-names>L. A.</given-names>
</name>
<name>
<surname>Hetzel</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hofer</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Real-time Prediction of COVID-19 Related Mortality Using Electronic Health Records</article-title>. <source>Nat. Commun.</source> <volume>12</volume> (<issue>1</issue>), <fpage>1058</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-020-20816-7</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Skoda</surname>
<given-names>E.-M.</given-names>
</name>
<name>
<surname>B&#xe4;uerle</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Schweda</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>D&#xf6;rrie</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Musche</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Hetkamp</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Severely Increased Generalized Anxiety, but Not COVID-19-Related Fear in Individuals with Mental Illnesses: A Population Based Cross-Sectional Study in Germany</article-title>. <source>Int. J.&#x20;Soc. Psychiatry</source>, <fpage>20764020960773</fpage>. <pub-id pub-id-type="doi">10.1177/0020764020960773</pub-id> </citation>
</ref>
<ref id="B52">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Szepannek</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>An Overview on the Landscape of R Packages for Credit Scoring</article-title>. <source>arXiv XX</source>, <fpage>1</fpage>&#x2013;<lpage>25</lpage>. </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tibshirani</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>1996</year>). <article-title>Regression Shrinkage and Selection via the Lasso</article-title>. <source>J.&#x20;R. Stat. Soc. Ser. B (Methodological)</source> <volume>58</volume> (<issue>1</issue>), <fpage>267</fpage>&#x2013;<lpage>288</lpage>. <pub-id pub-id-type="doi">10.1111/j.2517-6161.1996.tb02080.x</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vaid</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Somani</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Russak</surname>
<given-names>A. J.</given-names>
</name>
<name>
<surname>De Freitas</surname>
<given-names>J.&#x20;K.</given-names>
</name>
<name>
<surname>Chaudhry</surname>
<given-names>F. F.</given-names>
</name>
<name>
<surname>Paranjpe</surname>
<given-names>I.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Machine Learning to Predict Mortality and Critical Events in a Cohort of Patients with COVID-19 in New York City: Model Development and Validation</article-title>. <source>J.&#x20;Med. Internet Res.</source> <volume>22</volume> (<issue>11</issue>), <fpage>e24018</fpage>. <pub-id pub-id-type="doi">10.2196/24018</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>P. B.</given-names>
</name>
<name>
<surname>Gurney</surname>
<given-names>M. E.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2021a</year>). <article-title>COVID&#x2010;19 and Dementia: Analyses of Risk, Disparity, and Outcomes from Electronic Health Records in the US</article-title>. <source>Alzheimer&#x27;s Demen.</source> <pub-id pub-id-type="doi">10.1002/alz.12296</pub-id> </citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>P. B.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2021b</year>). <article-title>COVID-19 Risk, Disparities and Outcomes in Patients with Chronic Liver Disease in the United&#x20;States</article-title>. <source>EClinicalMedicine</source> <volume>31</volume>, <fpage>100688</fpage>. <pub-id pub-id-type="doi">10.1016/j.eclinm.2020.100688</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wynants</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Van Calster</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Collins</surname>
<given-names>G. S.</given-names>
</name>
<name>
<surname>Riley</surname>
<given-names>R. D.</given-names>
</name>
<name>
<surname>Heinze</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Schuit</surname>
<given-names>E.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Prediction Models for Diagnosis and Prognosis of Covid-19: Systematic Review and Critical Appraisal</article-title>. <source>BMJ</source> <volume>369</volume>, <fpage>m1328</fpage>. <pub-id pub-id-type="doi">10.1136/bmj.m1328</pub-id> </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>J.&#x20;M.</given-names>
</name>
<name>
<surname>Koh</surname>
<given-names>H. Y.</given-names>
</name>
<name>
<surname>Moon</surname>
<given-names>S. Y.</given-names>
</name>
<name>
<surname>Yoo</surname>
<given-names>I. K.</given-names>
</name>
<name>
<surname>Ha</surname>
<given-names>E. K.</given-names>
</name>
<name>
<surname>You</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Allergic Disorders and Susceptibility to and Severity of COVID-19: A Nationwide Cohort Study</article-title>. <source>J.&#x20;Allergy Clin. Immunol.</source> <volume>146</volume> (<issue>4</issue>), <fpage>790</fpage>&#x2013;<lpage>798</lpage>. <pub-id pub-id-type="doi">10.1016/j.jaci.2020.08.008</pub-id> </citation>
</ref>
<ref id="B50">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zdravevski</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Lameski</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Kulakov</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Weight of Evidence as a Tool for Attribute Transformation in the Preprocessing Stage of Supervised Learning Algorithms</article-title>, in <source>The 2011 International Joint Conference on Neural Networks</source>, <fpage>181</fpage>&#x2013;<lpage>188</lpage>. </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hou</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Graham</surname>
<given-names>J.&#x20;M.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Richman</surname>
<given-names>P. S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Prediction Model and Risk Scores of ICU Admission and Mortality in COVID-19</article-title>. <source>PLoS One</source> <volume>15</volume> (<issue>7</issue>), <fpage>e0236618</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0236618</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zou</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Hastie</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Regularization and Variable Selection via the Elastic Net</article-title>. <source>J.&#x20;R. Stat. Soc. B</source> <volume>67</volume> (<issue>2</issue>), <fpage>301</fpage>&#x2013;<lpage>320</lpage>. <pub-id pub-id-type="doi">10.1111/j.1467-9868.2005.00503.x</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>