<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Chem.</journal-id>
<journal-title>Frontiers in Chemistry</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Chem.</abbrev-journal-title>
<issn pub-type="epub">2296-2646</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1239467</article-id>
<article-id pub-id-type="doi">10.3389/fchem.2023.1239467</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Chemistry</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>FP-MAP: an extensive library of fingerprint-based molecular activity prediction tools</article-title>
<alt-title alt-title-type="left-running-head">Venkatraman</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fchem.2023.1239467">10.3389/fchem.2023.1239467</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Venkatraman</surname>
<given-names>Vishwesh</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/719321/overview"/>
</contrib>
</contrib-group>
<aff>
<institution>Department of Chemistry</institution>, <institution>Norwegian University of Science and Technology</institution>, <addr-line>Trondheim</addr-line>, <country>Norway</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1474248/overview">Ganna Gryn&#x2019;ova</ext-link>, Heidelberg University, Germany</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1262524/overview">Xiaohua Zhang</ext-link>, Lawrence Livermore National Laboratory (DOE), United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1960083/overview">Maryam Salahinejad</ext-link>, Nuclear Science and Technology Research Institute (NSTRI), Iran</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Vishwesh Venkatraman, <email>vishwesh.venkatraman@ntnu.no</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>15</day>
<month>08</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>11</volume>
<elocation-id>1239467</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>06</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>07</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Venkatraman.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Venkatraman</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Discovering new drugs for disease treatment is challenging, requiring a multidisciplinary effort as well as time and resources. With a view to improving hit discovery and lead compound identification, machine learning (ML) approaches are increasingly being used in the decision-making process. Although a number of ML-based studies have been published, most report only fragments of the wider range of bioactivities, with each model typically focusing on a particular disease. This study introduces FP-MAP, an extensive atlas of fingerprint-based prediction models that covers a diverse range of activities including neglected tropical diseases (caused by viral, bacterial and parasitic pathogens) as well as other targets implicated in diseases such as Alzheimer&#x2019;s. To arrive at the best predictive models, the performance of &#x2248;4,000 classification/regression models was evaluated on different bioactivity data sets using 12 different molecular fingerprints. The best-performing models, which achieved test set AUC values of 0.62&#x2013;0.99, have been integrated into an easy-to-use graphical user interface that can be downloaded from <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/vishsoft/fpmap">https://gitlab.com/vishsoft/fpmap</ext-link>.</p>
</abstract>
<kwd-group>
<kwd>fingerprints</kwd>
<kwd>random forests</kwd>
<kwd>neglected diseases</kwd>
<kwd>classification</kwd>
<kwd>regression</kwd>
<kwd>graph neural networks</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Theoretical and Computational Chemistry</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Development of therapeutic drugs is an expensive affair with expected costs ranging from $1 billion to more than $2 billion (<xref ref-type="bibr" rid="B64">Schlander et al., 2021</xref>) depending on the therapeutic area and disease complexity. The molecular universe is very large, with some estimates placing the number of drug-like molecules at over 10<sup>60</sup> (<xref ref-type="bibr" rid="B55">Reymond and Awale, 2012</xref>). There now exist virtual databases such as SAVI (<xref ref-type="bibr" rid="B49">Patel et al., 2020</xref>), ZINC (<xref ref-type="bibr" rid="B29">Irwin et al., 2020</xref>), ENAMINE (<xref ref-type="bibr" rid="B63">Sadybekov et al., 2021</xref>) and the GDB (<xref ref-type="bibr" rid="B55">Reymond and Awale, 2012</xref>) that contain hundreds of millions to billions of diverse molecules and can be queried to find novel molecules of interest. Since making and testing all the interesting compounds is out of the question, there is a need to weed out molecules that are not relevant to drug discovery, i.e., to exclude those that exhibit less than acceptable biological activity. However, despite recent efforts (<xref ref-type="bibr" rid="B25">Gorgulla et al., 2020</xref>; <xref ref-type="bibr" rid="B7">Bender et al., 2021</xref>; <xref ref-type="bibr" rid="B24">Glaser et al., 2021</xref>; <xref ref-type="bibr" rid="B23">Gentile et al., 2022</xref>; <xref ref-type="bibr" rid="B42">Luttens et al., 2022</xref>), reliable simulation methods for large-scale activity prediction remain elusive.</p>
<p>To circumvent some of the challenges, machine learning (ML) approaches are being increasingly used for the prediction of biological activities (<xref ref-type="bibr" rid="B17">Cova and Pais, 2019</xref>; <xref ref-type="bibr" rid="B39">Lane et al., 2020</xref>; <xref ref-type="bibr" rid="B21">Elbadawi et al., 2021</xref>). Here, a wide variety of ML algorithms are trained to identify quantitative structure-activity relationships (<xref ref-type="bibr" rid="B82">Wu et al., 2020</xref>; <xref ref-type="bibr" rid="B50">Pillai et al., 2022</xref>), and the resulting predictions are used to select the next screening subset, thereby facilitating more efficient use of time and resources (<xref ref-type="bibr" rid="B19">Dreiman et al., 2021</xref>; <xref ref-type="bibr" rid="B26">Graff et al., 2021</xref>). Key to the success of the models are the quality and amount of data, the molecular representation and the ML method. Although annotated data remain limited, public databases such as ChEMBL (<xref ref-type="bibr" rid="B22">Gaulton et al., 2016</xref>) and concerted efforts to make data open access (<xref ref-type="bibr" rid="B13">Capuzzi et al., 2017</xref>; <xref ref-type="bibr" rid="B81">Wu et al., 2018</xref>; <xref ref-type="bibr" rid="B32">Kexin Huang, 2020</xref>) have spawned a number of machine learning projects (<xref ref-type="bibr" rid="B44">Mayr et al., 2018</xref>; <xref ref-type="bibr" rid="B39">Lane et al., 2020</xref>). Molecular representation plays a crucial role in machine learning and is problem-specific (<xref ref-type="bibr" rid="B18">David et al., 2020</xref>; <xref ref-type="bibr" rid="B54">Raghunathan and Priyakumar, 2021</xref>), with popular choices being fingerprints (bit strings indicating the absence/presence of features), molecular graphs (networks of nodes and edges) and molecular embeddings (<xref ref-type="bibr" rid="B30">Jaeger et al., 2018</xref>). 
While a wide array of ML algorithms have been employed, there is no clear winner, although ensemble learning has been shown to yield good results across many data sets (<xref ref-type="bibr" rid="B82">Wu et al., 2020</xref>; <xref ref-type="bibr" rid="B62">Sabando et al., 2021</xref>).</p>
<p>To help researchers ease their way into drug discovery and carry out screening experiments, automated ML platforms and web-based tools have gained significant traction in recent years (<xref ref-type="bibr" rid="B41">Liu et al., 2019</xref>; <xref ref-type="bibr" rid="B67">Singh et al., 2020</xref>; <xref ref-type="bibr" rid="B72">Togo et al., 2022</xref>). While a great number of software and web tools are devoted to physicochemical properties, ADMET and ADMET-related filtering (<xref ref-type="bibr" rid="B75">Venkatraman, 2021</xref>; <xref ref-type="bibr" rid="B83">Xiong et al., 2021</xref>), prediction software covering a broad range of biological activities is comparatively scarce (<xref ref-type="bibr" rid="B65">Scotti et al., 2022</xref>). In many cases, the prediction tools are limited to a single disease or class and largely operate as online prediction services that are not easily amenable to large-scale screening (see <xref ref-type="table" rid="T1">Table 1</xref> for a short summary of recently published software tools that provide online prediction services). Furthermore, in spite of the large number of published models, only a few are publicly accessible, while many are part of proprietary collections (<xref ref-type="bibr" rid="B43">Ma et al., 2015</xref>; <xref ref-type="bibr" rid="B3">Aleksi&#x107; et al., 2021</xref>). The number of cheminformatics web services and software packages for bioactivity prediction is nevertheless growing (<xref ref-type="bibr" rid="B60">Ruusmann et al., 2015</xref>), and many services such as VCCLab (<xref ref-type="bibr" rid="B70">Tetko et al., 2005</xref>) and DPubChem (<xref ref-type="bibr" rid="B61">Soufan et al., 2018</xref>) offer a platform for calculating a comprehensive series of molecular properties and for data analysis. 
Other services such as AssayCentral (<ext-link ext-link-type="uri" xlink:href="http://www.collaborationspharma.com/assay-central">www.collaborationspharma.com/assay-central</ext-link>) focus on allowing pharmaceutical companies or individuals to leverage their internal databases. In a recent study, over 5,000 machine learning models built from data sets extracted from ChEMBL were made available on the AssayCentral platform (<xref ref-type="bibr" rid="B39">Lane et al., 2020</xref>).</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Open-access software tools for drug activity prediction.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Software</th>
<th align="center">Description</th>
<th align="center">Distribution</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">HergSPred (<xref ref-type="bibr" rid="B88">Zhang et al., 2022b</xref>)</td>
<td align="center">hERG blockers/non-blockers</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">MolPredictX (<xref ref-type="bibr" rid="B65">Scotti et al., 2022</xref>)</td>
<td align="center">predictions for 27 diseases</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">mycoCSM (<xref ref-type="bibr" rid="B51">Pires and Ascher, 2020</xref>)</td>
<td align="center">screen hits against Mycobacteria</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">pdCSM-PPI (<xref ref-type="bibr" rid="B57">Rodrigues et al., 2021</xref>)</td>
<td align="center">Protein-Protein Interaction Inhibitors</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">pdCSM-GPCR (<xref ref-type="bibr" rid="B73">Velloso et al., 2021</xref>)</td>
<td align="center">GPCR inhibitors</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">cardioToxCSM (<xref ref-type="bibr" rid="B28">Iftkhar et al., 2022</xref>)</td>
<td align="center">Cardiotoxicity</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">pdCSM-cancer (<xref ref-type="bibr" rid="B2">Al-Jarf et al., 2021</xref>)</td>
<td align="center">Cancer drugs</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">ChemBC (<xref ref-type="bibr" rid="B27">He et al., 2021</xref>)</td>
<td align="center">Breast Cancer</td>
<td align="center">Web/Standalone</td>
</tr>
<tr>
<td align="center">ChemTB (<xref ref-type="bibr" rid="B85">Ye et al., 2021</xref>)</td>
<td align="center">
<italic>Mycobacterium tuberculosis</italic>
</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">MAIP (<xref ref-type="bibr" rid="B9">Bosc et al., 2021</xref>)</td>
<td align="center">blood-stage malaria inhibitors</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">S2DV (<xref ref-type="bibr" rid="B66">Shao et al., 2022</xref>)</td>
<td align="center">anti-hepatitis B drug screening</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">HRGCN (<xref ref-type="bibr" rid="B79">Wu et al., 2021a</xref>)</td>
<td align="center">Toxicity, HIV and BACE inhibitor</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">MolPMoFiT (<xref ref-type="bibr" rid="B71">Tinivella et al., 2021</xref>)</td>
<td align="center">HIV and BBB penetration</td>
<td align="center">Standalone</td>
</tr>
<tr>
<td align="center">HIVprotI (<xref ref-type="bibr" rid="B52">Qureshi et al., 2018</xref>)</td>
<td align="center">HIV protein inhibitors</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">EBOLApred (<xref ref-type="bibr" rid="B1">Adams et al., 2022</xref>)</td>
<td align="center">Ebola virus cell entry inhibitors</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">embryoTox (<xref ref-type="bibr" rid="B4">Aljarf et al., 2023</xref>)</td>
<td align="center">Teratogenicity of Small Molecules</td>
<td align="center">Web</td>
</tr>
<tr>
<td align="center">InflamNat (<xref ref-type="bibr" rid="B87">Zhang et al., 2022a</xref>)</td>
<td align="center">anti-inflammatory drug screening</td>
<td align="center">Web</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>This article presents FP-MAP, a fast fingerprint-based bioactivity prediction tool to help identify active molecules for a number of pharmaceutically relevant targets. In particular, FP-MAP sets out to assemble predictive models for diseases and targets for which there is currently no publicly available software. To build the models, 12 different fingerprints were trialled and the best-performing models (based on 5-fold cross-validated statistics) were retained. A pre-assessment step was carried out in which the predictive ability of the fingerprint models was found to be comparable to, or an improvement over, previously reported results for multiple data sets. For the classification models computed for severely imbalanced data sets, moderate to high area under the ROC curve (AUC) values of 0.61&#x2013;0.95 were obtained. FP-MAP currently offers 24 different classification models for rapid screening of compounds against a number of diseases caused by bacteria and parasites, such as schistosomiasis, cholera and malaria, as well as other targets implicated in diseases such as Alzheimer&#x2019;s, cancer and cardiomyopathy. To facilitate the use of the models, the software has been made available as an easy-to-use graphical user interface and can be accessed from <ext-link ext-link-type="uri" xlink:href="https://gitlab.com/vishsoft/fpmap">https://gitlab.com/vishsoft/fpmap</ext-link>.</p>
</sec>
<sec sec-type="materials|methods" id="s2">
<title>2 Materials and methods</title>
<sec id="s2-1">
<title>2.1 Data sets studied</title>
<p>To assess the predictive ability of the fingerprint-based machine learning models, multiple data sets were analysed. A set of 79 pharmacologically important biological targets was initially used to benchmark performance, following which model performance was assessed on the more challenging targets that are described briefly in the following sections.</p>
<sec id="s2-1-1">
<title>2.1.1 Chemical toxicology</title>
<p>The toxicology data set includes 79 pharmacologically important biological targets (see <xref ref-type="sec" rid="s11">Supplementary Table S1</xref> in the SI). The compounds were extracted from ChEMBL and ToxCast and were categorized as binders if the reported activities against the human protein targets (K<sub>
<italic>i</italic>
</sub>/K<sub>
<italic>d</italic>
</sub>/IC<sub>50</sub>/EC<sub>50</sub>) were &#x2264;10&#xa0;<italic>&#x3bc;</italic>M and as non-binders if activities were <inline-formula id="inf1">
<mml:math id="m1">
<mml:mo>&#x3e;</mml:mo>
</mml:math>
</inline-formula> 10&#xa0;<italic>&#x3bc;</italic>M (<xref ref-type="bibr" rid="B5">Allen et al., 2020</xref>). For these data sets, deep learning neural networks yielded test set accuracies of 92% &#xb1; 4%.</p>
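<p>The binder/non-binder rule described above can be illustrated with a minimal sketch. The function name, the variable names and the example activity values are hypothetical and are not taken from the data set; only the 10&#xa0;<italic>&#x3bc;</italic>M cutoff comes from the text.</p>
<preformat preformat-type="code"><![CDATA[
```python
THRESHOLD_UM = 10.0  # activity cutoff in micromolar, as stated in the text


def label_binder(activity_um, threshold_um=THRESHOLD_UM):
    """Return 1 (binder) if the activity is at or below the cutoff, else 0."""
    if activity_um < 0:
        raise ValueError("activity must be non-negative")
    return 1 if activity_um <= threshold_um else 0


# Hypothetical Ki/Kd/IC50/EC50 values in micromolar, for illustration only.
activities_um = [0.05, 10.0, 10.1, 250.0]
labels = [label_binder(a) for a in activities_um]
```
]]></preformat>
<p>Note that a compound measured at exactly 10&#xa0;<italic>&#x3bc;</italic>M falls into the binder class under this rule.</p>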
</sec>
<sec id="s2-1-2">
<title>2.1.2 ExcapeDB</title>
<p>The ExcapeDB (<xref ref-type="bibr" rid="B69">Sun et al., 2017</xref>) database comprises activity data of chemical compounds on an array of protein targets. The data were extracted from publicly available databases such as PubChem and ChEMBL. A set of 12 gene targets were evaluated in this study.</p>
</sec>
<sec id="s2-1-3">
<title>2.1.3 PubChem</title>
<p>An important source of data is PubChem BioAssay (<xref ref-type="bibr" rid="B35">Kim et al., 2022</xref>), which contains small-molecule screening data. This study analyses multiple data sets drawn from the PubChem archive, with a primary focus on rare diseases related to genetic disorders and on neglected tropical diseases.</p>
<sec id="s2-1-3-1">
<title>2.1.3.1 Bubonic plague</title>
<p>YopH (<italic>Yersinia</italic> outer protein H) is a protein essential for the virulence of <italic>Yersinia pestis</italic>, the causative agent of bubonic plague. The data set consists of &#x223c;140,000 compounds that were part of a high-throughput screening assay (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/898">https://pubchem.ncbi.nlm.nih.gov/bioassay/898</ext-link>) to identify compounds that can interfere with YopH functionality. Actives were defined as those with inhibition &#x2265;50%.</p>
</sec>
<sec id="s2-1-3-2">
<title>2.1.3.2 Potassium channel blockers</title>
<p>The KCNQ1 (Potassium Voltage-Gated Channel Subfamily Q Member 1) gene codes for the potassium channel protein which is critical for electrical signaling in cells. In an effort to identify compounds that inhibit KCNQ1 potassium channels, a little over 300,000 compounds were assayed (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/2642">https://pubchem.ncbi.nlm.nih.gov/bioassay/2642</ext-link>).</p>
</sec>
<sec id="s2-1-3-3">
<title>2.1.3.3 Trypanosoma brucei hexokinase</title>
<p>
<italic>Trypanosoma brucei</italic> is a protozoan parasite that causes African sleeping sickness. Glucose metabolism is essential for the parasite, and hexokinases have been considered as important therapeutic targets. The data set consists of a little over 220,000 compounds (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/1430">https://pubchem.ncbi.nlm.nih.gov/bioassay/1430</ext-link>) where the goal was to identify specific inhibitors of <italic>Trypanosoma brucei</italic> hexokinase activity (<xref ref-type="bibr" rid="B45">Morris et al., 2006</xref>). Compounds with more than 50% inhibition are considered to be active.</p>
</sec>
<sec id="s2-1-3-4">
<title>2.1.3.4 Antimalarials</title>
<p>The MMV St. Jude malaria data set (<xref ref-type="bibr" rid="B76">Verras et al., 2017</xref>) contains a set of 305,810 compounds that were assayed for malaria blood stage inhibitory activity.</p>
</sec>
<sec id="s2-1-3-5">
<title>2.1.3.5 Leishmania</title>
<p>Leishmaniasis is a neglected disease caused by protozoan parasites. Currently no safe vaccines exist. The data set, studied earlier by <xref ref-type="bibr" rid="B14">Casanova-Alvarez et al. (2021)</xref>, includes &#x223c;196,000 compounds that were tested for inhibition of the growth and viability of <italic>Leishmania major</italic> promastigotes.</p>
</sec>
<sec id="s2-1-3-6">
<title>2.1.3.6 Activators of kallikrein-7</title>
<p>The chymotrypsin-like serine protease kallikrein-7 (K7) zymogen has been shown to play critical roles in skin diseases and tumour progression. K7 expression was significantly decreased in the brains of Alzheimer&#x2019;s disease (AD) patients (<xref ref-type="bibr" rid="B33">Kidana et al., 2018</xref>). Compounds that can directly activate K7 without a requirement for proteolytic processing can enable development of new therapeutics for cancer, skin diseases, and AD. The data set contains over 350,000 compounds (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/652039">https://pubchem.ncbi.nlm.nih.gov/bioassay/652039</ext-link>).</p>
</sec>
<sec id="s2-1-3-7">
<title>2.1.3.7 Dengue</title>
<p>Antiviral drugs against dengue infection are much needed, with an estimated 4 billion people living in areas with a risk of dengue (<ext-link ext-link-type="uri" xlink:href="https://www.who.int/news-room/fact-sheets/detail/dengue-and-severe-dengue">https://www.who.int/news-room/fact-sheets/detail/dengue-and-severe-dengue</ext-link>). The data set consists of over 10,000 compounds (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/540333">https://pubchem.ncbi.nlm.nih.gov/bioassay/540333</ext-link>), in which active compounds showed greater than 13.25% inhibition in a cytopathic effect-based assay.</p>
</sec>
<sec id="s2-1-3-8">
<title>2.1.3.8 VIM2 inhibitors</title>
<p>Antibiotic resistance caused by <italic>&#x3b2;</italic>-lactamase production presents significant challenges to the efficacy of <italic>&#x3b2;</italic>-lactam antibiotics. Given the paucity of new antibiotics, high-throughput screening assays to identify inhibitors of the Verona Integron-Encoded Metallo-<italic>&#x3b2;</italic>-Lactamase 2 (VIM-2) have been carried out.</p>
</sec>
<sec id="s2-1-3-9">
<title>2.1.3.9 Cholera</title>
<p>Cholera is an acute diarrhoeal disease caused by infection of the intestine with <italic>Vibrio cholerae</italic> bacteria. Owing to the prevalence of multi-drug resistance in these bacteria, new drugs to combat these pathogens are required. The data set contains over 130,000 compounds (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/504770">https://pubchem.ncbi.nlm.nih.gov/bioassay/504770</ext-link>), of which 350 compounds showed potent cidal activity against <italic>V. cholerae</italic>.</p>
</sec>
<sec id="s2-1-3-10">
<title>2.1.3.10 Schistosomiasis</title>
<p>Caused by parasitic worms (such as <italic>Schistosoma mansoni</italic>), schistosomiasis is prevalent in tropical and subtropical areas, particularly among poor and rural communities, with &#x2248;90% of those requiring treatment living in Africa (<ext-link ext-link-type="uri" xlink:href="https://www.who.int/news-room/fact-sheets/detail/schistosomiasis">https://www.who.int/news-room/fact-sheets/detail/schistosomiasis</ext-link>). Owing to the parasite becoming drug resistant and the lack of suitable alternative therapies, new targets and drugs for schistosomiasis treatment are of foremost importance. The data set contains over 300,000 compounds tested for inhibition of thioredoxin glutathione reductase (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/485364">https://pubchem.ncbi.nlm.nih.gov/bioassay/485364</ext-link>). Compounds defined as inconclusive were excluded from further analysis.</p>
</sec>
<sec id="s2-1-3-11">
<title>2.1.3.11 Glucocerebrosidase</title>
<p>Deficiency of <italic>&#x3b2;</italic>-glucocerebrosidase results in Gaucher disease, a rare genetic disorder for which there is no cure but which can be managed with drugs. The PubChem assay (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/360">https://pubchem.ncbi.nlm.nih.gov/bioassay/360</ext-link>) screens for small-molecule inhibitors that could potentially act as molecular chaperones on the mutant forms of <italic>&#x3b2;</italic>-glucocerebrosidase.</p>
</sec>
<sec id="s2-1-3-12">
<title>2.1.3.12 Leishmania</title>
<p>Available leishmaniasis treatments are limited and increasingly confronted by issues such as toxic side effects and chemoresistance. The data set includes close to 200,000 compounds assayed for <italic>Leishmania</italic> parasite growth inhibition (<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/1063">https://pubchem.ncbi.nlm.nih.gov/bioassay/1063</ext-link>).</p>
</sec>
</sec>
</sec>
<sec id="s2-2">
<title>2.2 Molecular fingerprint representations</title>
<p>Molecular fingerprints have a long history of use in similarity searching (<xref ref-type="bibr" rid="B46">Muegge and Hu, 2022</xref>). Their popularity can be largely attributed to their ability to evaluate vast libraries of compounds using just a fraction of the resources and time (<xref ref-type="bibr" rid="B74">Venkatraman et al., 2022</xref>) that would otherwise be required by more compute-intensive approaches. The fingerprint representations used in this study can be grouped into:<list list-type="simple">
<list-item>
<p>1. Those based on pre-defined generic substructures/keys (<xref ref-type="bibr" rid="B6">Bender et al., 2009</xref>) such as PUBCHEM (<xref ref-type="bibr" rid="B47">NCBI, 2009</xref>), Klekota-Roth (<xref ref-type="bibr" rid="B36">Klekota and Roth, 2008</xref>) and MACCS (<xref ref-type="bibr" rid="B20">Durant et al., 2002</xref>)</p>
</list-item>
<list-item>
<p>2. Circular topological fingerprints (<xref ref-type="bibr" rid="B58">Rogers and Hahn, 2010</xref>) that represent molecular structures using circular atom neighborhoods (defined by a radius). The extended connectivity fingerprints (ECFP) and functional-class fingerprints (FCFP) belong to this group.</p>
</list-item>
<list-item>
<p>3. Topological path-based fingerprints in which linear/branched paths up to a certain length are enumerated and encoded. Here, RDKit topological fingerprints (<xref ref-type="bibr" rid="B38">Landrum, 2022</xref>) of path sizes 5, 6, and 7 bonds have been used.</p>
</list-item>
</list>
</p>
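<p>The idea behind the path-based group can be sketched in a few lines: enumerate linear paths up to a given number of bonds, hash each path, and fold the hash into a fixed-length bit vector. This is a simplified illustration under stated assumptions and not the RDKit algorithm, which differs in detail. The toy graph format, the function names and the use of CRC32 as the hash are all hypothetical choices made for this example, and bond orders and branched paths are ignored.</p>
<preformat preformat-type="code"><![CDATA[
```python
import zlib

# Toy heavy-atom graph for ethanol (C-C-O); a hypothetical input format.
ATOMS = {0: "C", 1: "C", 2: "O"}
BONDS = {0: [1], 1: [0, 2], 2: [1]}


def enumerate_paths(atoms, bonds, max_bonds):
    """Return canonical label strings for linear paths of 1..max_bonds bonds."""
    found = set()

    def walk(path):
        n_bonds = len(path) - 1
        if n_bonds >= 1:
            forward = "-".join(atoms[i] for i in path)
            backward = "-".join(atoms[i] for i in reversed(path))
            found.add(min(forward, backward))  # direction-independent form
        if n_bonds == max_bonds:
            return
        for nxt in bonds[path[-1]]:
            if nxt not in path:  # simple paths only, no atom is revisited
                walk(path + [nxt])

    for start in atoms:
        walk([start])
    return found


def path_fingerprint(atoms, bonds, max_bonds=7, n_bits=1024):
    """Fold hashed paths into a fixed-length list of 0/1 bits."""
    bits = [0] * n_bits
    for path in enumerate_paths(atoms, bonds, max_bonds):
        bits[zlib.crc32(path.encode("utf-8")) % n_bits] = 1
    return bits


fp5 = path_fingerprint(ATOMS, BONDS, max_bonds=5)
fp7 = path_fingerprint(ATOMS, BONDS, max_bonds=7)
```
]]></preformat>
<p>For a molecule as small as ethanol, the 5-bond and 7-bond fingerprints coincide because no path exceeds two bonds. For larger molecules, a longer path limit can only add bits, since every shorter path is still enumerated.</p>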
<p>
<xref ref-type="table" rid="T2">Table 2</xref> provides a summary of the fingerprints used for predictive modelling. Machine learning models for a total of 12 different fingerprints adapted from a set of fingerprints studied earlier by <xref ref-type="bibr" rid="B56">Riniker and Landrum (2013)</xref> were evaluated. These fingerprints have been widely used as molecular representations with applications in similarity searching and modelling structure-activity relationships (<xref ref-type="bibr" rid="B86">Zagidullin et al., 2021</xref>; <xref ref-type="bibr" rid="B46">Muegge and Hu, 2022</xref>; <xref ref-type="bibr" rid="B48">Orosz et al., 2022</xref>). The fingerprints were generated using available routines in open source cheminformatics software such as RDKit (<xref ref-type="bibr" rid="B38">Landrum, 2022</xref>) and the Chemistry Development Kit (<xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>).</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Molecular fingerprints used for predictive modelling.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Fingerprint</th>
<th align="center">Group</th>
<th align="center">Size (bits)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">ECFP2 (<xref ref-type="bibr" rid="B58">Rogers and Hahn, 2010</xref>; <xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>)
</td>
<td align="center">Circular</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">ECFP4 (<xref ref-type="bibr" rid="B58">Rogers and Hahn, 2010</xref>; <xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>)</td>
<td align="center">Circular</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">ECFP6 (<xref ref-type="bibr" rid="B58">Rogers and Hahn, 2010</xref>; <xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>)</td>
<td align="center">Circular</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">FCFP2 (<xref ref-type="bibr" rid="B58">Rogers and Hahn, 2010</xref>; <xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>)</td>
<td align="center">Circular</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">FCFP4 (<xref ref-type="bibr" rid="B58">Rogers and Hahn, 2010</xref>; <xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>)</td>
<td align="center">Circular</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">FCFP6 (<xref ref-type="bibr" rid="B58">Rogers and Hahn, 2010</xref>; <xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>)</td>
<td align="center">Circular</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">MACCS (<xref ref-type="bibr" rid="B20">Durant et al., 2002</xref>)</td>
<td align="center">Substructure</td>
<td align="center">166</td>
</tr>
<tr>
<td align="center">PUBCHEM (<xref ref-type="bibr" rid="B47">NCBI, 2009</xref>; <xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>)</td>
<td align="center">Substructure</td>
<td align="center">881</td>
</tr>
<tr>
<td align="center">AVALON (<xref ref-type="bibr" rid="B38">Landrum, 2022</xref>)</td>
<td align="center">Substructure</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">RDK5 (<xref ref-type="bibr" rid="B38">Landrum, 2022</xref>)</td>
<td align="center">Path</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">RDK6 (<xref ref-type="bibr" rid="B38">Landrum, 2022</xref>)</td>
<td align="center">Path</td>
<td align="center">1,024</td>
</tr>
<tr>
<td align="center">RDK7 (<xref ref-type="bibr" rid="B38">Landrum, 2022</xref>)</td>
<td align="center">Path</td>
<td align="center">1,024</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>For the extended connectivity fingerprints (ECFP) and functional class fingerprints (FCFP), the values of 2, 4, and 6 indicate the diameters of the atom neighbourhoods. For RDKit fingerprints the values of 5, 6, and 7 indicate the size (in bonds) of the paths considered.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s2-3">
<title>2.3 Modelling</title>
<p>Prior to modelling, the SMILES were standardized and cleaned using MayaChemTools (<xref ref-type="bibr" rid="B68">Sud, 2016</xref>). Subsequently, for each data set, the available data were randomly split into training/calibration (80%) and test (20%) sets. Model training was performed using random forests (RF) (<xref ref-type="bibr" rid="B10">Breiman, 2001</xref>) with the number of trees set to 500. A 5-fold cross-validation on the training set was carried out to tune the parameter &#x201c;mtry&#x201d; (the number of input features randomly sampled at each split when growing the trees). Prediction performance was subsequently assessed on the test set. The train/test splitting (80:20 ratio) was repeated 3 times to assess the variability of the prediction performance and to rule out any significant dependence on the particular split. The RF models were built using the <italic>caret</italic> (<xref ref-type="bibr" rid="B37">Kuhn, 2022</xref>) and <italic>ranger</italic> (<xref ref-type="bibr" rid="B78">Wright and Ziegler, 2017</xref>) packages in R (<xref ref-type="bibr" rid="B53">R Core Team, 2022</xref>). The classification models were evaluated using the balanced accuracy score (<xref ref-type="bibr" rid="B31">Kelleher et al., 2015</xref>), which accounts for the skewness of the class distributions:<disp-formula id="e1">
<mml:math id="m2">
<mml:mrow>
<mml:mi>BACC</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>Sensitivity</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>Specificity</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>Here, the sensitivity <inline-formula id="inf2">
<mml:math id="m3">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mfrac>
<mml:mi>TP</mml:mi>
<mml:mrow>
<mml:mi>TP</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>FN</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and specificity <inline-formula id="inf3">
<mml:math id="m4">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mfrac>
<mml:mi>TN</mml:mi>
<mml:mrow>
<mml:mi>TN</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>FP</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> are defined in terms of the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). For comparison, other metrics such as the area under the curve (AUC) are also reported.</p>
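<p>As an illustration of Eq. 1, the balanced accuracy can be computed directly from the four confusion-matrix counts. The following is a minimal Python sketch and not the code used in this study (the models here were built in R); the function name and the example counts are hypothetical and were chosen only to show how the balanced accuracy differs from the plain accuracy on a skewed test set.</p>
<preformat preformat-type="code">
```python
def balanced_accuracy(tp, tn, fp, fn):
    """Return (sensitivity + specificity) / 2 from confusion-matrix counts.

    Assumes at least one active (tp + fn) and one inactive (tn + fp) compound.
    """
    sensitivity = tp / (tp + fn)  # fraction of actives correctly recovered
    specificity = tn / (tn + fp)  # fraction of inactives correctly rejected
    return (sensitivity + specificity) / 2


# Hypothetical counts for a skewed test set: 100 actives, 10,000 inactives.
tp, fn, tn, fp = 60, 40, 9500, 500
bacc = balanced_accuracy(tp=tp, tn=tn, fp=fp, fn=fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"BACC = {bacc:.3f}, accuracy = {accuracy:.3f}")
```
</preformat>
<p>In this hypothetical case the plain accuracy is dominated by the inactive class, whereas the balanced accuracy weights the two classes equally.</p>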
<p>To address the applicability domain of the models, outlier detection using isolation forests (<xref ref-type="bibr" rid="B40">Liu et al., 2008</xref>) was employed. Here, a test compound is assessed for its tendency to separate from the majority of samples using an isolation forest constructed from binary trees. Isolation forests are built from decision trees (they can be viewed as an unsupervised analogue of random forests) and work on the assumption that non-outlier points require a large number of splits (i.e., partitions) to be isolated. By contrast, anomalous points are likely to be isolated along much shorter paths. In this study, the isofor package in R was used to identify potential outliers.</p>
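<p>The relationship between path length and outlier score can be sketched as follows. This Python fragment assumes the standard scoring scheme of Liu et al. (2008), in which the mean path length of a point over all trees is normalized by the average path length expected for a sample of size n; it is not taken from the isofor package, and the function names are hypothetical.</p>
<preformat preformat-type="code">
```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): expected path length of an unsuccessful search in a binary
    search tree built from n points (normalization term of Liu et al.)."""
    if n == 2:
        return 1.0
    if n in (0, 1):
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s = 2 ** (-E[h(x)] / c(n)); short paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


# Hypothetical mean path lengths for trees grown on subsamples of 256 points.
for label, path in [("short path", 2.0), ("long path", 12.0)]:
    print(label, round(anomaly_score(path, 256), 3))
```
</preformat>
<p>Under this scheme, a point whose mean path length equals the expected value c(n) receives a score of exactly 0.5, and shorter paths push the score towards 1.</p>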
</sec>
</sec>
<sec sec-type="results|discussion" id="s3">
<title>3 Results and discussion</title>
<sec id="s3-1">
<title>3.1 Performance benchmarking</title>
<p>The performance of the fingerprint models was first assessed on the 79 targets (data summary in <xref ref-type="sec" rid="s11">Supplementary Table S1</xref> in the SI) earlier studied by <xref ref-type="bibr" rid="B5">Allen et al. (2020)</xref>. The heatmap of the balanced accuracies in <xref ref-type="sec" rid="s11">Supplementary Figure S1</xref> in the SI shows that, with the exception of a few targets such as MAPK1, PTPN11 and hERG, the fingerprint models perform well, with average accuracies (average of the BACC values across all targets) close to 0.90 for most targets (see <xref ref-type="sec" rid="s11">Supplementary Figure S2</xref> in the SI). The prediction results for the fingerprint models compare favourably with the metrics reported for deep learning neural networks (<xref ref-type="bibr" rid="B5">Allen et al., 2020</xref>), which can be attributed to the fact that the data sets are relatively balanced (positive data percentage of &#x2248;50%). The fingerprint models were also evaluated against six types of cardiac toxicity outcomes: arrhythmia, cardiac failure, heart block, hERG toxicity, hypertension, and myocardial infarction (see <xref ref-type="sec" rid="s11">Supplementary Table S1</xref> in the SI). These data sets were previously studied by <xref ref-type="bibr" rid="B28">Iftkhar et al. (2022)</xref>, who used a combination of graph-based signatures and fingerprints to build models capable of identifying molecules likely to be toxic. <xref ref-type="fig" rid="F1">Figure 1</xref> summarizes the performance of the fingerprint models, which achieve relatively better predictive performance in terms of the AUC.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Mean AUC values for each fingerprint model, averaged over the 6 cardiac toxicity related outcomes. Error bars indicate the variability (standard deviation) of the obtained AUCs. Individual prediction performances of the models can be seen in <xref ref-type="sec" rid="s11">Supplementary Figure S3</xref> in the SI.</p>
</caption>
<graphic xlink:href="fchem-11-1239467-g001.tif"/>
</fig>
<p>As further validation of the fingerprint models, predictive performance was also evaluated on a series of structurally diverse data sets consisting of 33,757 active and 21,152 inactive compounds for different breast cancer cell lines. The data sets were earlier studied by <xref ref-type="bibr" rid="B27">He et al. (2021)</xref>, who tested a number of descriptor-based machine learning models such as na&#xef;ve Bayes (NB), support vector machine (SVM), <italic>k</italic>-nearest neighbors (KNN) and extreme gradient boosting (XGB), as well as deep learning methods. Comparison of the metrics obtained for the fingerprint models with those reported by <xref ref-type="bibr" rid="B27">He et al. (2021)</xref> shows that the former achieve higher predictive accuracies with BACC <inline-formula id="inf4">
<mml:math id="m5">
<mml:mo>&#x3e;</mml:mo>
<mml:mn>0.70</mml:mn>
</mml:math>
</inline-formula> (see <xref ref-type="fig" rid="F2">Figure 2</xref>).</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Plot shows the average BACC values for each fingerprint model averaged over 14 breast cancer cell lines. Error bars indicate the variability (standard deviation) of the obtained accuracies. A target-wise summary of the prediction performances of the models can be seen in <xref ref-type="sec" rid="s11">Supplementary Figure S4</xref> in the SI.</p>
</caption>
<graphic xlink:href="fchem-11-1239467-g002.tif"/>
</fig>
<p>Overall, the performance on multiple data sets clearly shows that fingerprints have good predictive power. The majority of the data, however, have minimal skew, i.e., near equal numbers of actives and inactives, with some even displaying a greater bias towards active compounds. Most machine learning approaches are likely to yield strong performances for such balanced data distributions. Data sets drawn from PubChem are typically strongly imbalanced, and the question is whether fingerprints can yield robust structure&#x2013;activity relationship models for such data.</p>
</sec>
<sec id="s3-2">
<title>3.2 Performance evaluation of selected bioactivity data sets</title>
<p>Encouraged by the performance of the fingerprints on the different targets, model performance was further assessed on 24 different bioactivity data sets. <xref ref-type="table" rid="T3">Table 3</xref> lists the balanced accuracies for the calibration/test sets (average of 3 independent trials) obtained for the targets. Although the performance varies, the fingerprint models generally yield reasonable results even for cases with severe imbalance. The heatmap in <xref ref-type="fig" rid="F3">Figure 3</xref> shows that in a number of cases, such as potassium channel inhibitors, KDM4E, LMNA and TARDBP, the selected fingerprints show only a marginal difference in performance, with balanced accuracies &#x2248;0.55. Among the fingerprints evaluated in this study, those most frequently seen to perform well include AVALON, ECFP2/FCFP4/FCFP6 and RDK5.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Summary of the random forest classification performance for the various data sets studied.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Disease/Target</th>
<th align="center">Source</th>
<th align="center">&#x23;Active/&#x23;Inactive</th>
<th align="center">Fingerprint</th>
<th align="center">BACC (Cal/Test)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Malaria</td>
<td align="center">St Jude</td>
<td align="center">2507/303303</td>
<td align="center">FCFP6</td>
<td align="center">0.636/0.640</td>
</tr>
<tr>
<td align="center">Kallikrein-7 activator</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn1">
<sup>a</sup>
</xref>
</td>
<td align="center">3324/365562</td>
<td align="center">RDK5</td>
<td align="center">0.683/0.689</td>
</tr>
<tr>
<td align="center">Hepatitis</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn2">
<sup>b</sup>
</xref>
</td>
<td align="center">8443/200362</td>
<td align="center">FCFP4</td>
<td align="center">0.594/0.605</td>
</tr>
<tr>
<td align="center">VIM2 Inhibitor</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn3">
<sup>c</sup>
</xref>
</td>
<td align="center">2575/288127</td>
<td align="center">FCFP4</td>
<td align="center">0.646/0.648</td>
</tr>
<tr>
<td align="center">Leishmania</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn4">
<sup>d</sup>
</xref>
</td>
<td align="center">17630/178543</td>
<td align="center">FCFP4</td>
<td align="center">0.638/0.647</td>
</tr>
<tr>
<td align="center">Schistosomiasis</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn5">
<sup>e</sup>
</xref>
</td>
<td align="center">10701/331424</td>
<td align="center">RDK5</td>
<td align="center">0.686/0.706</td>
</tr>
<tr>
<td align="center">Potassium Channels</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn6">
<sup>f</sup>
</xref>
</td>
<td align="center">3878/301707</td>
<td align="center">RDK5</td>
<td align="center">0.547/0.550</td>
</tr>
<tr>
<td align="center"><italic>T. brucei</italic> hexokinase</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn7">
<sup>g</sup>
</xref>
</td>
<td align="center">239/220096</td>
<td align="center">AVALON</td>
<td align="center">0.536/0.521</td>
</tr>
<tr>
<td align="center">Bubonic Plague</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn8">
<sup>h</sup>
</xref>
</td>
<td align="center">223/139693</td>
<td align="center">RDK5</td>
<td align="center">0.598/0.572</td>
</tr>
<tr>
<td align="center">
<italic>Vibrio cholerae</italic>
</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn9">
<sup>i</sup>
</xref>
</td>
<td align="center">350/132090</td>
<td align="center">PUBCHEM</td>
<td align="center">0.557/0.578</td>
</tr>
<tr>
<td align="center">Dengue</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn10">
<sup>j</sup>
</xref>
</td>
<td align="center">318/9920</td>
<td align="center">AVALON</td>
<td align="center">0.532/0.540</td>
</tr>
<tr>
<td align="center">Glucocerebrosidase</td>
<td align="center">PubChem<xref ref-type="table-fn" rid="Tfn11">
<sup>k</sup>
</xref>
</td>
<td align="center">549/45729</td>
<td align="center">FCFP4</td>
<td align="center">0.571/0.547</td>
</tr>
<tr>
<td align="center">HSD17B10</td>
<td align="center">ExcapeDB</td>
<td align="center">3408/11510</td>
<td align="center">AVALON</td>
<td align="center">0.592/0.593</td>
</tr>
<tr>
<td align="center">KDM4E</td>
<td align="center">ExcapeDB</td>
<td align="center">3938/35058</td>
<td align="center">FCFP4</td>
<td align="center">0.553/0.552</td>
</tr>
<tr>
<td align="center">TARDBP</td>
<td align="center">ExcapeDB</td>
<td align="center">12128/387760</td>
<td align="center">RDK5</td>
<td align="center">0.518/0.510</td>
</tr>
<tr>
<td align="center">TDP1</td>
<td align="center">ExcapeDB</td>
<td align="center">23083/276558</td>
<td align="center">AVALON</td>
<td align="center">0.679/0.692</td>
</tr>
<tr>
<td align="center">DRD2</td>
<td align="center">ExcapeDB</td>
<td align="center">8323/343206</td>
<td align="center">ECFP2</td>
<td align="center">0.947/0.949</td>
</tr>
<tr>
<td align="center">FEN1</td>
<td align="center">ExcapeDB</td>
<td align="center">1041/381446</td>
<td align="center">AVALON</td>
<td align="center">0.556/0.548</td>
</tr>
<tr>
<td align="center">GSK3B</td>
<td align="center">ExcapeDB</td>
<td align="center">3268/300183</td>
<td align="center">ECFP2</td>
<td align="center">0.843/0.833</td>
</tr>
<tr>
<td align="center">HDAC3</td>
<td align="center">ExcapeDB</td>
<td align="center">354/311367</td>
<td align="center">ECFP2</td>
<td align="center">0.864/0.900</td>
</tr>
<tr>
<td align="center">JAK2</td>
<td align="center">ExcapeDB</td>
<td align="center">2135/213875</td>
<td align="center">FCFP6</td>
<td align="center">0.851/0.866</td>
</tr>
<tr>
<td align="center">LMNA</td>
<td align="center">ExcapeDB</td>
<td align="center">14742/171388</td>
<td align="center">AVALON</td>
<td align="center">0.525/0.515</td>
</tr>
<tr>
<td align="center">POLK</td>
<td align="center">ExcapeDB</td>
<td align="center">823/392317</td>
<td align="center">MACCS</td>
<td align="center">0.623/0.613</td>
</tr>
<tr>
<td align="center">ALOX15</td>
<td align="center">ExcapeDB</td>
<td align="center">1925/110264</td>
<td align="center">AVALON</td>
<td align="center">0.592/0.588</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>The final column shows the mean balanced accuracy (over 3 repetitions) achieved by the best performing fingerprint on the calibration (80%) and test (20%) sets. See also <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
</fn>
<fn id="Tfn1">
<label>
<sup>a</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/652039">https://pubchem.ncbi.nlm.nih.gov/bioassay/652039</ext-link>
</p>
</fn>
<fn id="Tfn2">
<label>
<sup>b</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/651820">https://pubchem.ncbi.nlm.nih.gov/bioassay/651820</ext-link>
</p>
</fn>
<fn id="Tfn3">
<label>
<sup>c</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/1527">https://pubchem.ncbi.nlm.nih.gov/bioassay/1527</ext-link>
</p>
</fn>
<fn id="Tfn4">
<label>
<sup>d</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/1063">https://pubchem.ncbi.nlm.nih.gov/bioassay/1063</ext-link>
</p>
</fn>
<fn id="Tfn5">
<label>
<sup>e</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/485364">https://pubchem.ncbi.nlm.nih.gov/bioassay/485364</ext-link>
</p>
</fn>
<fn id="Tfn6">
<label>
<sup>f</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/2642">https://pubchem.ncbi.nlm.nih.gov/bioassay/2642</ext-link>
</p>
</fn>
<fn id="Tfn7">
<label>
<sup>g</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/1430">https://pubchem.ncbi.nlm.nih.gov/bioassay/1430</ext-link>
</p>
</fn>
<fn id="Tfn8">
<label>
<sup>h</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/898">https://pubchem.ncbi.nlm.nih.gov/bioassay/898</ext-link>
</p>
</fn>
<fn id="Tfn9">
<label>
<sup>i</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/504770">https://pubchem.ncbi.nlm.nih.gov/bioassay/504770</ext-link>
</p>
</fn>
<fn id="Tfn10">
<label>
<sup>j</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/540333">https://pubchem.ncbi.nlm.nih.gov/bioassay/540333</ext-link>
</p>
</fn>
<fn id="Tfn11">
<label>
<sup>k</sup>
</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://pubchem.ncbi.nlm.nih.gov/bioassay/360">https://pubchem.ncbi.nlm.nih.gov/bioassay/360</ext-link>
</p>
</fn>
</table-wrap-foot>
</table-wrap>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Heatmap of the 5-fold cross-validated balanced accuracies (mean of 3 runs) achieved by the different fingerprint models.</p>
</caption>
<graphic xlink:href="fchem-11-1239467-g003.tif"/>
</fig>
<p>The fingerprint performance was compared with that of a graph isomorphism network (<xref ref-type="bibr" rid="B84">Xu et al., 2019</xref>; <xref ref-type="bibr" rid="B80">Wu et al., 2021b</xref>) (GIN) which is a powerful graph neural network (GNN) for graph classification (<xref ref-type="bibr" rid="B34">Kim and Ye, 2020</xref>). Using the torchdrug (<xref ref-type="bibr" rid="B89">Zhu et al., 2022</xref>) machine learning framework, the GIN was built with 4 hidden layers (number of hidden units set to 256), using an Adam optimizer and binary cross entropy loss function with batch normalization applied to every hidden layer. The model was subsequently trained for 100 epochs with the splits for train/valid/test sets set to 60%, 20% and 20% respectively. The barplots in <xref ref-type="fig" rid="F4">Figure 4</xref> show the comparison of the test set AUCs (mean of 3 independent runs) achieved by the RF and GNN models. As can be seen from the plots, for the majority of the data sets, RF models achieve relatively better metrics while for others the performances are comparable.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Comparison of random forest fingerprint models with graph isomorphism networks for the test sets (average of 3 random selections).</p>
</caption>
<graphic xlink:href="fchem-11-1239467-g004.tif"/>
</fig>
<p>For all data sets, isolation forest (built using 500 trees) based outlier scores were calculated. Here, values closer to 1 indicate potential outliers while those around 0.50 typically suggest average outlierness. Values closer to 0 are more difficult to categorize. <xref ref-type="sec" rid="s11">Supplementary Figure S5</xref> in the SI shows the histograms of the distributions of the calculated scores. Examination of the plots shows that for most of the data sets studied here, a cutoff of 0.5 (for some a lower value is recommended) may be used as a decision threshold to identify potential outliers (see <xref ref-type="sec" rid="s11">Supplementary Figure S6</xref> in the SI). In contrast to distance-based approaches [such as the local outlier factor (<xref ref-type="bibr" rid="B11">Breunig et al., 2000</xref>) and one-class support vector machines (<xref ref-type="bibr" rid="B15">Chen et al., 2013</xref>)], where the algorithms typically try to fit the regions where the training data are most heavily concentrated, isolation forests do not use any distance metrics and instead rely on the concept that an ensemble of random trees is likely to produce shorter path lengths for outliers.</p>
<p>The model performance, although encouraging for some data sets, needs significant improvement, especially for data sets where the number of actives is quite low. While a case for balanced data sets can be made, the skewed ratio between active and inactive compounds is a realistic representation of high-throughput screening hit rates, which are typically <inline-formula id="inf5">
<mml:math id="m6">
<mml:mo>&#x3c;</mml:mo>
</mml:math>
</inline-formula> 1% (<xref ref-type="bibr" rid="B19">Dreiman et al., 2021</xref>). For some data sets, improved performances were seen with substructure fingerprints such as AVALON that are based on pre-defined generic substructure patterns. For others, fingerprints such as ECFP/FCFP that take into account the neighborhood of each atom yielded slightly better classification models. Nonetheless, for many of the data sets (see <xref ref-type="fig" rid="F3">Figure 3</xref>), the model metrics showed only marginal differences. In an earlier study, <xref ref-type="bibr" rid="B56">Riniker and Landrum (2013)</xref> observed strong correlations between the fingerprints, which may explain the similarities in the obtained metrics. Overall, the choice of which fingerprint to use for modelling is far from trivial and is to a large extent dependent on the target. In this study, Avalon and FCFP4 fingerprints generally stand out as useful descriptors and may serve as starting points for future benchmarking studies. A potential avenue for improvement in prediction performance could be to combine 2D fingerprints with structure-based graph representations (<xref ref-type="bibr" rid="B16">Choo et al., 2023</xref>). Alternatively, one may look towards language representations, which have recently been shown to yield good results on several classification and regression benchmarks (<xref ref-type="bibr" rid="B59">Ross et al., 2022</xref>).</p>
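<p>To make the degree of imbalance concrete, the percentage of actives can be computed from the active/inactive counts listed in Table 3. The short Python sketch below does this for three of the rows; the function and variable names are hypothetical, and the percentages describe only the composition of the assembled data sets, not necessarily the hit rate of any single screen.</p>
<preformat preformat-type="code">
```python
# Active/inactive counts copied from three rows of Table 3.
counts = {
    "Malaria": (2507, 303303),
    "DRD2": (8323, 343206),
    "Dengue": (318, 9920),
}


def active_percentage(n_active, n_inactive):
    """Percentage of actives among all compounds in a data set."""
    return 100.0 * n_active / (n_active + n_inactive)


rates = {name: active_percentage(*pair) for name, pair in counts.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% active")
```
</preformat>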
</sec>
<sec id="s3-3">
<title>3.3 Performance on regression tasks</title>
<p>Given the relative success of the fingerprint-based RF classification models, an immediate question is whether the performances can be replicated for regression tasks. To this end, RF regression models were computed for a number of previously analysed data sets that used graph-based signatures and other auxiliary attributes to identify potential candidates against <italic>Mycobacterium tuberculosis</italic> (<xref ref-type="bibr" rid="B51">Pires and Ascher, 2020</xref>), cancer (<xref ref-type="bibr" rid="B2">Al-Jarf et al., 2021</xref>), and G protein-coupled receptors (<xref ref-type="bibr" rid="B73">Velloso et al., 2021</xref>) (GPCRs). A total of 1904 fingerprint-RF models were computed, spanning 36 different GPCRs, 8 organism-specific Mycobacteria species (<italic>M. avium, M. caseum, M. kansasii, M. phlei, M. tuberculosis, M. bovis, M. fortuitum, M. smegmatis</italic> and <italic>M. intracellulare</italic>) and 74 distinct cancer cell lines corresponding to 9 tumor types (renal, breast, CNS, colon, leukemia, melanoma, non small cell lung, ovarian, prostate, and small cell lung). <xref ref-type="sec" rid="s11">Supplementary Figures S7&#x2013;S9</xref> in the SI summarize the regression performances of the different fingerprints. When compared with the graph-signature-based approaches, although marginal improvements were seen for some cases, the overall performance measured in terms of the squared Pearson correlation (<italic>R</italic>
<sup>2</sup>) was largely found to be comparable, with only models for tuberculosis yielding slightly lower <italic>R</italic>
<sup>2</sup> values (see <xref ref-type="fig" rid="F5">Figure 5</xref>). The fingerprint performance observed for these data sets mirrors the trends seen for a number of ADMET-related responses that were studied in a previous article [see (<xref ref-type="bibr" rid="B75">Venkatraman, 2021</xref>)] and suggests that purely fingerprint-based models may have low predictive utility for regression.</p>
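<p>For reference, the squared Pearson correlation between observed and predicted values can be computed as in the following minimal Python sketch. It is an illustration of the metric only, not the evaluation code used in this study; the function name and the example values are hypothetical.</p>
<preformat preformat-type="code">
```python
def squared_pearson(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    covariance = sum(
        (o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted)
    )
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return covariance ** 2 / (var_obs * var_pred)


# Hypothetical observed and predicted activities.
r2 = squared_pearson([5.1, 6.3, 7.0, 8.2], [5.4, 6.0, 7.3, 7.9])
print(round(r2, 3))
```
</preformat>
<p>Because it measures only linear association, this quantity is unaffected by a constant offset or rescaling of the predictions, which should be kept in mind when comparing it with error-based metrics.</p>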
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Mean <italic>R</italic>
<sup>2</sup> obtained by the fingerprint models for different data sets <bold>(A)</bold> tuberculosis (<xref ref-type="bibr" rid="B51">Pires and Ascher, 2020</xref>) <bold>(B)</bold> GPCR (<xref ref-type="bibr" rid="B73">Velloso et al., 2021</xref>) and <bold>(C)</bold> cancer (<xref ref-type="bibr" rid="B2">Al-Jarf et al., 2021</xref>).</p>
</caption>
<graphic xlink:href="fchem-11-1239467-g005.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<title>4 Software implementation and usage</title>
<p>Fingerprint calculations were carried out using the CDK (<xref ref-type="bibr" rid="B77">Willighagen et al., 2017</xref>) and RDKit (<xref ref-type="bibr" rid="B38">Landrum, 2022</xref>) libraries. Random forest models were built using the R (<xref ref-type="bibr" rid="B53">R Core Team, 2022</xref>) package ranger (<xref ref-type="bibr" rid="B78">Wright and Ziegler, 2017</xref>). The models were subsequently converted to the Predictive Model Markup Language (PMML), an XML format that facilitates sharing of models between PMML-compliant applications. For ease of use, a Java-based graphical user interface (see <xref ref-type="fig" rid="F6">Figure 6</xref>) has been created, which integrates the Java Evaluator API (<ext-link ext-link-type="uri" xlink:href="https://github.com/jpmml">https://github.com/jpmml</ext-link>) for model evaluation. In addition to the GUI, FP-MAP has also been made available as a command line interface.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Graphical user interface for FP-MAP. Users can upload a SMILES file (&#x201c;Batch processing&#x201d;) or alternatively enter a single SMILES string for evaluation. Prediction results are written to the output file specified.</p>
</caption>
<graphic xlink:href="fchem-11-1239467-g006.tif"/>
</fig>
</sec>
<sec sec-type="conclusion" id="s5">
<title>5 Conclusion</title>
<p>This article sets out to assemble a comprehensive catalogue of predictive models for small molecules with potential bioactivity against various targets and diseases. Previous studies have provided only fragments of the large spectrum of molecule pharmacodynamics and bioactivity prediction models, many of which are not easily accessible. Encouraged by the initial predictive performance of the fingerprints on over 80 targets, for which close to 1,000 models were computed, machine learning algorithms were applied to a number of important targets for which freely accessible prediction models are not available (to the best of the author&#x2019;s knowledge). For the 24 data sets included in the current release of the software, the fingerprint-based binary classification performances for severely imbalanced data sets ranged from moderate (AUC &#x2248;0.61) to high (AUC &#x3e;0.90) and were comparable to or better than those of the graph neural network models tested. FP-MAP provides a simple and easy-to-use platform for predicting the activity of novel compounds as well as for benchmarking studies. As more and more curated data sets emerge (<xref ref-type="bibr" rid="B8">B&#xe9;quignon et al., 2023</xref>; <xref ref-type="bibr" rid="B12">Buterez et al., 2023</xref>), future efforts will focus on expanding the palette of targets and diseases.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.7983590">https://doi.org/10.5281/zenodo.7983590</ext-link>.</p>
</sec>
<sec id="s7">
<title>Author contributions</title>
<p>VV conceived and designed the study, performed the data analysis and wrote the paper.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>VV acknowledges financial support from the Research Council of Norway (Grant No. 262152).</p>
</sec>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of interest</title>
<p>The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s11">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fchem.2023.1239467/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fchem.2023.1239467/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet1.PDF" id="SM1" mimetype="application/PDF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Adams</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Agyenkwa-Mawuli</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Agyapong</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Wilson</surname>
<given-names>M. D.</given-names>
</name>
<name>
<surname>Kwofie</surname>
<given-names>S. K.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>EBOLApred: a machine learning-based web application for predicting cell entry inhibitors of the ebola virus</article-title>. <source>Comput. Biol. Chem.</source> <volume>101</volume>, <fpage>107766</fpage>. <pub-id pub-id-type="doi">10.1016/j.compbiolchem.2022.107766</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Al-Jarf</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>de S&#xe1;</surname>
<given-names>A. G. C.</given-names>
</name>
<name>
<surname>Pires</surname>
<given-names>D. E. V.</given-names>
</name>
<name>
<surname>Ascher</surname>
<given-names>D. B.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>pdCSM-cancer: using graph-based signatures to identify small molecules with anticancer properties</article-title>. <source>J. Chem. Inf. Model.</source> <volume>61</volume>, <fpage>3314</fpage>&#x2013;<lpage>3322</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.1c00168</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aleksi&#x107;</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Seeliger</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>J. B.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>ADMET predictability at boehringer ingelheim: state-of-the-art, and do bigger datasets or algorithms make a difference?</article-title> <source>Mol. Inf.</source> <volume>41</volume>, <fpage>2100113</fpage>. <pub-id pub-id-type="doi">10.1002/minf.202100113</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aljarf</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Pires</surname>
<given-names>D. E. V.</given-names>
</name>
<name>
<surname>Ascher</surname>
<given-names>D. B.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>embryotox: using graph-based signatures to predict the teratogenicity of small molecules</article-title>. <source>J. Chem. Inf. Model.</source> <volume>63</volume> (<issue>2</issue>), <fpage>432</fpage>&#x2013;<lpage>441</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.2c00824</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Allen</surname>
<given-names>T. E. H.</given-names>
</name>
<name>
<surname>Wedlake</surname>
<given-names>A. J.</given-names>
</name>
<name>
<surname>Gel&#x17e;inyt&#x117;</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Gong</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Goodman</surname>
<given-names>J. M.</given-names>
</name>
<name>
<surname>Gutsell</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Neural network activation similarity: a new measure to assist decision making in chemical toxicology</article-title>. <source>Chem. Sci.</source> <volume>11</volume>, <fpage>7335</fpage>&#x2013;<lpage>7348</lpage>. <pub-id pub-id-type="doi">10.1039/d0sc01637c</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bender</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Jenkins</surname>
<given-names>J. L.</given-names>
</name>
<name>
<surname>Scheiber</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Sukuru</surname>
<given-names>S. C. K.</given-names>
</name>
<name>
<surname>Glick</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Davies</surname>
<given-names>J. W.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>How similar are similarity searching methods? A principal component analysis of molecular descriptor space</article-title>. <source>J. Chem. Inf. Model.</source> <volume>49</volume>, <fpage>108</fpage>&#x2013;<lpage>119</lpage>. <pub-id pub-id-type="doi">10.1021/ci800249s</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bender</surname>
<given-names>B. J.</given-names>
</name>
<name>
<surname>Gahbauer</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Luttens</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lyu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Webb</surname>
<given-names>C. M.</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>R. M.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>A practical guide to large-scale docking</article-title>. <source>Nat. Protoc.</source> <volume>16</volume>, <fpage>4799</fpage>&#x2013;<lpage>4832</lpage>. <pub-id pub-id-type="doi">10.1038/s41596-021-00597-z</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>B&#xe9;quignon</surname>
<given-names>O. J. M.</given-names>
</name>
<name>
<surname>Bongers</surname>
<given-names>B. J.</given-names>
</name>
<name>
<surname>Jespers</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>IJzerman</surname>
<given-names>A. P.</given-names>
</name>
<name>
<surname>van der Water</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>van Westen</surname>
<given-names>G. J. P.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Papyrus: a large-scale curated dataset aimed at bioactivity predictions</article-title>. <source>J. Cheminform.</source> <volume>15</volume>, <fpage>3</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-022-00672-x</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bosc</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Felix</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Arcila</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Mendez</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Saunders</surname>
<given-names>M. R.</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>D. V. S.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>MAIP: a web service for predicting blood-stage malaria inhibitors</article-title>. <source>J. Cheminform.</source> <volume>13</volume>, <fpage>13</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-021-00487-2</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breiman</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2001</year>). <article-title>Random forests</article-title>. <source>Mach. Learn.</source> <volume>45</volume>, <fpage>5</fpage>&#x2013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1023/a:1010933404324</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breunig</surname>
<given-names>M. M.</given-names>
</name>
<name>
<surname>Kriegel</surname>
<given-names>H.-P.</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>R. T.</given-names>
</name>
<name>
<surname>Sander</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>LOF: identifying density-based local outliers</article-title>. <source>ACM SIGMOD Rec.</source> <volume>29</volume>, <fpage>93</fpage>&#x2013;<lpage>104</lpage>. <pub-id pub-id-type="doi">10.1145/335191.335388</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buterez</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Janet</surname>
<given-names>J. P.</given-names>
</name>
<name>
<surname>Kiddle</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Li&#xf2;</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>MF-PCBA: multifidelity high-throughput screening benchmarks for drug discovery and machine learning</article-title>. <source>J. Chem. Inf. Model.</source> <volume>63</volume>, <fpage>2667</fpage>&#x2013;<lpage>2678</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.2c01569</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Capuzzi</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>I. S.-J.</given-names>
</name>
<name>
<surname>Lam</surname>
<given-names>W. I.</given-names>
</name>
<name>
<surname>Thornton</surname>
<given-names>T. E.</given-names>
</name>
<name>
<surname>Muratov</surname>
<given-names>E. N.</given-names>
</name>
<name>
<surname>Pozefsky</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Chembench: a publicly accessible, integrated cheminformatics portal</article-title>. <source>J. Chem. Inf. Model.</source> <volume>57</volume>, <fpage>105</fpage>&#x2013;<lpage>108</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.6b00462</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Casanova-Alvarez</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Morales-Helguera</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Cabrera-P&#xe9;rez</surname>
<given-names>M. &#xc1;.</given-names>
</name>
<name>
<surname>Molina-Ruiz</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Molina</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A novel automated framework for QSAR modeling of highly imbalanced <italic>Leishmania</italic> high-throughput screening data</article-title>. <source>J. Chem. Inf. Model.</source> <volume>61</volume>, <fpage>3213</fpage>&#x2013;<lpage>3231</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.0c01439</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Saligrama</surname>
<given-names>V.</given-names>
</name>
</person-group> &#x201c;<article-title>A new one-class SVM for anomaly detection</article-title>,&#x201d; in <conf-name>Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE)</conf-name>, <conf-loc>Vancouver, BC, Canada</conf-loc>, <conf-date>May 2013</conf-date>. <pub-id pub-id-type="doi">10.1109/icassp.2013.6638322</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Choo</surname>
<given-names>H. Y.</given-names>
</name>
<name>
<surname>Wee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Xia</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Fingerprint-enhanced graph attention network (FinGAT) model for antibiotic discovery</article-title>. <source>J. Chem. Inf. Model.</source> <volume>63</volume>, <fpage>2928</fpage>&#x2013;<lpage>2935</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.3c00045</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cova</surname>
<given-names>T. F. G. G.</given-names>
</name>
<name>
<surname>Pais</surname>
<given-names>A. A. C. C.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Deep learning for deep chemistry: optimizing the prediction of chemical patterns</article-title>. <source>Front. Chem.</source> <volume>7</volume>, <fpage>809</fpage>. <pub-id pub-id-type="doi">10.3389/fchem.2019.00809</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>David</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Thakkar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mercado</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Engkvist</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Molecular representations in AI-driven drug discovery: a review and practical guide</article-title>. <source>J. Cheminform.</source> <volume>12</volume>, <fpage>56</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-020-00460-5</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dreiman</surname>
<given-names>G. H.</given-names>
</name>
<name>
<surname>Bictash</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Fish</surname>
<given-names>P. V.</given-names>
</name>
<name>
<surname>Griffin</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Svensson</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Changing the HTS paradigm: AI-driven iterative screening for hit finding</article-title>. <source>SLAS Discov.</source> <volume>26</volume>, <fpage>257</fpage>&#x2013;<lpage>262</lpage>. <pub-id pub-id-type="doi">10.1177/2472555220949495</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Durant</surname>
<given-names>J. L.</given-names>
</name>
<name>
<surname>Leland</surname>
<given-names>B. A.</given-names>
</name>
<name>
<surname>Henry</surname>
<given-names>D. R.</given-names>
</name>
<name>
<surname>Nourse</surname>
<given-names>J. G.</given-names>
</name>
</person-group> (<year>2002</year>). <article-title>Reoptimization of MDL keys for use in drug discovery</article-title>. <source>J. Chem. Inf. Comput. Sci.</source> <volume>42</volume>, <fpage>1273</fpage>&#x2013;<lpage>1280</lpage>. <pub-id pub-id-type="doi">10.1021/ci010132r</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Elbadawi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gaisford</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Basit</surname>
<given-names>A. W.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Advanced machine-learning techniques in drug discovery</article-title>. <source>Drug Discov. Today</source> <volume>26</volume>, <fpage>769</fpage>&#x2013;<lpage>777</lpage>. <pub-id pub-id-type="doi">10.1016/j.drudis.2020.12.003</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gaulton</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hersey</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Nowotka</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bento</surname>
<given-names>A. P.</given-names>
</name>
<name>
<surname>Chambers</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Mendez</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>The ChEMBL database in 2017</article-title>. <source>Nucleic Acids Res.</source> <volume>45</volume>, <fpage>D945</fpage>&#x2013;<lpage>D954</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkw1074</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gentile</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Yaacoub</surname>
<given-names>J. C.</given-names>
</name>
<name>
<surname>Gleave</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fernandez</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ton</surname>
<given-names>A.-T.</given-names>
</name>
<name>
<surname>Ban</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Artificial intelligence&#x2013;enabled virtual screening of ultra-large chemical libraries with deep docking</article-title>. <source>Nat. Protoc.</source> <volume>17</volume>, <fpage>672</fpage>&#x2013;<lpage>697</lpage>. <pub-id pub-id-type="doi">10.1038/s41596-021-00659-2</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Glaser</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Vermaas</surname>
<given-names>J. V.</given-names>
</name>
<name>
<surname>Rogers</surname>
<given-names>D. M.</given-names>
</name>
<name>
<surname>Larkin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>LeGrand</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Boehm</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>High-throughput virtual laboratory for drug discovery using massive datasets</article-title>. <source>Int. J. High Perform. Comput. Appl.</source> <volume>35</volume>, <fpage>452</fpage>&#x2013;<lpage>468</lpage>. <pub-id pub-id-type="doi">10.1177/10943420211001565</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gorgulla</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Boeszoermenyi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z.-F.</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>P. D.</given-names>
</name>
<name>
<surname>Coote</surname>
<given-names>P. W.</given-names>
</name>
<name>
<surname>Das</surname>
<given-names>K. M. P.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>An open-source drug discovery platform enables ultra-large virtual screens</article-title>. <source>Nature</source> <volume>580</volume>, <fpage>663</fpage>&#x2013;<lpage>668</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-020-2117-z</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Graff</surname>
<given-names>D. E.</given-names>
</name>
<name>
<surname>Shakhnovich</surname>
<given-names>E. I.</given-names>
</name>
<name>
<surname>Coley</surname>
<given-names>C. W.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Accelerating high-throughput virtual screening through molecular pool-based active learning</article-title>. <source>Chem. Sci.</source> <volume>12</volume>, <fpage>7866</fpage>&#x2013;<lpage>7881</lpage>. <pub-id pub-id-type="doi">10.1039/d0sc06805e</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ling</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Machine learning enables accurate and rapid prediction of active molecules against breast cancer cells</article-title>. <source>Front. Pharmacol.</source> <volume>12</volume>, <fpage>796534</fpage>. <pub-id pub-id-type="doi">10.3389/fphar.2021.796534</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Iftkhar</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>de S&#xe1;</surname>
<given-names>A. G. C.</given-names>
</name>
<name>
<surname>Velloso</surname>
<given-names>J. P. L.</given-names>
</name>
<name>
<surname>Aljarf</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Pires</surname>
<given-names>D. E. V.</given-names>
</name>
<name>
<surname>Ascher</surname>
<given-names>D. B.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>cardioToxCSM: a web server for predicting cardiotoxicity of small molecules</article-title>. <source>J. Chem. Inf. Model.</source> <volume>62</volume>, <fpage>4827</fpage>&#x2013;<lpage>4836</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.2c00822</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Irwin</surname>
<given-names>J. J.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>K. G.</given-names>
</name>
<name>
<surname>Young</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dandarchuluun</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>B. R.</given-names>
</name>
<name>
<surname>Khurelbaatar</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>ZINC20&#x2014;A free ultralarge-scale chemical database for ligand discovery</article-title>. <source>J. Chem. Inf. Model.</source> <volume>60</volume>, <fpage>6065</fpage>&#x2013;<lpage>6073</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.0c00675</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jaeger</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Fulle</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Turk</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Mol2vec: unsupervised machine learning approach with chemical intuition</article-title>. <source>J. Chem. Inf. Model.</source> <volume>58</volume>, <fpage>27</fpage>&#x2013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.7b00616</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kelleher</surname>
<given-names>J. D.</given-names>
</name>
<name>
<surname>Mac Namee</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>D&#x2019;Arcy</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2015</year>). <source>Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies</source>. <publisher-loc>Cambridge, Massachusetts, United States</publisher-loc>: <publisher-name>The MIT Press</publisher-name>.</citation>
</ref>
<ref id="B32">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Therapeutics data commons</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://tdcommons.ai">https://tdcommons.ai</ext-link>
</comment>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kidana</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Tatebe</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ito</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Hara</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Kakita</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Saito</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Loss of kallikrein-related peptidase 7 exacerbates amyloid pathology in Alzheimer&#x2019;s disease model mice</article-title>. <source>EMBO Mol. Med.</source> <volume>10</volume>, <fpage>e8184</fpage>. <pub-id pub-id-type="doi">10.15252/emmm.201708184</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>B.-H.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>J. C.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Understanding graph isomorphism network for rs-fMRI functional connectivity analysis</article-title>. <source>Front. Neurosci.</source> <volume>14</volume>, <fpage>630</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2020.00630</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Gindulyte</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>PubChem 2023 update</article-title>. <source>Nucleic Acids Res.</source> <volume>51</volume>, <fpage>D1373</fpage>&#x2013;<lpage>D1380</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkac956</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klekota</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Roth</surname>
<given-names>F. P.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Chemical substructures that enrich for biological activity</article-title>. <source>Bioinformatics</source> <volume>24</volume>, <fpage>2518</fpage>&#x2013;<lpage>2525</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btn479</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Kuhn</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>caret: Classification and Regression Training. R package version 6.0-93</article-title>. <ext-link ext-link-type="uri" xlink:href="https://github.com/topepo/caret/">https://github.com/topepo/caret/</ext-link>.</citation>
</ref>
<ref id="B38">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Landrum</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>RDKit: open-source cheminformatics</article-title>. <ext-link ext-link-type="uri" xlink:href="https://www.rdkit.org">https://www.rdkit.org</ext-link>. Release: 2022.03.5.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lane</surname>
<given-names>T. R.</given-names>
</name>
<name>
<surname>Foil</surname>
<given-names>D. H.</given-names>
</name>
<name>
<surname>Minerali</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Urbina</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Zorn</surname>
<given-names>K. M.</given-names>
</name>
<name>
<surname>Ekins</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Bioactivity comparison across multiple machine learning algorithms using over 5000 datasets for drug discovery</article-title>. <source>Mol. Pharm.</source> <volume>18</volume>, <fpage>403</fpage>&#x2013;<lpage>415</lpage>. <pub-id pub-id-type="doi">10.1021/acs.molpharmaceut.0c01013</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>F. T.</given-names>
</name>
<name>
<surname>Ting</surname>
<given-names>K. M.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Z.-H.</given-names>
</name>
</person-group> &#x201c;<article-title>Isolation forest</article-title>,&#x201d; in <conf-name>Proceedings of the 2008 Eighth IEEE International Conference on Data Mining</conf-name>, <conf-loc>Pisa, Italy</conf-loc>, <conf-date>December 2008</conf-date>, <fpage>413</fpage>&#x2013;<lpage>422</lpage>. <pub-id pub-id-type="doi">10.1109/ICDM.2008.17</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>DeepScreening: a deep learning-based screening web server for accelerating drug discovery</article-title>. <source>Database</source> <volume>2019</volume>, <fpage>baz104</fpage>. <pub-id pub-id-type="doi">10.1093/database/baz104</pub-id>
</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luttens</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gullberg</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Abdurakhmanov</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Vo</surname>
<given-names>D. D.</given-names>
</name>
<name>
<surname>Akaberi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Talibov</surname>
<given-names>V. O.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Ultralarge virtual screening identifies SARS-CoV-2 main protease inhibitors with broad-spectrum activity against coronaviruses</article-title>. <source>J. Am. Chem. Soc.</source> <volume>144</volume>, <fpage>2905</fpage>&#x2013;<lpage>2920</lpage>. <pub-id pub-id-type="doi">10.1021/jacs.1c08402</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Sheridan</surname>
<given-names>R. P.</given-names>
</name>
<name>
<surname>Liaw</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Dahl</surname>
<given-names>G. E.</given-names>
</name>
<name>
<surname>Svetnik</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Deep neural nets as a method for quantitative structure activity relationships</article-title>. <source>J. Chem. Inf. Model.</source> <volume>55</volume>, <fpage>263</fpage>&#x2013;<lpage>274</lpage>. <pub-id pub-id-type="doi">10.1021/ci500747n</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mayr</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Klambauer</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Unterthiner</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Steijaert</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wegner</surname>
<given-names>J. K.</given-names>
</name>
<name>
<surname>Ceulemans</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Large-scale comparison of machine learning methods for drug target prediction on ChEMBL</article-title>. <source>Chem. Sci.</source> <volume>9</volume>, <fpage>5441</fpage>&#x2013;<lpage>5451</lpage>. <pub-id pub-id-type="doi">10.1039/c8sc00148k</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morris</surname>
<given-names>M. T.</given-names>
</name>
<name>
<surname>DeBruin</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Chambers</surname>
<given-names>J. W.</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>K. S.</given-names>
</name>
<name>
<surname>Morris</surname>
<given-names>J. C.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Activity of a second <italic>Trypanosoma brucei</italic> hexokinase is controlled by an 18-amino-acid C-terminal tail</article-title>. <source>Eukaryot. Cell</source> <volume>5</volume>, <fpage>2014</fpage>&#x2013;<lpage>2023</lpage>. <pub-id pub-id-type="doi">10.1128/ec.00146-06</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Muegge</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>How do we further enhance 2D fingerprint similarity searching for novel drug discovery?</article-title> <source>Expert Opin. Drug Discov.</source> <volume>17</volume>, <fpage>1173</fpage>&#x2013;<lpage>1176</lpage>. <pub-id pub-id-type="doi">10.1080/17460441.2022.2128332</pub-id>
</citation>
</ref>
<ref id="B47">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>NCBI</surname>
</name>
</person-group> (<year>2009</year>). <article-title>PubChem subgraph fingerprint</article-title>. <ext-link ext-link-type="uri" xlink:href="https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt">https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt</ext-link>. Version: 1.3.</citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Orosz</surname>
<given-names>&#xc1;.</given-names>
</name>
<name>
<surname>H&#xe9;berger</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>R&#xe1;cz</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Comparison of descriptor- and fingerprint sets in machine learning models for ADME-tox targets</article-title>. <source>Front. Chem.</source> <volume>10</volume>, <fpage>852893</fpage>. <pub-id pub-id-type="doi">10.3389/fchem.2022.852893</pub-id>
</citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Patel</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ihlenfeldt</surname>
<given-names>W.-D.</given-names>
</name>
<name>
<surname>Judson</surname>
<given-names>P. N.</given-names>
</name>
<name>
<surname>Moroz</surname>
<given-names>Y. S.</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Peach</surname>
<given-names>M. L.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>SAVI, <italic>in silico</italic> generation of billions of easily synthesizable compounds through expert-system type rules</article-title>. <source>Sci. Data</source> <volume>7</volume>, <fpage>384</fpage>. <pub-id pub-id-type="doi">10.1038/s41597-020-00727-4</pub-id>
</citation>
</ref>
<ref id="B50">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pillai</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Dasgupta</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sudsakorn</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Fretland</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Mavroudis</surname>
<given-names>P. D.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Machine learning guided early drug discovery of small molecules</article-title>. <source>Drug Discov. Today</source> <volume>27</volume>, <fpage>2209</fpage>&#x2013;<lpage>2215</lpage>. <pub-id pub-id-type="doi">10.1016/j.drudis.2022.03.017</pub-id>
</citation>
</ref>
<ref id="B51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pires</surname>
<given-names>D. E. V.</given-names>
</name>
<name>
<surname>Ascher</surname>
<given-names>D. B.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>mycoCSM: using graph-based signatures to identify safe potent hits against mycobacteria</article-title>. <source>J. Chem. Inf. Model.</source> <volume>60</volume>, <fpage>3450</fpage>&#x2013;<lpage>3456</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.0c00362</pub-id>
</citation>
</ref>
<ref id="B52">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qureshi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Rajput</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kaur</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>HIVprotI: an integrated web based platform for prediction and design of HIV proteins inhibitors</article-title>. <source>J. Cheminf</source> <volume>10</volume>, <fpage>12</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-018-0266-y</pub-id>
</citation>
</ref>
<ref id="B53">
<citation citation-type="book">
<collab>R Core Team</collab> (<year>2022</year>). <source>R: a language and environment for statistical computing</source>. <publisher-loc>Vienna, Austria</publisher-loc>: <publisher-name>R Foundation for Statistical Computing</publisher-name>.</citation>
</ref>
<ref id="B54">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raghunathan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Priyakumar</surname>
<given-names>U. D.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Molecular representations for machine learning applications in chemistry</article-title>. <source>Int. J. Quant. Chem.</source> <volume>122</volume>. <pub-id pub-id-type="doi">10.1002/qua.26870</pub-id>
</citation>
</ref>
<ref id="B55">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reymond</surname>
<given-names>J.-L.</given-names>
</name>
<name>
<surname>Awale</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Exploring chemical space for drug discovery using the chemical universe database</article-title>. <source>ACS Chem. Neurosci.</source> <volume>3</volume>, <fpage>649</fpage>&#x2013;<lpage>657</lpage>. <pub-id pub-id-type="doi">10.1021/cn3000422</pub-id>
</citation>
</ref>
<ref id="B56">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Riniker</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Landrum</surname>
<given-names>G. A.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Open-source platform to benchmark fingerprints for ligand-based virtual screening</article-title>. <source>J. Cheminf</source> <volume>5</volume>, <fpage>26</fpage>. <pub-id pub-id-type="doi">10.1186/1758-2946-5-26</pub-id>
</citation>
</ref>
<ref id="B57">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodrigues</surname>
<given-names>C. H. M.</given-names>
</name>
<name>
<surname>Pires</surname>
<given-names>D. E. V.</given-names>
</name>
<name>
<surname>Ascher</surname>
<given-names>D. B.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>pdCSM-PPI: using graph-based signatures to identify protein-protein interaction inhibitors</article-title>. <source>J. Chem. Inf. Model.</source> <volume>61</volume>, <fpage>5438</fpage>&#x2013;<lpage>5445</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.1c01135</pub-id>
</citation>
</ref>
<ref id="B58">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rogers</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hahn</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Extended-connectivity fingerprints</article-title>. <source>J. Chem. Inf. Model.</source> <volume>50</volume>, <fpage>742</fpage>&#x2013;<lpage>754</lpage>. <pub-id pub-id-type="doi">10.1021/ci100050t</pub-id>
</citation>
</ref>
<ref id="B59">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ross</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Belgodere</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Chenthamarakshan</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Padhi</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Mroueh</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Das</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Large-scale chemical language representations capture molecular structure and properties</article-title>. <source>Nat. Mach. Intell.</source> <volume>4</volume>, <fpage>1256</fpage>&#x2013;<lpage>1264</lpage>. <pub-id pub-id-type="doi">10.1038/s42256-022-00580-7</pub-id>
</citation>
</ref>
<ref id="B60">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ruusmann</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Sild</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Maran</surname>
<given-names>U.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>QSAR DataBank repository: open and linked qualitative and quantitative structure activity relationship models</article-title>. <source>J. Cheminf</source> <volume>7</volume>, <fpage>32</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-015-0082-6</pub-id>
</citation>
</ref>
<ref id="B61">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Soufan</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Ba-alawi</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Magana-Mora</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Essack</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bajic</surname>
<given-names>V. B.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>DPubChem: a web tool for QSAR modeling and high-throughput virtual screening</article-title>. <source>Sci. Rep.</source> <volume>8</volume>, <fpage>9110</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-018-27495-x</pub-id>
</citation>
</ref>
<ref id="B62">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sabando</surname>
<given-names>M. V.</given-names>
</name>
<name>
<surname>Ponzoni</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Milios</surname>
<given-names>E. E.</given-names>
</name>
<name>
<surname>Soto</surname>
<given-names>A. J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Using molecular embeddings in QSAR modeling: does it make a difference?</article-title> <source>Brief. Bioinform.</source> <volume>23</volume>, <fpage>bbab365</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbab365</pub-id>
</citation>
</ref>
<ref id="B63">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sadybekov</surname>
<given-names>A. A.</given-names>
</name>
<name>
<surname>Sadybekov</surname>
<given-names>A. V.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Iliopoulos-Tsoutsouvas</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>X.-P.</given-names>
</name>
<name>
<surname>Pickett</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Synthon-based ligand discovery in virtual libraries of over 11 billion compounds</article-title>. <source>Nature</source> <volume>601</volume>, <fpage>452</fpage>&#x2013;<lpage>459</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-021-04220-9</pub-id>
</citation>
</ref>
<ref id="B64">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schlander</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hernandez-Villafuerte</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>C.-Y.</given-names>
</name>
<name>
<surname>Mestre-Ferrandiz</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Baumann</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>How much does it cost to research and develop a new drug? A systematic review and assessment</article-title>. <source>PharmacoEconomics</source> <volume>39</volume>, <fpage>1243</fpage>&#x2013;<lpage>1269</lpage>. <pub-id pub-id-type="doi">10.1007/s40273-021-01065-y</pub-id>
</citation>
</ref>
<ref id="B65">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scotti</surname>
<given-names>M. T.</given-names>
</name>
<name>
<surname>Herrera-Acevedo</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>de Menezes</surname>
<given-names>R. P. B.</given-names>
</name>
<name>
<surname>Martin</surname>
<given-names>H.-J.</given-names>
</name>
<name>
<surname>Muratov</surname>
<given-names>E. N.</given-names>
</name>
<name>
<surname>de Souza Silva</surname>
<given-names>&#xc1;. &#xcd;.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>MolPredictX: online biological activity predictions by machine learning models</article-title>. <source>Mol. Inf.</source> <volume>41</volume>, <fpage>2200133</fpage>. <pub-id pub-id-type="doi">10.1002/minf.202200133</pub-id>
</citation>
</ref>
<ref id="B66">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Gong</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Pandiyan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>S2dv: converting SMILES to a drug vector for predicting the activity of anti-HBV small molecules</article-title>. <source>Brief. Bioinform.</source> <volume>23</volume>, <fpage>bbab593</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbab593</pub-id>
</citation>
</ref>
<ref id="B67">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Singh</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Chaput</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Villoutreix</surname>
<given-names>B. O.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Virtual screening web servers: designing chemical probes and drug candidates in the cyberspace</article-title>. <source>Brief. Bioinform.</source> <volume>22</volume>, <fpage>1790</fpage>&#x2013;<lpage>1818</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbaa034</pub-id>
</citation>
</ref>
<ref id="B68">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sud</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>MayaChemTools: an open source package for computational drug discovery</article-title>. <source>J. Chem. Inf. Model.</source> <volume>56</volume>, <fpage>2292</fpage>&#x2013;<lpage>2297</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.6b00505</pub-id>
</citation>
</ref>
<ref id="B69">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jeliazkova</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Chupakhin</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Golib-Dzib</surname>
<given-names>J.-F.</given-names>
</name>
<name>
<surname>Engkvist</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Carlsson</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics</article-title>. <source>J. Cheminf</source> <volume>9</volume>, <fpage>17</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-017-0203-5</pub-id>
</citation>
</ref>
<ref id="B70">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tetko</surname>
<given-names>I. V.</given-names>
</name>
<name>
<surname>Gasteiger</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Todeschini</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Mauri</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Livingstone</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ertl</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2005</year>). <article-title>Virtual computational chemistry laboratory &#x2013; design and description</article-title>. <source>J. Computer-Aided Mol. Des.</source> <volume>19</volume>, <fpage>453</fpage>&#x2013;<lpage>463</lpage>. <pub-id pub-id-type="doi">10.1007/s10822-005-8694-y</pub-id>
</citation>
</ref>
<ref id="B71">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tinivella</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pinzi</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Rastelli</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Prediction of activity and selectivity profiles of human carbonic anhydrase inhibitors using machine learning classification models</article-title>. <source>J. Cheminf</source> <volume>13</volume>, <fpage>18</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-021-00499-y</pub-id>
</citation>
</ref>
<ref id="B72">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Togo</surname>
<given-names>M. V.</given-names>
</name>
<name>
<surname>Mastrolorito</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Ciriaco</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Trisciuzzi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Tondo</surname>
<given-names>A. R.</given-names>
</name>
<name>
<surname>Gambacorta</surname>
<given-names>N.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Tiresia: an eXplainable artificial intelligence platform for predicting developmental toxicity</article-title>. <source>J. Chem. Inf. Model.</source> <volume>63</volume>, <fpage>56</fpage>&#x2013;<lpage>66</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.2c01126</pub-id>
</citation>
</ref>
<ref id="B73">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Velloso</surname>
<given-names>J. P. L.</given-names>
</name>
<name>
<surname>Ascher</surname>
<given-names>D. B.</given-names>
</name>
<name>
<surname>Pires</surname>
<given-names>D. E. V.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>pdCSM-GPCR: predicting potent GPCR ligands with graph-based signatures</article-title>. <source>Bioinform. Adv.</source> <volume>1</volume>, <fpage>vbab031</fpage>. <pub-id pub-id-type="doi">10.1093/bioadv/vbab031</pub-id>
</citation>
</ref>
<ref id="B74">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Venkatraman</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Colligan</surname>
<given-names>T. H.</given-names>
</name>
<name>
<surname>Lesica</surname>
<given-names>G. T.</given-names>
</name>
<name>
<surname>Olson</surname>
<given-names>D. R.</given-names>
</name>
<name>
<surname>Gaiser</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Copeland</surname>
<given-names>C. J.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Drugsniffer: an open source workflow for virtually screening billions of molecules for binding affinity to protein targets</article-title>. <source>Front. Pharmacol.</source> <volume>13</volume>, <fpage>874746</fpage>. <pub-id pub-id-type="doi">10.3389/fphar.2022.874746</pub-id>
</citation>
</ref>
<ref id="B75">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Venkatraman</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>FP-ADMET: a compendium of fingerprint-based ADMET prediction models</article-title>. <source>J. Cheminf</source> <volume>13</volume>, <fpage>75</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-021-00557-5</pub-id>
</citation>
</ref>
<ref id="B76">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Verras</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Waller</surname>
<given-names>C. L.</given-names>
</name>
<name>
<surname>Gedeck</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>D. V. S.</given-names>
</name>
<name>
<surname>Kogej</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Raichurkar</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Shared consensus machine learning models for predicting blood stage malaria inhibition</article-title>. <source>J. Chem. Inf. Model.</source> <volume>57</volume>, <fpage>445</fpage>&#x2013;<lpage>453</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.6b00572</pub-id>
</citation>
</ref>
<ref id="B77">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Willighagen</surname>
<given-names>E. L.</given-names>
</name>
<name>
<surname>Mayfield</surname>
<given-names>J. W.</given-names>
</name>
<name>
<surname>Alvarsson</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Berg</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Carlsson</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Jeliazkova</surname>
<given-names>N.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>The chemistry development kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching</article-title>. <source>J. Cheminf</source> <volume>9</volume>, <fpage>33</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-017-0220-4</pub-id>
</citation>
</ref>
<ref id="B78">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wright</surname>
<given-names>M. N.</given-names>
</name>
<name>
<surname>Ziegler</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>ranger: a fast implementation of random forests for high dimensional data in C&#x2b;&#x2b; and R</article-title>. <source>J. Stat. Soft.</source> <volume>77</volume>, <fpage>1</fpage>&#x2013;<lpage>17</lpage>. <pub-id pub-id-type="doi">10.18637/jss.v077.i01</pub-id>
</citation>
</ref>
<ref id="B79">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hsieh</surname>
<given-names>C.-Y.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liao</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2021a</year>). <article-title>Hyperbolic relational graph convolution networks plus: a simple but highly efficient QSAR-modeling method</article-title>. <source>Brief. Bioinform.</source> <volume>22</volume>, <fpage>bbab112</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbab112</pub-id>
</citation>
</ref>
<ref id="B80">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Long</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>P. S.</given-names>
</name>
</person-group> (<year>2021b</year>). <article-title>A comprehensive survey on graph neural networks</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> <volume>32</volume>, <fpage>4</fpage>&#x2013;<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1109/tnnls.2020.2978386</pub-id>
</citation>
</ref>
<ref id="B81">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Ramsundar</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Feinberg</surname>
<given-names>E. N.</given-names>
</name>
<name>
<surname>Gomes</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Geniesse</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Pappu</surname>
<given-names>A. S.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>MoleculeNet: a benchmark for molecular machine learning</article-title>. <source>Chem. Sci.</source> <volume>9</volume>, <fpage>513</fpage>&#x2013;<lpage>530</lpage>. <pub-id pub-id-type="doi">10.1039/c7sc02664a</pub-id>
</citation>
</ref>
<ref id="B82">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Leung</surname>
<given-names>E. L.-H.</given-names>
</name>
<name>
<surname>Lei</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets</article-title>. <source>Brief. Bioinform.</source> <volume>22</volume>, <fpage>bbaa321</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbaa321</pub-id>
</citation>
</ref>
<ref id="B83">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xiong</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Yi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Hsieh</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties</article-title>. <source>Nucleic Acids Res.</source> <volume>49</volume>, <fpage>W5</fpage>&#x2013;<lpage>W14</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkab255</pub-id>
</citation>
</ref>
<ref id="B84">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Leskovec</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jegelka</surname>
<given-names>S.</given-names>
</name>
</person-group> &#x201c;<article-title>How powerful are graph neural networks?</article-title>,&#x201d; in <conf-name>Proceedings of the 7th International Conference on Learning Representations, ICLR 2019</conf-name>, <conf-loc>New Orleans, LA, USA</conf-loc>, <conf-date>May 2019</conf-date>.</citation>
</ref>
<ref id="B85">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ye</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Chai</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Identification of active molecules against <italic>Mycobacterium tuberculosis</italic> through machine learning</article-title>. <source>Brief. Bioinform.</source> <volume>22</volume>, <fpage>bbab068</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbab068</pub-id>
</citation>
</ref>
<ref id="B86">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zagidullin</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Guan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Pitk&#xe4;nen</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Comparative analysis of molecular fingerprints in prediction of drug combination effects</article-title>. <source>Brief. Bioinform.</source> <volume>22</volume>, <fpage>bbab291</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbab291</pub-id>
</citation>
</ref>
<ref id="B87">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2022a</year>). <article-title>InflamNat: web-based database and predictor of anti-inflammatory natural products</article-title>. <source>J. Cheminf</source> <volume>14</volume>, <fpage>30</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-022-00608-5</pub-id>
</citation>
</ref>
<ref id="B88">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Mao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J. Z. H.</given-names>
</name>
</person-group> (<year>2022b</year>). <article-title>HergSPred: accurate classification of hERG blockers/nonblockers with machine-learning models</article-title>. <source>J. Chem. Inf. Model.</source> <volume>62</volume>, <fpage>1830</fpage>&#x2013;<lpage>1839</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.2c00256</pub-id>
</citation>
</ref>
<ref id="B89">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>TorchDrug: a powerful and flexible machine learning platform for drug discovery</source>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2202.08320">https://arxiv.org/abs/2202.08320</ext-link>
</citation>
</ref>
</ref-list>
</back>
</article>