<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1523015</article-id>
<article-id pub-id-type="doi">10.3389/fgene.2025.1523015</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Technology and Code</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Research on the optimization model of anti-breast cancer candidate drugs based on machine learning</article-title>
<alt-title alt-title-type="left-running-head">Dong et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fgene.2025.1523015">10.3389/fgene.2025.1523015</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Dong</surname>
<given-names>Zhou</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2883453/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chen</surname>
<given-names>Hong</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2986505/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/funding-acquisition/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Yang</surname>
<given-names>Yuchen</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2981123/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hao</surname>
<given-names>Hairong</given-names>
</name>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff>
<institution>School of Information Engineering</institution>, <institution>Xi&#x2019;an Eurasia University</institution>, <addr-line>Xi&#x2019;an</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/456271/overview">Shailender Kumar Verma</ext-link>, University of Delhi, India</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/381999/overview">Akanksha Rajput</ext-link>, University of California, San Diego, United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2835673/overview">Rokhsareh Rohban</ext-link>, Medical University of Graz, Austria</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Zhou Dong, <email>dongzhouch@outlook.com</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>04</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>16</volume>
<elocation-id>1523015</elocation-id>
<history>
<date date-type="received">
<day>05</day>
<month>11</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>03</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2025 Dong, Chen, Yang and Hao.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Dong, Chen, Yang and Hao</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Breast cancer is one of the most common malignancies among women globally, with its incidence rate continuously increasing, posing a serious threat to women&#x2019;s health. Although current treatments, such as drugs targeting estrogen receptor alpha (ER&#x3b1;), have extended patient survival, issues such as drug resistance and severe side effects remain widespread. This study proposes a machine learning-based optimization model for anti-breast cancer candidate drugs, aimed at enhancing biological activity and optimizing ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties through multi-objective optimization. Initially, grey relational analysis and Spearman correlation analysis were performed on the molecular descriptors of 1,974 compounds, identifying 91 key descriptors. A Random Forest model combined with Shapley Additive Explanations (SHAP) values was then used to further select the top 20 descriptors with the greatest impact on biological activity. The constructed Quantitative Structure-Activity Relationship (QSAR) model, using algorithms such as LightGBM, Random Forest, and XGBoost, achieved an R<sup>2</sup> value of 0.743 for biological activity prediction, demonstrating strong predictive performance. Additionally, a multi-model fusion strategy and Particle Swarm Optimization (PSO) algorithm were employed to optimize both biological activity and ADMET properties, thereby improving the prediction of Caco-2, CYP3A4, hERG, HOB, and MN properties. For example, the best model for predicting Caco-2 achieved an F1 score of 0.8905, while the model for predicting CYP3A4 reached an F1 score of 0.9733. This multi-objective optimization model provides a novel and efficient tool for drug development, offering significant improvements in both biological activity and pharmacokinetic properties, with practical implications for the optimization of future anti-breast cancer drugs.</p>
</abstract>
<kwd-group>
<kwd>breast cancer</kwd>
<kwd>machine learning</kwd>
<kwd>quantitative structure-activity relationship (QSAR) models</kwd>
<kwd>particle swarm optimization (PSO)</kwd>
<kwd>ADMET properties</kwd>
<kwd>drug screening</kwd>
<kwd>biological activity</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Computational Genomics</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Over two million women are diagnosed with breast cancer each year, and some of these patients progress to advanced stages, urgently requiring effective treatments (<xref ref-type="bibr" rid="B34">Sung et al., 2021</xref>; <xref ref-type="bibr" rid="B36">Waks and Winer, 2019</xref>). While existing treatment options have extended survival, issues such as drug resistance and side effects persist (<xref ref-type="bibr" rid="B11">Giaquinto et al., 2022</xref>; <xref ref-type="bibr" rid="B22">Lumachi et al., 2011</xref>), creating a pressing need for the development of new anti-breast cancer drugs, particularly those targeting estrogen receptor alpha (ER&#x3b1;) and optimizing ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties (<xref ref-type="bibr" rid="B26">Marra et al., 2020</xref>). With the rapid advancements in computer science and technology, machine learning has provided new solutions for drug design and optimization (<xref ref-type="bibr" rid="B25">Mak et al., 2023</xref>; <xref ref-type="bibr" rid="B40">Zhavoronkov et al., 2019</xref>). By constructing Quantitative Structure-Activity Relationship (QSAR) models based on compound structural features and biological activity data (<xref ref-type="bibr" rid="B7">Cherkasov et al., 2014</xref>), and integrating various machine learning algorithms, it is possible to efficiently predict the biological activity and ADMET properties of new compounds, reducing the time and cost of drug development (<xref ref-type="bibr" rid="B14">Jim&#xe9; et al., 2020</xref>). 
Furthermore, optimization algorithms such as Particle Swarm Optimization (PSO) have shown excellent performance in multi-objective optimization tasks (<xref ref-type="bibr" rid="B21">Liu et al., 2021</xref>; <xref ref-type="bibr" rid="B20">Liu et al., 2024</xref>; <xref ref-type="bibr" rid="B28">Poli et al., 2007</xref>), enhancing both the biological activity and ADMET properties of compounds, thus providing powerful tools for drug screening and optimization.</p>
<p>Based on this background, the present study proposes a machine learning-based optimization model for anti-breast cancer candidate drugs. By integrating QSAR models, multi-model fusion techniques, and the PSO algorithm, this study aims to achieve multi-objective optimization of anti-breast cancer compounds, enhancing their biological activity against ER&#x3b1; while ensuring excellent ADMET properties. The experimental procedure consists of the following four phases:</p>
<p>Phase 1: The data are preprocessed by removing 225 features whose values are all zero and normalizing the remaining features. Grey relational analysis is then performed to select the 200 molecular descriptors most related to biological activity, followed by Spearman correlation analysis, which retains 91 features. Finally, Random Forest combined with SHAP value analysis is used to select the top 20 molecular descriptors with the most significant impact on biological activity (<xref ref-type="table" rid="T2">Table 2</xref>).</p>
<p>Phase 2: Using pIC50 (the negative logarithm of the IC50 value) as the target variable, 10 regression models are trained on the 20 selected features. Comparing their evaluation metrics identifies LightGBM, RandomForest, and XGBoost as the best performers. To further improve prediction accuracy, these three models are combined using three ensemble methods: simple averaging, weighted averaging, and stacking. Finally, the stacking ensemble model is used to predict the pIC50 values for the 50 target compounds and to calculate their corresponding IC50 (half-maximal inhibitory concentration) values, with the final results recorded in &#x201c;ER&#x3b1;_activity_test.csv.&#x201d;</p>
<p>Phase 3: After the 225 all-zero features are removed in Phase 1, Random Forest is used for recursive feature elimination (RFE) on the remaining 504 features. This selects 25 important features for each of the five ADMET properties: Caco-2, CYP3A4, hERG, HOB, and MN. Using these selected features, 11 machine learning classification models are constructed. By comparing evaluation metrics such as the F1 score and the ROC curve, the best models are identified as LightGBM for Caco-2, XGBoost for CYP3A4 and hERG, NaiveBayes for HOB, and XGBoost for MN. Finally, the selected models are used to predict the classification results for the 50 target compounds on Caco-2, CYP3A4, hERG, HOB, and MN, with the final results recorded in &#x201c;ADMET_test.csv.&#x201d;</p>
<p>Phase 4: First, a single-objective optimization model is constructed to maximize the inhibitory biological activity against ER&#x3b1; (Estrogen Receptor Alpha) subject to the constraint that at least three of the five ADMET properties are favorable. A total of 106 feature variables with high correlation to biological activity and ADMET properties are selected from Phases 2 and 3, and regression and classification models built on these features form the single-objective optimization model. Finally, a Particle Swarm Optimization (PSO) algorithm is used for the multi-objective optimization search. Over multiple iterations, the best solution from each iteration is recorded, and the search gradually converges to the optimal value range. The final results are recorded in &#x201c;results.csv.&#x201d;</p>
</sec>
<sec id="s2">
<title>2 Related work</title>
<p>Breast cancer is one of the most common malignant tumors among women worldwide. Although current treatments such as surgery, radiotherapy, chemotherapy, and endocrine therapy have extended patient survival, these methods still have limitations due to the heterogeneity, drug resistance, and severe side effects associated with breast cancer (<xref ref-type="bibr" rid="B12">Hong and Xu, 2022</xref>; <xref ref-type="bibr" rid="B3">Belachew and Sewasew, 2021</xref>). Endocrine therapies targeting estrogen receptor alpha (ER&#x3b1;), such as tamoxifen and letrozole, have played a key role in treating ER&#x3b1;-positive breast cancer. However, as treatment progresses, these therapies increasingly face drug resistance, limiting their clinical application (<xref ref-type="bibr" rid="B26">Marra et al., 2020</xref>). Additionally, these drugs are associated with side effects such as cardiotoxicity and hepatotoxicity, creating an urgent need to develop new candidate drugs that not only address biological activity but also optimize ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties (<xref ref-type="bibr" rid="B4">Caron and Nohria, 2018</xref>; <xref ref-type="bibr" rid="B17">Larroquette et al., 1986</xref>; <xref ref-type="bibr" rid="B39">Xu et al., 2015</xref>).</p>
<p>Recent advances in computer science and artificial intelligence have opened new avenues for drug design and optimization, offering substantial potential for overcoming these limitations (<xref ref-type="bibr" rid="B29">Rodrigues and Schneider, 2022</xref>; <xref ref-type="bibr" rid="B32">Stokes et al., 2020b</xref>; <xref ref-type="bibr" rid="B30">Schneider et al., 2020</xref>). Specifically, machine learning (<xref ref-type="bibr" rid="B41">Zitnik et al., 2019</xref>; <xref ref-type="bibr" rid="B35">Vamathevan et al., 2019</xref>) has proven to be a powerful tool for predicting the biological activity and ADMET properties of novel compounds, leveraging vast amounts of molecular descriptors and biological activity data (<xref ref-type="bibr" rid="B25">Mak et al., 2023</xref>; <xref ref-type="bibr" rid="B40">Zhavoronkov et al., 2019</xref>; <xref ref-type="bibr" rid="B32">Stokes et al., 2020a</xref>; <xref ref-type="bibr" rid="B15">Jim&#xe9; et al., 2021</xref>). Traditional Quantitative Structure-Activity Relationship (QSAR) models, which correlate the physicochemical properties of compounds with their biological activity, have long been the cornerstone of drug development (<xref ref-type="bibr" rid="B7">Cherkasov et al., 2014</xref>; <xref ref-type="bibr" rid="B14">Jim&#xe9; et al., 2020</xref>; <xref ref-type="bibr" rid="B39">Xu et al., 2015</xref>). However, these models often struggle to handle the complex nonlinear relationships between molecular features, limiting their ability to provide accurate predictions (<xref ref-type="bibr" rid="B5">Chen et al., 2020</xref>). To address this, recent studies have increasingly relied on multi-model fusion techniques, which combine the advantages of multiple models to improve prediction accuracy and stability (<xref ref-type="bibr" rid="B19">Lin et al., 2022</xref>; <xref ref-type="bibr" rid="B6">Chen and Guestrin, 2016</xref>). 
For instance, gradient boosting models such as LightGBM and XGBoost are particularly adept at handling high-dimensional data and capturing complex nonlinear relationships, making them widely used in predicting biological activity and ADMET properties (<xref ref-type="bibr" rid="B31">Shou, 2020</xref>; <xref ref-type="bibr" rid="B18">Lei et al., 2016</xref>).</p>
<p>The success of drug development depends not only on the biological activity of the drug but also on its ADMET properties. Favorable ADMET properties are crucial for the successful conversion of a candidate compound into an effective drug (<xref ref-type="bibr" rid="B9">Er-rajy et al., 2022</xref>; <xref ref-type="bibr" rid="B1">Ahmad et al., 2023</xref>). Some studies have utilized machine learning algorithms for classification and regression predictions of ADMET properties, achieving significant success in predicting permeability, metabolism, toxicity, and other pharmacokinetic attributes (<xref ref-type="bibr" rid="B2">Atallah et al., 2013</xref>; <xref ref-type="bibr" rid="B16">Komura et al., 2023</xref>). Algorithms such as Support Vector Machines (SVM), Random Forest, and XGBoost have been effective in screening compounds with favorable ADMET properties, reducing experimental costs and minimizing the risk of failure (<xref ref-type="bibr" rid="B10">Ferreira and Andricopulo, 2019</xref>; <xref ref-type="bibr" rid="B13">Huang et al., 2021</xref>).</p>
<p>However, optimizing multiple objectives simultaneously, such as enhancing biological activity and improving ADMET properties, remains a significant challenge in drug development (<xref ref-type="bibr" rid="B23">Luukkonen et al., 2023a</xref>). Traditional optimization methods struggle to effectively manage the trade-offs between these competing objectives (<xref ref-type="bibr" rid="B8">Deb et al., 2002</xref>). Particle Swarm Optimization (PSO), a swarm intelligence optimization technique that simulates cooperative search behavior within a population, has become a powerful tool for multi-objective optimization tasks, including drug design (<xref ref-type="bibr" rid="B28">Poli et al., 2007</xref>; <xref ref-type="bibr" rid="B23">Luukkonen et al., 2023b</xref>; <xref ref-type="bibr" rid="B37">Wang et al., 2018</xref>). PSO has been effectively applied to simultaneously optimize biological activity and ADMET properties, achieving the global optimal selection of candidate drugs and balancing these key attributes (<xref ref-type="bibr" rid="B27">Merk et al., 2018</xref>).</p>
<p>Building on these advances, this study integrates machine learning models with optimization algorithms such as PSO to achieve multi-objective drug design. For example, integrating PSO with QSAR models has successfully enabled multi-objective optimization of both biological activity and ADMET properties in drug design. Additionally, multi-model fusion strategies have been employed to further improve predictive performance, combining different machine learning algorithms to reduce the bias of individual models and enhance overall prediction accuracy. These efforts have significantly advanced the development of drug optimization methods and tools.</p>
<p>Based on previous work, this study proposes a novel machine learning-based optimization model for anti-breast cancer drugs. By combining QSAR models, multi-model fusion techniques, and the PSO algorithm, this study aims to simultaneously optimize the biological activity and ADMET properties of candidate compounds. Specifically, it enhances biological activity against ER&#x3b1; while ensuring optimal ADMET performance. This method not only provides an efficient and reliable tool for the development of anti-breast cancer drugs but also lays the foundation for future drug optimization research.</p>
</sec>
<sec id="s3">
<title>3 Dataset description</title>
<sec id="s3-1">
<title>3.1 Dataset source</title>
<p>The core dataset used in this study is the &#x201c;Anti-Breast Cancer Candidate Drug Optimization Modeling (2021)&#x201d; dataset provided by the China Association for Science and Technology. This dataset is primarily focused on the biological activity prediction and ADMET property analysis targeting the breast cancer marker ER&#x3b1;, providing key data support for the machine learning modeling conducted in this study.</p>
</sec>
<sec id="s3-2">
<title>3.2 Dataset description</title>
<sec id="s3-2-1">
<title>3.2.1 ER&#x3b1; activity dataset (ER&#x3b1;_activity.xlsx)</title>
<p>Training Set (training table): Contains biological activity data for 1,974 compounds.<list list-type="simple">
<list-item>
<p>SMILES Format: The first column records the SMILES (Simplified Molecular Input Line Entry System) representation of each compound, which describes its structure.</p>
</list-item>
<list-item>
<p>IC50 Values: The second column lists the biological activity values against the ER&#x3b1; target in nanomolar units (nM). Lower IC50 values indicate higher biological activity.</p>
</list-item>
<list-item>
<p>pIC50 Values: The third column records the negative logarithm of the IC50 values (pIC50), facilitating a more intuitive representation of the compounds&#x2019; biological activity; higher pIC50 values indicate stronger biological activity.</p>
</list-item>
</list>
</p>
<p>Test Set (test table): Contains the SMILES representation for 50 compounds, used for model prediction testing.</p>
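<p>The relationship between the IC50 and pIC50 columns can be illustrated with a short sketch. It assumes the conventional definition pIC50 = &#x2212;log<sub>10</sub>(IC50 in mol/L), so that an IC50 expressed in nM gives pIC50 = 9 &#x2212; log<sub>10</sub>(IC50). This convention and the function names are assumptions made for illustration and should be checked against the values in the dataset.</p>
<preformat><![CDATA[
```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nM to pIC50, assuming pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, hence the constant 9
    return 9.0 - math.log10(ic50_nm)


def pic50_to_ic50_nm(pic50):
    """Inverse conversion: pIC50 back to an IC50 in nM (same assumption)."""
    return 10.0 ** (9.0 - pic50)


# A compound with IC50 = 1000 nM (1 uM); a lower IC50 gives a higher pIC50
print(ic50_nm_to_pic50(1000.0))
print(pic50_to_ic50_nm(ic50_nm_to_pic50(2.5)))
```
]]></preformat>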
</sec>
<sec id="s3-2-2">
<title>3.2.2 Molecular descriptor dataset (Molecular_Descriptor.xlsx)</title>
<p>Training Set (training table): Includes 729 molecular descriptors for 1,974 compounds, describing each compound&#x2019;s structure and its physicochemical properties.<list list-type="simple">
<list-item>
<p>SMILES Format: The first column contains the SMILES representation of the compounds, consistent with those in the ER&#x3b1;_activity.xlsx.</p>
</list-item>
<list-item>
<p>Molecular Descriptors: The subsequent 729 columns cover various molecular descriptors for each compound, including molecular weight, number of hydrogen bond donors, and hydrophobicity parameters (such as LogP), detailing their physicochemical characteristics and topological structure.</p>
</list-item>
</list>
</p>
<p>Test Set (test table): Contains the molecular descriptors for 50 compounds, used for model testing and evaluation.</p>
</sec>
<sec id="s3-2-3">
<title>3.2.3 ADMET properties dataset (ADMET.xlsx)</title>
<p>Training Set (training table): Includes data on five ADMET properties for 1,974 compounds, all represented in a binary format.<list list-type="simple">
<list-item>
<p>Caco-2: Indicates the intestinal epithelial cell permeability of the compounds; 1 for good permeability, 0 for poor permeability.</p>
</list-item>
<list-item>
<p>CYP3A4: Indicates whether the compound can be metabolized by CYP3A4; 1 for metabolizable, 0 for non-metabolizable.</p>
</list-item>
<list-item>
<p>hERG: Indicates whether the compound has cardiotoxicity; 1 for toxic, 0 for non-toxic.</p>
</list-item>
<list-item>
<p>HOB: Indicates the oral bioavailability of the compound; 1 for good bioavailability, 0 for poor.</p>
</list-item>
<list-item>
<p>MN: Indicates whether the compound has mutagenicity; 1 for toxic, 0 for non-toxic.</p>
</list-item>
</list>
</p>
<p>Test Set (test table): Contains the SMILES representation for 50 compounds, used for model prediction and validation.</p>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4 Experimental method and the solution results</title>
<sec id="s4-1">
<title>4.1 Experimental design</title>
<p>This research consists of four main experimental steps: selecting important molecular descriptors, predicting the biological activity of compounds, classifying ADMET properties, and performing multi-objective optimization.</p>
<sec id="s4-1-1">
<title>4.1.1 Feature selection and preprocessing</title>
<p>
<list list-type="simple">
<list-item>
<p>1. Feature Cleaning: Remove 225 molecular descriptors where all observations are zero to avoid redundancy and reduce the risk of overfitting.</p>
</list-item>
<list-item>
<p>2. Feature Normalization: Perform min-max normalization on the remaining 504 molecular descriptors so that all features share the same scale and differences in units or magnitude do not distort model training.</p>
</list-item>
<list-item>
<p>3. Grey Relational Analysis (GRA): Evaluate the correlation between pIC50 values and molecular descriptors using grey relational analysis, selecting the top 200 descriptors most relevant to biological activity.</p>
</list-item>
<list-item>
<p>4. Spearman Correlation Analysis: To further reduce feature redundancy, Spearman correlation analysis is used to process highly correlated features, retaining 91 key features to enhance model efficiency and accuracy.</p>
</list-item>
<list-item>
<p>5. Random Forest and SHAP Values: Further select 20 features with the greatest impact on biological activity.</p>
</list-item>
</list>
</p>
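<p>The redundancy-reduction step based on Spearman correlation might be sketched as follows. The 0.9 cut-off, the greedy keep-first rule, and all function names are assumptions made for illustration; the text does not state the threshold or the rule actually used to choose among highly correlated descriptors.</p>
<preformat><![CDATA[
```python
import numpy as np


def average_ranks(values):
    """Rank a 1-D array (1-based), giving tied values their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)      # last rank occupied by each distinct value
    starts = ends - counts + 1    # first rank occupied by each distinct value
    return ((starts + ends) / 2.0)[inverse]


def spearman_matrix(X):
    """Spearman correlation between columns = Pearson correlation of their ranks."""
    ranks = np.column_stack([average_ranks(X[:, j]) for j in range(X.shape[1])])
    return np.corrcoef(ranks, rowvar=False)


def drop_redundant(X, priority, threshold=0.9):
    """Visit columns in priority order (e.g. by relevance to pIC50) and keep a
    column only if |rho| with every column kept so far is below the threshold.
    The threshold value is hypothetical. Constant columns are assumed to have
    been removed already, since their rank correlation is undefined."""
    rho = np.abs(spearman_matrix(X))
    kept = []
    for j in priority:
        if all(rho[j, k] < threshold for k in kept):
            kept.append(j)
    return kept


# Synthetic data: column 1 is a monotone transform of column 0, so it is redundant
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, np.exp(a), b])
print(drop_redundant(X, priority=[0, 1, 2]))
```
]]></preformat>
<p>Because Spearman correlation works on ranks, it flags monotone but nonlinear dependence between descriptors that a Pearson coefficient could understate.</p>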
</sec>
<sec id="s4-1-2">
<title>4.1.2 Construction of biological activity prediction model</title>
<p>
<list list-type="simple">
<list-item>
<p>1. Regression Model Selection: We utilize ten common machine learning regression models, including Linear Regression, Ridge, Lasso, ElasticNet, RandomForest, LightGBM, XGBoost, Gradient Boosting Decision Tree (GBDT), SVM, and Decision Tree.</p>
</list-item>
<list-item>
<p>2. Multi-Model Fusion: To improve predictive performance, three fusion strategies (simple averaging, weighted averaging, and stacking) are applied to the three best-performing models (LightGBM, RandomForest, and XGBoost). Stacking gives the best results.</p>
</list-item>
<list-item>
<p>3. Prediction Results: Use the best model to predict the pIC50 values for 50 test set compounds and convert them to IC50 values.</p>
</list-item>
</list>
</p>
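<p>The three fusion strategies can be illustrated on toy predictions. In this sketch the inverse-error weighting scheme and the linear least-squares meta-learner are assumptions chosen for brevity, and the numbers are invented stand-ins for the outputs of the three base models; the meta-learner actually used for stacking is not specified in this section.</p>
<preformat><![CDATA[
```python
import numpy as np


def simple_average(preds):
    """preds has shape (n_models, n_samples)."""
    return preds.mean(axis=0)


def weighted_average(preds, val_errors):
    """Weight each model by the inverse of its validation error (one possible scheme)."""
    w = 1.0 / np.asarray(val_errors, dtype=float)
    w /= w.sum()
    return w @ preds


def fit_linear_stacker(base_preds, y):
    """Least-squares meta-learner on base-model predictions, with an intercept.
    In practice base_preds should be out-of-fold predictions to avoid leakage."""
    A = np.column_stack([base_preds.T, np.ones(base_preds.shape[1])])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def apply_stacker(coef, base_preds):
    A = np.column_stack([base_preds.T, np.ones(base_preds.shape[1])])
    return A @ coef


# Invented pIC50 values and base-model predictions, for illustration only
y_true = np.array([6.0, 7.0, 8.0, 5.5])
preds = np.array([
    [6.2, 6.9, 7.7, 5.6],
    [5.8, 7.2, 8.1, 5.3],
    [6.1, 7.1, 8.3, 5.7],
])
val_mse = ((preds - y_true) ** 2).mean(axis=1)

print(simple_average(preds))
print(weighted_average(preds, val_mse))
coef = fit_linear_stacker(preds, y_true)
print(apply_stacker(coef, preds))
```
]]></preformat>
<p>Simple averaging is a special case of the linear stacker (equal coefficients and zero intercept), so on the data used to fit it the stacker can never have a larger squared error than the simple average; whether that advantage carries over to unseen compounds has to be checked on held-out data.</p>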
</sec>
<sec id="s4-1-3">
<title>4.1.3 Classification prediction of ADMET properties</title>
<p>
<list list-type="simple">
<list-item>
<p>1. Recursive Feature Elimination (RFE): Using RandomForest as the base model, the RFE method is applied to select features for the ADMET properties, choosing the 25 most representative molecular descriptors for each ADMET attribute.</p>
</list-item>
<list-item>
<p>2. Classification Model Selection: Utilize 11 classification models, including Logistic Regression, Naive Bayes, LDA, Decision Tree, RandomForest, AdaBoost, GradientBoosting, SVM, MLP, XGBoost, and LightGBM, to predict the ADMET properties of compounds.</p>
</list-item>
<list-item>
<p>3. Classification Effectiveness Assessment: Evaluate model performance using metrics such as the F1 score and ROC curve, and select the best models. The best classification models for different ADMET properties are LightGBM (Caco-2), XGBoost (CYP3A4 and hERG), NaiveBayes (HOB), and XGBoost (MN).</p>
</list-item>
<list-item>
<p>4. ADMET Property Prediction: Use the selected best models to predict the ADMET properties of 50 compounds.</p>
</list-item>
</list>
</p>
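<p>For reference, the F1 score used to compare the classifiers is the harmonic mean of precision and recall. The following minimal sketch computes it for binary labels; the function name and the example labels are illustrative and are not taken from the study.</p>
<preformat><![CDATA[
```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), report 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels for one ADMET property (1 and 0 as defined in ADMET.xlsx)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(f1_score_binary(y_true, y_pred))
```
]]></preformat>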
</sec>
<sec id="s4-1-4">
<title>4.1.4 Multi-objective optimization</title>
<p>
<list list-type="simple">
<list-item>
<p>1. Single-Objective Optimization: Establish a single-objective optimization model aiming to enhance the biological activity (pIC50 value) of compounds while ensuring that at least three ADMET properties perform well.</p>
</list-item>
<list-item>
<p>2. Particle Swarm Optimization (PSO): Apply the PSO algorithm for global optimization of 106 important features, recording the optimal solution in each iteration, and ultimately finding the value range that provides the best performance in both biological activity and ADMET properties.</p>
</list-item>
<list-item>
<p>3. Final Results: Apply the optimized compound features to 50 test compounds, outputting their optimal predictive values.</p>
</list-item>
</list>
</p>
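<p>The search described above can be sketched with a basic global-best PSO in which the ADMET requirement is handled as a penalty. Everything specific in this sketch is an assumption made for illustration: the inertia and acceleration coefficients, the penalty size, the five-dimensional search space, and the toy functions that stand in for the trained regression and classification models.</p>
<preformat><![CDATA[
```python
import numpy as np


def pso_maximize(objective, dim, n_particles=30, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Basic global-best PSO on the unit hypercube [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    pos = rng.random((n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest_pos = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    g = int(np.argmax(pbest_val))
    gbest_pos, gbest_val = pbest_pos[g].copy(), float(pbest_val[g])
    history = [gbest_val]                      # best value after each iteration
    for _ in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Inertia + pull toward personal best + pull toward global best
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)     # stay inside the normalized range
        vals = np.array([objective(p) for p in pos])
        improved = vals > pbest_val
        pbest_pos[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = int(np.argmax(pbest_val))
        if pbest_val[g] > gbest_val:
            gbest_pos, gbest_val = pbest_pos[g].copy(), float(pbest_val[g])
        history.append(gbest_val)
    return gbest_pos, gbest_val, history


def toy_activity(x):
    """Hypothetical stand-in for the pIC50 regression model."""
    return 8.0 - float(np.sum((x - 0.6) ** 2))


def toy_admet_flags(x):
    """Hypothetical stand-in for the five classifiers; 1 means a favorable outcome."""
    return [int(x[i] > 0.5) for i in range(5)]


def penalized_objective(x):
    """Activity, heavily penalized unless at least three ADMET flags are favorable."""
    feasible = sum(toy_admet_flags(x)) >= 3
    return toy_activity(x) if feasible else toy_activity(x) - 10.0


best_x, best_val, history = pso_maximize(penalized_objective, dim=5)
print(best_val, toy_admet_flags(best_x))
```
]]></preformat>
<p>Recording the best value after every iteration, as in <italic>history</italic>, mirrors the convergence tracking described above. Note that for hERG and MN a favorable outcome corresponds to a label of 0 in the dataset, so a real implementation would need to map each classifier output to a favorable or unfavorable flag before counting.</p>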
</sec>
</sec>
<sec id="s4-2">
<title>4.2 Selection of molecular descriptors</title>
<sec id="s4-2-1">
<title>4.2.1 Data preprocessing and feature selection</title>
<sec id="s4-2-1-1">
<title>4.2.1.1 Data preprocessing</title>
<p>Basic statistical analysis is performed on the data provided in the &#x201c;Molecular_Descriptor.xlsx&#x201d; file. Some of the statistical results are shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Statistical information for selected molecular descriptors.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center"/>
<th align="center">nAtom</th>
<th align="center">nHeavyAtom</th>
<th align="center">nH</th>
<th align="center">nB</th>
<th align="center">nC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Count</td>
<td align="center">1974</td>
<td align="center">1974</td>
<td align="center">1974</td>
<td align="center">1974</td>
<td align="center">1974</td>
</tr>
<tr>
<td align="center">Mean</td>
<td align="center">50.76</td>
<td align="center">28.11</td>
<td align="center">22.65</td>
<td align="center">0</td>
<td align="center">22.61</td>
</tr>
<tr>
<td align="center">Std</td>
<td align="center">18.09</td>
<td align="center">8.07</td>
<td align="center">10.78</td>
<td align="center">0</td>
<td align="center">6.63</td>
</tr>
<tr>
<td align="center">Min</td>
<td align="center">21</td>
<td align="center">14</td>
<td align="center">5</td>
<td align="center">0</td>
<td align="center">7</td>
</tr>
<tr>
<td align="center">25%</td>
<td align="center">36.25</td>
<td align="center">21</td>
<td align="center">14</td>
<td align="center">0</td>
<td align="center">17</td>
</tr>
<tr>
<td align="center">50%</td>
<td align="center">50</td>
<td align="center">28</td>
<td align="center">22</td>
<td align="center">0</td>
<td align="center">22</td>
</tr>
<tr>
<td align="center">75%</td>
<td align="center">62</td>
<td align="center">34</td>
<td align="center">29</td>
<td align="center">0</td>
<td align="center">28</td>
</tr>
<tr>
<td align="center">Max</td>
<td align="center">343</td>
<td align="center">163</td>
<td align="center">180</td>
<td align="center">0</td>
<td align="center">95</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>As observed, the values of the molecular descriptor nB are all zero. Although a zero value can have practical significance, a descriptor that takes the same value for every compound has no variance and therefore gives a prediction model nothing with which to distinguish compounds. Such descriptors are treated as redundant features that can degrade model accuracy, so they are removed; in total, 225 molecular descriptors are eliminated.</p>
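<p>A minimal sketch of this cleaning step is given below. It uses a small invented array in place of the descriptor table, and the function name is hypothetical.</p>
<preformat><![CDATA[
```python
import numpy as np


def drop_all_zero_columns(X, names):
    """Keep only the columns that contain at least one non-zero value."""
    keep = np.any(X != 0, axis=0)
    kept_names = [name for name, flag in zip(names, keep) if flag]
    return X[:, keep], kept_names


# Three invented compounds described by three descriptors; nB is zero throughout
X = np.array([
    [50.0, 0.0, 22.0],
    [36.0, 0.0, 17.0],
    [62.0, 0.0, 28.0],
])
names = ["nAtom", "nB", "nC"]

X_clean, kept = drop_all_zero_columns(X, names)
print(kept, X_clean.shape)
```
]]></preformat>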
<p>To eliminate the effect of differing units and value ranges, the remaining features are min-max normalized. The normalization formula is shown in <xref ref-type="disp-formula" rid="e1">Equation 1</xref>.<disp-formula id="e1">
<mml:math id="m1">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>min</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>max</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>min</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>Where <inline-formula id="inf1">
<mml:math id="m2">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the result after normalization, <inline-formula id="inf2">
<mml:math id="m3">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the value in the original data table, <inline-formula id="inf3">
<mml:math id="m4">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>max</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the maximum value of a certain molecular descriptor in the original data table, and <inline-formula id="inf4">
<mml:math id="m5">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>min</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the minimum value of that molecular descriptor in the original data table.</p>
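<p>The preprocessing described above can be illustrated with a minimal sketch. It is not the authors&#x2019; code: the function names, the toy table, and the choice to map a constant column to 0 are assumptions made only for illustration.</p>
<preformat><![CDATA[
```python
from typing import List

def drop_all_zero_columns(rows: List[List[float]]) -> List[List[float]]:
    """Remove descriptors (columns) whose value is zero for every compound."""
    keep = [j for j in range(len(rows[0])) if any(row[j] != 0 for row in rows)]
    return [[row[j] for j in keep] for row in rows]

def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every column to [0, 1] following Equation 1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0  # guard constant columns
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]

# Toy table (made-up values): 3 compounds x 3 descriptors; the middle one is all zeros.
table = [[36.0, 0.0, 14.0], [50.0, 0.0, 22.0], [62.0, 0.0, 28.0]]
filtered = drop_all_zero_columns(table)
normalized = min_max_normalize(filtered)
```
]]></preformat>
<p>In this sketch the minimum and maximum are taken over the whole table. In a modeling pipeline they would more commonly be computed on the training data only and then reused for the other subsets.</p>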
</sec>
<sec id="s4-2-1-2">
<title>4.2.1.2 Grey relational analysis (GRA)</title>
<p>Grey relational analysis is used to identify the primary and secondary factors among the many that influence the development of a system. Its fundamental idea is to judge how closely sequences are related by the similarity of the geometric shapes of their curves: the closer the curves, the greater the degree of association between the corresponding sequences, and <italic>vice versa</italic>. Consider the reference sequence (biological activity) as <inline-formula id="inf5">
<mml:math id="m6">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and the compared sequences (influencing factors) as <inline-formula id="inf6">
<mml:math id="m7">
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#xb7;</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>&#xb7;</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>&#xb7;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi mathvariant="normal">m</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</inline-formula>. The steps for calculating the grey relational analysis are as follows:<list list-type="simple">
<list-item>
<p>1. Calculate the grey relational coefficient between each element of the compared sequences and the corresponding element of the reference sequence. The grey relational coefficient, which represents the extent of association between biological activity and each influencing factor, is defined in <xref ref-type="disp-formula" rid="e2">Equation 2</xref>.</p>
</list-item>
</list>
<disp-formula id="e2">
<mml:math id="m8">
<mml:mrow>
<mml:mi>&#x3be;</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x2001;</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf7">
<mml:math id="m9">
<mml:mrow>
<mml:mi mathvariant="normal">a</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the two-level minimum absolute difference, <inline-formula id="inf8">
<mml:math id="m10">
<mml:mrow>
<mml:mi mathvariant="normal">b</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the two-level maximum absolute difference, and <inline-formula id="inf9">
<mml:math id="m11">
<mml:mrow>
<mml:mi mathvariant="normal">&#x3b1;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the resolution coefficient (typically set to 0.5).<disp-formula id="equ1">
<mml:math id="m12">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mi>min</mml:mi>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:munder>
<mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="equ2">
<mml:math id="m13">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mi>max</mml:mi>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:munder>
<mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>max</mml:mi>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<list list-type="simple">
<list-item>
<p>2. Calculate the grey relational degree. Define <inline-formula id="inf10">
<mml:math id="m14">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> as the grey relational degree, obtained by calculating the mean of each column in the correlation coefficient matrix, as shown in <xref ref-type="disp-formula" rid="e3">Equation 3</xref>.</p>
</list-item>
</list>
<disp-formula id="e3">
<mml:math id="m15">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:mi>&#x3be;</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>
</p>
<p>Next, we calculate the grey relational degree between each molecular descriptor and biological activity and retain the 200 molecular descriptors with the highest values. For readability, <xref ref-type="fig" rid="F1">Figure 1</xref> displays only the top 30 of these.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>The top 30 molecular descriptors with the highest grey relational degree.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g001.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F1">Figure 1</xref> shows the top 30 molecular descriptors most strongly correlated with biological activity, selected through GRA. These molecular descriptors are ranked based on their grey relational degree with the pIC50 values (biological activity prediction values). The higher the grey relational degree, the stronger the correlation between the molecular descriptor and biological activity.</p>
<p>The molecular descriptors are sorted in descending order of grey relational degree, starting from the top. Each row represents a molecular descriptor, with the horizontal axis indicating its grey relational degree, ranging from 0 to 0.8. Descriptors such as MDEC-23, LipoaffinityIndex, MLogP, and nRing are displayed, all of which are used in subsequent models to predict molecular activity.</p>
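<p>Equations 2 and 3 can be sketched in a few lines. The following is an illustrative implementation under stated assumptions, not the authors&#x2019; code: the function name and the toy sequences are hypothetical, and all sequences are assumed to be normalized already.</p>
<preformat><![CDATA[
```python
from typing import List, Sequence

def grey_relational_degrees(
    reference: Sequence[float],
    factors: Sequence[Sequence[float]],
    alpha: float = 0.5,
) -> List[float]:
    """Return r(X0, Xi) for every compared sequence Xi (Equations 2 and 3)."""
    # Absolute differences |x0(k) - xi(k)| for every factor i and sample k.
    diffs = [[abs(r - x) for r, x in zip(reference, seq)] for seq in factors]
    a = min(min(row) for row in diffs)  # two-level minimum difference
    b = max(max(row) for row in diffs)  # two-level maximum difference
    if b == 0:
        return [1.0 for _ in factors]  # every sequence equals the reference
    degrees = []
    for row in diffs:
        coefficients = [(a + alpha * b) / (d + alpha * b) for d in row]
        degrees.append(sum(coefficients) / len(coefficients))
    return degrees

# Toy data (made-up values): the first factor matches the reference exactly,
# the second runs in the opposite direction.
reference = [0.0, 0.5, 1.0]
factors = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]]
degrees = grey_relational_degrees(reference, factors)

# Rank factors by descending degree and keep the top k (k = 200 in the text).
top_k = sorted(range(len(degrees)), key=lambda i: degrees[i], reverse=True)[:1]
```
]]></preformat>
<p>With these toy sequences the identical factor receives a degree of 1, while the opposed factor receives a lower degree, so it ranks second.</p>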
</sec>
<sec id="s4-2-1-3">
<title>4.2.1.3 Analysis of correlations between influencing factors</title>
<p>The Pearson correlation coefficient assumes normally distributed data and can only capture linear relationships between variables. However, the descriptors obtained here also exhibit complex nonlinear relationships. Therefore, the Spearman coefficient is chosen for the analysis, as shown in <xref ref-type="disp-formula" rid="e4">Equation 4</xref>:
<disp-formula id="e4">
<mml:math id="m16">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
</mml:mstyle>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mover accent="true">
<mml:mi>x</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf11">
<mml:math id="m17">
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf12">
<mml:math id="m18">
<mml:mrow>
<mml:mi mathvariant="normal">y</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are the rank-transformed values of the two variables being analyzed, and <inline-formula id="inf13">
<mml:math id="m19">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf14">
<mml:math id="m20">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">y</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> are the mean values of the two variables. The Spearman correlation coefficient measures the monotonic relationship between two variables, with values ranging from &#x2212;1 to &#x2b;1. Positive values indicate a positive correlation between the variables, negative values indicate a negative correlation, and values close to 0 indicate a weaker correlation. By calculating the Spearman coefficients between the 200 molecular descriptors, we obtained the heatmap shown in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Spearman correlation coefficient heatmap between features.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g002.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F2">Figure 2</xref> displays a heatmap of the Spearman correlation coefficient matrix for the 200 retained molecular descriptors. Color intensity represents the magnitude of the coefficient: dark red indicates a strong positive correlation, dark blue a strong negative correlation, and lighter colors weaker correlations. Highly correlated variables (coefficient greater than 0.85) were then filtered out, which removed 109 molecular descriptors and left 91.</p>
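<p>A correlation filter of this kind can be sketched as follows. The text does not state which member of a correlated pair is dropped or whether the absolute coefficient is used, so the greedy keep-first rule, the use of the absolute value, and all names below are assumptions for illustration only.</p>
<preformat><![CDATA[
```python
from math import sqrt
from typing import Dict, List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Equation 4 applied to the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    numerator = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    denominator = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return numerator / denominator if denominator else 0.0

def filter_correlated(features: Dict[str, Sequence[float]], threshold: float = 0.85) -> List[str]:
    """Keep a feature only if |rho| with every already kept feature is at most threshold."""
    kept: List[str] = []
    for name, values in features.items():
        if all(abs(spearman(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy descriptors (made-up values): B is a monotonic function of A, C is not.
features = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 11.0],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = filter_correlated(features)
```
]]></preformat>
<p>Because the filter is greedy, the order of the input matters. One option is to pass the descriptors in descending order of grey relational degree, so that the more strongly associated descriptor of each pair is the one retained.</p>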
</sec>
<sec id="s4-2-1-4">
<title>4.2.1.4 Variable selection model based on random forest</title>
<p>Subsequently, we used the remaining 91 molecular descriptors as feature variables to establish a random forest model for regression prediction of molecular activity and calculated the SHAP values for each molecular descriptor.</p>
<p>The random forest is an ensemble learning method used for tasks such as classification and regression. It builds multiple decision trees during the training process and uses the majority vote (for classification) or the average (for regression) of these trees&#x2019; predicted classes or values for final prediction. The random forest algorithm utilizes bagging (Bootstrap Aggregating) to create multiple training subsets from the original dataset.</p>
<p>Suppose the original dataset is <inline-formula id="inf15">
<mml:math id="m21">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, containing <inline-formula id="inf16">
<mml:math id="m22">
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> samples. The bagging process generates <inline-formula id="inf17">
<mml:math id="m23">
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> bootstrap samples <inline-formula id="inf18">
<mml:math id="m24">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf19">
<mml:math id="m25">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="&#x7c;">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. Each bootstrap sample <inline-formula id="inf20">
<mml:math id="m26">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is used to build a decision tree. At each node, a subset of features is randomly selected for the splitting strategy, and the best feature within this subset is chosen for the split. If there are <inline-formula id="inf21">
<mml:math id="m27">
<mml:mrow>
<mml:mi mathvariant="normal">p</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> total features, typically <inline-formula id="inf22">
<mml:math id="m28">
<mml:mrow>
<mml:mi mathvariant="normal">m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> features are selected, <inline-formula id="inf23">
<mml:math id="m29">
<mml:mrow>
<mml:mi mathvariant="normal">m</mml:mi>
<mml:mo>&#x2248;</mml:mo>
<mml:msqrt>
<mml:mi mathvariant="normal">p</mml:mi>
</mml:msqrt>
</mml:mrow>
</mml:math>
</inline-formula>. For regression problems, the final prediction is the average of all predictions from each tree: <inline-formula id="inf24">
<mml:math id="m30">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>B</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
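<p>The bagging and averaging steps can be sketched as follows. This is a deliberately simplified illustration: the stub learner stands in for a real decision tree, and all function names and data are hypothetical rather than taken from the authors&#x2019; implementation.</p>
<preformat><![CDATA[
```python
import math
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]
Tree = Callable[[Sequence[float]], float]

def bootstrap_sample(data: List[Sample], rng: random.Random) -> List[Sample]:
    """Draw n samples with replacement from a dataset of size n."""
    return [rng.choice(data) for _ in data]

def feature_subset(p: int, rng: random.Random) -> List[int]:
    """Pick about sqrt(p) candidate feature indices for one split."""
    m = max(1, round(math.sqrt(p)))
    return sorted(rng.sample(range(p), m))

def fit_stub_tree(sample: List[Sample]) -> Tree:
    """Stand-in for a real decision tree: always predicts the sample mean."""
    target_mean = mean(y for _, y in sample)
    return lambda x: target_mean

def bagged_prediction(trees: List[Tree], x: Sequence[float]) -> float:
    """Average the per-tree predictions T_b(x)."""
    return mean(tree(x) for tree in trees)

rng = random.Random(0)
# Toy data (made-up values): feature vectors paired with activity values.
data = [
    ([0.1, 0.2, 0.3, 0.4], 5.0),
    ([0.5, 0.6, 0.7, 0.8], 7.0),
    ([0.9, 1.0, 1.1, 1.2], 9.0),
]
trees = [fit_stub_tree(bootstrap_sample(data, rng)) for _ in range(25)]
y_hat = bagged_prediction(trees, data[0][0])
candidates = feature_subset(91, rng)  # 91 descriptors remain after filtering
```
]]></preformat>
<p>A real random forest would grow each tree by repeatedly choosing the best split among the candidate features. The stub only shows how bootstrap sampling and prediction averaging fit together.</p>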
</sec>
<sec id="s4-2-1-5">
<title>4.2.1.5 SHAP interpretation of machine learning model</title>
<p>SHAP (SHapley Additive exPlanations) builds on the Shapley value, which was originally proposed to address the problem of reward distribution in cooperative game theory. In machine learning, the model&#x2019;s prediction can be seen as the outcome of the &#x201c;cooperation&#x201d; of all features. The SHAP value assigns each feature a contribution value that explains its importance to the model output.</p>
<p>The process of calculating the SHAP value for a specific feature <inline-formula id="inf25">
<mml:math id="m31">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is as follows:<list list-type="simple">
<list-item>
<p>1. Perform weighting for all possible feature subsets <inline-formula id="inf26">
<mml:math id="m32">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf27">
<mml:math id="m33">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> does not contain the feature <inline-formula id="inf28">
<mml:math id="m34">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</list-item>
<list-item>
<p>2. Calculate the difference in model output between the model <inline-formula id="inf29">
<mml:math id="m35">
<mml:mrow>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi mathvariant="normal">S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> before adding the feature <inline-formula id="inf30">
<mml:math id="m36">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and the model <inline-formula id="inf31">
<mml:math id="m37">
<mml:mrow>
<mml:mi mathvariant="normal">f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi mathvariant="normal">S</mml:mi>
<mml:mo>&#x222a;</mml:mo>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="&#x7c;">
<mml:mrow>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> after adding the feature <inline-formula id="inf32">
<mml:math id="m38">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</list-item>
<list-item>
<p>3. Calculate the contribution value <inline-formula id="inf33">
<mml:math id="m39">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3d5;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> for feature <inline-formula id="inf34">
<mml:math id="m40">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> by averaging all these differences with weights.</p>
</list-item>
</list>
</p>
<p>As shown in <xref ref-type="disp-formula" rid="e5">Equation 5</xref>:<disp-formula id="e5">
<mml:math id="m41">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3d5;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>&#x2286;</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>&#x2216;</mml:mo>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="&#x7c;">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:munder>
</mml:mstyle>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>!</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>!</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>!</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mrow>
<mml:mfenced open="[" close="]" separators="&#x7c;">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>&#x222a;</mml:mo>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="&#x7c;">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>where <inline-formula id="inf35">
<mml:math id="m42">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x3d5;</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the SHAP value for feature <inline-formula id="inf36">
<mml:math id="m43">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf37">
<mml:math id="m44">
<mml:mrow>
<mml:mi mathvariant="normal">S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is a subset of features that does not include <inline-formula id="inf38">
<mml:math id="m45">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, N is the set of all features, <inline-formula id="inf39">
<mml:math id="m46">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the model output prediction for the feature subset <inline-formula id="inf40">
<mml:math id="m47">
<mml:mrow>
<mml:mi mathvariant="normal">S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf41">
<mml:math id="m48">
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:mi mathvariant="normal">S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</inline-formula> is the number of features in subset <inline-formula id="inf42">
<mml:math id="m49">
<mml:mrow>
<mml:mi mathvariant="normal">S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
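<p>Equation 5 can be evaluated exactly when the number of features is small. The sketch below enumerates every subset for a toy value function; the effect sizes and the interaction term are made up for illustration, and the function names are hypothetical. Practical SHAP implementations rely on approximations or model-specific algorithms rather than full enumeration.</p>
<preformat><![CDATA[
```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Iterable

def shapley_values(
    features: Iterable[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating all subsets S of N without i."""
    players = list(features)
    n = len(players)
    phi: Dict[str, float] = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (|N| - |S| - 1)! / |N|! from Equation 5.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

# Toy value function (made-up numbers): additive effects plus one interaction.
EFFECTS = {"MLogP": 2.0, "nHBAcc": 1.0, "XLogP": 0.5}

def toy_model(subset: FrozenSet[str]) -> float:
    output = sum(EFFECTS[name] for name in subset)
    if "MLogP" in subset and "XLogP" in subset:
        output += 1.0  # interaction shared equally by the two features
    return output

phi = shapley_values(EFFECTS, toy_model)
```
]]></preformat>
<p>The resulting values sum to the difference between the output for all features and the output for the empty set, which is the efficiency property that makes the attribution additive.</p>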
<p>SHAP values are used to explain the contribution of features in machine learning models, assessing the specific impact of each feature on the model&#x2019;s predictions. Through the above calculations, the top 20 molecular descriptors with the highest SHAP values were selected, representing the 20 descriptors with the most significant impact on biological activity, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. Each violin plot in the figure represents the SHAP value distribution for each molecular descriptor, with the SHAP value reflecting the extent to which the descriptor influences the model output.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Top 20 molecular descriptors with the highest SHAP values based on the random forest model.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g003.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="F3">Figure 3</xref>:<list list-type="simple">
<list-item>
<p>1. The SHAP values of each molecular descriptor are mapped to dots of different colors, with the color bar on the right indicating the magnitude of the feature values. Blue represents low feature values, while red represents high feature values.</p>
</list-item>
<list-item>
<p>2. The horizontal axis represents the SHAP value. A positive SHAP value means the feature pushes the model&#x2019;s prediction higher and a negative value means it pushes the prediction lower; the larger the absolute value, the greater the feature&#x2019;s contribution.</p>
</list-item>
<list-item>
<p>3. The shape of the violin plot shows the distribution of SHAP values at different feature values. A wider distribution indicates greater variation in the feature&#x2019;s influence on the model output across different values.</p>
</list-item>
</list>
</p>
<p>The final selected molecular descriptors are shown in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>The 20 molecular descriptors.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">No.</th>
<th align="center">Molecular descriptor</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">1</td>
<td align="center">LipoaffinityIndex</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">BCUTc-1l</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">minsssN</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">minHsOH</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">maxsOH</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">ATSc3</td>
</tr>
<tr>
<td align="center">7</td>
<td align="center">nHBAcc</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">BCUTp-1h</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">minsOH</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">minHBint10</td>
</tr>
<tr>
<td align="center">11</td>
<td align="center">MDEC-23</td>
</tr>
<tr>
<td align="center">12</td>
<td align="center">MLogP</td>
</tr>
<tr>
<td align="center">13</td>
<td align="center">minHBint5</td>
</tr>
<tr>
<td align="center">14</td>
<td align="center">XLogP</td>
</tr>
<tr>
<td align="center">15</td>
<td align="center">ATSc2</td>
</tr>
<tr>
<td align="center">16</td>
<td align="center">mindssC</td>
</tr>
<tr>
<td align="center">17</td>
<td align="center">MDEO-12</td>
</tr>
<tr>
<td align="center">18</td>
<td align="center">MAXDP2</td>
</tr>
<tr>
<td align="center">19</td>
<td align="center">ETA_BetaP_s</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">C3SP2</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s4-3">
<title>4.3 Construction of biological activity prediction model</title>
<p>The feature variables selected are the 20 molecular descriptors shown in <xref ref-type="table" rid="T2">Table 2</xref>, with the data divided into training, testing, and validation sets in an 8:1:1 ratio.<list list-type="simple">
<list-item>
<p>1. Regression Model Selection: Ten common machine learning regression models were used, including Linear Regression, Ridge, Lasso, ElasticNet, RandomForest, LightGBM, XGBoost, Gradient Boosting Decision Tree (GBDT), SVM, and Decision Tree.</p>
</list-item>
<list-item>
<p>2. Multi-Model Fusion: To improve the predictive performance of the model, we experimented with three fusion strategies on the three best-performing models (LightGBM, RandomForest, and XGBoost), including simple averaging, weighted averaging, and stacking fusion. Stacking fusion yielded the best results.</p>
</list-item>
<list-item>
<p>3. Prediction Results: The optimal model was used to predict the pIC50 values for 50 test set compounds, which were then converted into IC50 values.</p>
</list-item>
</list>
</p>
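<p>The data split, the two averaging strategies, and the final unit conversion can be sketched as follows. Stacking is omitted because it requires a trained meta-learner. The function names, the seed, and the weights are hypothetical, and the conversion assumes the standard definition of pIC50 as the negative base-10 logarithm of IC50 expressed in mol/L.</p>
<preformat><![CDATA[
```python
import random
from typing import List, Sequence, Tuple

def split_8_1_1(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle sample indices and split them 8:1:1 into train, test, validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 8 // 10
    n_test = n_samples // 10
    return (
        indices[:n_train],
        indices[n_train:n_train + n_test],
        indices[n_train + n_test:],
    )

def weighted_average(
    predictions: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Fuse per-model predictions; equal weights reduce to simple averaging."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, per_sample)) / total
        for per_sample in zip(*predictions)
    ]

def pic50_to_ic50_nm(pic50: float) -> float:
    """Convert pIC50 to IC50 in nM, assuming pIC50 = -log10(IC50 in mol/L)."""
    return 10 ** (9 - pic50)

train, test, validation = split_8_1_1(100)
# Made-up pIC50 predictions from two models for two compounds.
simple = weighted_average([[6.0, 7.0], [8.0, 9.0]], [1.0, 1.0])
weighted = weighted_average([[6.0, 7.0], [8.0, 9.0]], [3.0, 1.0])
ic50 = [pic50_to_ic50_nm(p) for p in simple]
```
]]></preformat>
<p>In a weighted scheme the weights would typically be chosen from validation performance, so that better models contribute more to the fused prediction.</p>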
<sec id="s4-3-1">
<title>4.3.1 Regression prediction model</title>
<sec id="s4-3-1-1">
<title>4.3.1.1 Linear regression</title>
<p>Linear regression seeks the linear function that best describes the relationship between the target variable <inline-formula id="inf43">
<mml:math id="m50">
<mml:mrow>
<mml:mi mathvariant="normal">y</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and input features <inline-formula id="inf44">
<mml:math id="m51">
<mml:mrow>
<mml:mi mathvariant="normal">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. As shown in <xref ref-type="disp-formula" rid="e6">Equation 6</xref>:<disp-formula id="e6">
<mml:math id="m52">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi mathvariant="bold">&#x3b2;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3f5;</mml:mi>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>where <inline-formula id="inf45">
<mml:math id="m53">
<mml:mrow>
<mml:mi mathvariant="normal">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the feature matrix, <inline-formula id="inf46">
<mml:math id="m54">
<mml:mrow>
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the vector of regression coefficients, and <inline-formula id="inf47">
<mml:math id="m55">
<mml:mrow>
<mml:mi>&#x3f5;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> represents the error term.</p>
</sec>
<sec id="s4-3-1-2">
<title>4.3.1.2 Ridge regression</title>
<p>Ridge regression is an improved form of linear regression that incorporates an <italic>L</italic>
<sub>
<italic>2</italic>
</sub> regularization term, which penalizes large coefficients and thereby reduces model complexity, as shown in <xref ref-type="disp-formula" rid="e7">Equation 7</xref>:<disp-formula id="e7">
<mml:math id="m56">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>arg</mml:mi>
<mml:munder>
<mml:mi>min</mml:mi>
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="&#x7c;">
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2225;</mml:mo>
<mml:mi mathvariant="normal">y</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
<mml:msup>
<mml:mo>&#x2225;</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="normal">&#x3bb;</mml:mi>
<mml:mo>&#x2225;</mml:mo>
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
<mml:msup>
<mml:mo>&#x2225;</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf48">
<mml:math id="m57">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the regularization parameter.</p>
</sec>
<sec id="s4-3-1-3">
<title>4.3.1.3 Lasso regression</title>
<p>Lasso regression introduces an <italic>L</italic><sub><italic>1</italic></sub> regularization term into the regression model, which can shrink some regression coefficients exactly to zero and thus performs feature selection, as shown in <xref ref-type="disp-formula" rid="e8">Equation 8</xref>:<disp-formula id="e8">
<mml:math id="m58">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>arg</mml:mi>
<mml:munder>
<mml:mi>min</mml:mi>
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="&#x7c;">
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2225;</mml:mo>
<mml:mi mathvariant="normal">y</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
<mml:msup>
<mml:mo>&#x2225;</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="normal">&#x3bb;</mml:mi>
<mml:mo>&#x2225;</mml:mo>
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
<mml:msub>
<mml:mo>&#x2225;</mml:mo>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(8)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf49">
<mml:math id="m59">
<mml:mrow>
<mml:mi mathvariant="normal">&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the regularization parameter.</p>
</sec>
<sec id="s4-3-1-4">
<title>4.3.1.4 Elastic net</title>
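<p>Elastic Net combines the <italic>L</italic><sub><italic>1</italic></sub> penalty of Lasso regression with the <italic>L</italic><sub><italic>2</italic></sub> penalty of Ridge regression, as shown in <xref ref-type="disp-formula" rid="e9">Equation 9</xref>:<disp-formula id="e9">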
<mml:math id="m60">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>arg</mml:mi>
<mml:munder>
<mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mfenced open="{" close="}" separators="&#x7c;">
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2225;</mml:mo>
<mml:mi mathvariant="normal">y</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="normal">X</mml:mi>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
<mml:msup>
<mml:mo>&#x2225;</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x3bb;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mo>&#x2225;</mml:mo>
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
</mml:mrow>
<mml:msub>
<mml:mo>&#x2225;</mml:mo>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x3bb;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mo>&#x2225;</mml:mo>
<mml:mi mathvariant="normal">&#x3b2;</mml:mi>
</mml:mrow>
<mml:msup>
<mml:mo>&#x2225;</mml:mo>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(9)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf50">
<mml:math id="m61">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x3bb;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf51">
<mml:math id="m62">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x3bb;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are the regularization parameters.</p>
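<p>For a single feature without an intercept, the least-squares fit of Equation 6 and the penalized objectives in Equations 7 to 9 have closed-form minimizers, which makes the effect of the two penalties easy to see. The following Python sketch covers only this illustrative special case; the function name and arguments are hypothetical, and practical models with many descriptors are fitted with iterative solvers instead.</p>
<preformat>
```python
def elastic_net_1d(x, y, lam1=0.0, lam2=0.0):
    """Minimize sum((y_i - x_i*b)^2) + lam1*|b| + lam2*b^2 over a scalar b.

    lam1 = lam2 = 0 gives ordinary least squares, lam1 = 0 gives ridge,
    and lam2 = 0 gives lasso.
    """
    xy = sum(xi * yi for xi, yi in zip(x, y))
    xx = sum(xi * xi for xi in x)
    # The L1 penalty soft-thresholds the correlation term by lam1 / 2.
    shrunk = max(abs(xy) - lam1 / 2.0, 0.0)
    sign = 1.0 if xy >= 0 else -1.0
    # The L2 penalty enlarges the denominator, shrinking b towards zero.
    return sign * shrunk / (xx + lam2)


if __name__ == "__main__":
    x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
    print("OLS        ", elastic_net_1d(x, y))
    print("Ridge      ", elastic_net_1d(x, y, lam2=14.0))
    print("Lasso      ", elastic_net_1d(x, y, lam1=28.0))
    print("Elastic Net", elastic_net_1d(x, y, lam1=28.0, lam2=14.0))
```
</preformat>
<p>In this setting the ridge penalty shrinks the coefficient smoothly but never to exactly zero, whereas a sufficiently large lasso penalty sets it to zero.</p>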
</sec>
<sec id="s4-3-1-5">
<title>4.3.1.5 XGBoost</title>
<p>XGBoost is an implementation of gradient boosting decision trees optimized for computational speed and memory usage. It handles regression and classification tasks by adding trees incrementally, each one improving on the current ensemble, and it adds a regularization term to the training objective to limit overfitting. The resulting additive model is shown in <xref ref-type="disp-formula" rid="e10">Equation 10</xref>:<disp-formula id="e10">
<mml:math id="m63">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>K</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(10)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf52">
<mml:math id="m64">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">h</mml:mi>
<mml:mi mathvariant="normal">k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the <inline-formula id="inf53">
<mml:math id="m65">
<mml:mrow>
<mml:mi mathvariant="normal">k</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th tree, and <inline-formula id="inf54">
<mml:math id="m66">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x3b1;</mml:mi>
<mml:mi mathvariant="normal">k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is its weight.</p>
</sec>
<sec id="s4-3-1-6">
<title>4.3.1.6 LightGBM</title>
<p>LightGBM is an efficient implementation of gradient boosting decision trees that uses a histogram-based method to accelerate the training process and supports efficient handling of categorical features. It builds multiple trees incrementally, with each tree being optimized on the basis of gradient boosting. The model form is similar to that of XGBoost.</p>
</sec>
<sec id="s4-3-1-7">
<title>4.3.1.7 Gradient boosting decision tree (GBDT)</title>
<p>GBDT is an ensemble learning method that builds decision trees sequentially, with each tree fitted to correct the errors of the trees before it. The final prediction of the model is the weighted sum of the predictions of all trees.</p>
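<p>The sequential error-correcting procedure described here can be sketched in a few lines of Python. The sketch assumes squared-error loss, a single input feature with at least two distinct values, and one-split regression trees (stumps) as base learners; the function names and default settings are hypothetical and far simpler than the GBDT, XGBoost, or LightGBM implementations used in practice.</p>
<preformat>
```python
def fit_stump(x, r):
    """Fit a one-split regression stump to residuals r on one feature x."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds (largest value excluded)
        left = [ri for xi, ri in zip(x, r) if t >= xi]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or best[0] > sse:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if t >= xi else rm


def gradient_boost(x, y, n_trees=50, eta=0.1):
    """Return a predictor: mean(y) plus eta times the sum of the fitted stumps."""
    base = sum(y) / len(y)
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + eta * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + eta * sum(tree(xi) for tree in trees)


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6]
    y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
    model = gradient_boost(x, y)
    print([round(model(xi), 3) for xi in x])
```
</preformat>
<p>Each round fits a new stump to what the current ensemble still gets wrong, so the training error decreases as trees are added; the learning rate eta controls how much each tree contributes.</p>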
</sec>
<sec id="s4-3-1-8">
<title>4.3.1.8 Support vector machine (SVM)</title>
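<p>SVM is a model for classification and regression. For classification, it separates the classes by finding the maximum-margin hyperplane; for regression (support vector regression), the linear function <italic>w</italic><sup>T</sup><italic>x</italic> + <italic>b</italic> is used directly as the prediction rather than being passed through the sign function. The classification decision function is shown in <xref ref-type="disp-formula" rid="e11">Equation 11</xref>:<disp-formula id="e11">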
<mml:math id="m67">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>sgn</mml:mtext>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
<mml:mi>x</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(11)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf55">
<mml:math id="m68">
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the weight vector, and <inline-formula id="inf56">
<mml:math id="m69">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the bias term.</p>
</sec>
<sec id="s4-3-1-9">
<title>4.3.1.9 Decision tree</title>
<p>A decision tree is a tree-structured model that performs classification or regression by applying a sequence of conditional tests to the features. Each internal node represents a test on a feature, and each leaf node represents a class (for classification) or a numeric value (for regression), as shown in <xref ref-type="disp-formula" rid="e12">Equation 12</xref>:<disp-formula id="e12">
<mml:math id="m70">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mtext>leaf</mml:mtext>
<mml:mtext>class</mml:mtext>
</mml:msub>
</mml:mrow>
</mml:math>
<label>(12)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf57">
<mml:math id="m71">
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the feature vector, and <inline-formula id="inf58">
<mml:math id="m72">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the predicted class or, for regression, the predicted value.</p>
</sec>
</sec>
<sec id="s4-3-2">
<title>4.3.2 Model evaluation criteria</title>
<p>To assess the goodness of fit, we used Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and R-Squared (<inline-formula id="inf59">
<mml:math id="m73">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="normal">R</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>). Their calculation formulas are shown in <xref ref-type="table" rid="T3">Table 3</xref>:</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Model evaluation metrics and their calculation formulas.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Evaluation metrics</th>
<th align="center">Calculation formulas</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">MSE</td>
<td align="center">
<inline-formula id="inf60">
<mml:math id="m74">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="center">RMSE</td>
<td align="center">
<inline-formula id="inf61">
<mml:math id="m75">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>M</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="center">MAE</td>
<td align="center">
<inline-formula id="inf62">
<mml:math id="m76">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="center">MAPE</td>
<td align="center">
<inline-formula id="inf63">
<mml:math id="m77">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>100</mml:mn>
<mml:mo>%</mml:mo>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:mfrac>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mfenced open="|" close="|" separators="&#x7c;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="center">
<inline-formula id="inf64">
<mml:math id="m78">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="normal">R</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="center">
<inline-formula id="inf65">
<mml:math id="m79">
<mml:mrow>
<mml:msup>
<mml:mi>R</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="table" rid="T3">Table 3</xref>, <inline-formula id="inf66">
<mml:math id="m80">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">y</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf67">
<mml:math id="m81">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi mathvariant="normal">y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
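</inline-formula> represent the actual and predicted values on the test set, respectively, <italic>m</italic> is the number of test samples, and <inline-formula><mml:math><mml:mrow><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo>&#xaf;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is the mean of the actual values. The smaller the values of MSE, RMSE, MAE, and MAPE, the higher the predictive accuracy of the model. The closer the <inline-formula id="inf68">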
<mml:math id="m82">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="normal">R</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> value is to 1, the better the model&#x2019;s fit.</p>
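<p>As a worked illustration of the metrics in Table 3, the following Python sketch computes all five from a list of actual values and a list of predicted values. The function name and the returned dictionary keys are hypothetical, and the MAPE term assumes that no actual value is zero.</p>
<preformat>
```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE (in percent) and R2 for paired value lists."""
    m = len(y_true)
    errors = [yp - yt for yt, yp in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / m
    mae = sum(abs(e) for e in errors) / m
    # MAPE divides by the actual values, so they must all be non-zero.
    mape = 100.0 / m * sum(abs(e / yt) for e, yt in zip(errors, y_true))
    mean_true = sum(y_true) / m
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - mean_true) ** 2 for yt in y_true)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "MAPE": mape,
        "R2": 1.0 - ss_res / ss_tot,
    }


if __name__ == "__main__":
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0]))
```
</preformat>
<p>For the example values in the script, two of the four predictions are off by 0.5, which gives an MSE of 0.125, an MAE of 0.25, and an R<sup>2</sup> of 0.9.</p>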
</sec>
<sec id="s4-3-3">
<title>4.3.3 Model solving</title>
<p>The feature variables are the 20 molecular descriptors listed in <xref ref-type="table" rid="T2">Table 2</xref>. The dataset was divided into training, testing, and validation sets in an 8:1:1 ratio. Ten machine learning models were first used for regression prediction, and their predictive performance is illustrated in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Comparison of ten regression models.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g004.tif"/>
</fig>
<p>As can be seen, the three models with the highest <inline-formula id="inf69">
<mml:math id="m83">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="normal">R</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> values are LightGBM, RandomForest, and XGBoost, with values of 0.737, 0.736, and 0.711, respectively. To improve prediction accuracy, we combined these three models, those with the highest <inline-formula id="inf70">
<mml:math id="m84">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="normal">R</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> values, using three fusion strategies: simple averaging, weighted averaging (5:3:2), and stacking. The stacking model performed best, achieving an <inline-formula id="inf71">
<mml:math id="m85">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="normal">R</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> value of 0.743. The predictive performance of the stacking model is depicted in <xref ref-type="fig" rid="F5">Figure 5</xref>, and the final predictions were written to &#x201c;ER&#x3b1;_activity_test.csv&#x201d;.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Prediction performance of the stacking model.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g005.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="F5">Figure 5</xref>, the left plot compares the actual values (horizontal axis) with the predicted values (vertical axis) for the test set. Each red dot corresponds to one test sample, and the dashed line indicates perfect prediction. The points cluster around this line, showing that the predictions follow the overall trend of the actual values. The right plot is a line chart of the actual and predicted values for the first 30 samples, with red dots for actual values and blue squares for predicted values. The dashed line connecting these points shows how the two series vary from sample to sample; the predicted values track the actual values closely for most samples.</p>
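<p>The two averaging strategies named in this section can be sketched as follows. The sketch is illustrative only: the function names and example predictions are hypothetical, and the order in which the 5:3:2 weights are assigned to the three models is an assumption, since it is not specified here.</p>
<preformat>
```python
def simple_average(predictions):
    """predictions: one list of predicted values per model, all of equal length."""
    n_models = len(predictions)
    return [sum(vals) / n_models for vals in zip(*predictions)]


def weighted_average(predictions, weights):
    """Combine per-model predictions using weights normalized to sum to one."""
    total = sum(weights)
    return [
        sum(w * v for w, v in zip(weights, vals)) / total
        for vals in zip(*predictions)
    ]


if __name__ == "__main__":
    # Hypothetical pIC50 predictions from three models for two compounds.
    model_a = [6.0, 7.0]
    model_b = [6.5, 7.5]
    model_c = [7.0, 6.5]
    preds = [model_a, model_b, model_c]
    print(simple_average(preds))
    print(weighted_average(preds, [5, 3, 2]))
```
</preformat>
<p>Stacking differs from both in that the combination is learned: a meta-model is trained on the base models&#x2019; held-out predictions rather than using weights fixed in advance.</p>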
</sec>
</sec>
<sec id="s4-4">
<title>4.4 Classification prediction of ADMET properties</title>
<p>
<list list-type="simple">
<list-item>
<p>1. Recursive Feature Elimination (RFE): Using RandomForest as the base model, Recursive Feature Elimination was applied to select features for the ADMET properties, retaining the 25 most representative molecular descriptors for each ADMET property.</p>
</list-item>
<list-item>
<p>2. Classification Model Selection: Eleven classification models were used, including Logistic Regression, Naive Bayes, LDA, Decision Tree, RandomForest, AdaBoost, GradientBoosting, SVM, MLP, XGBoost, and LightGBM, to predict the ADMET properties of compounds.</p>
</list-item>
<list-item>
<p>3. Classification Performance Evaluation: Model performance was evaluated using metrics such as F1 score and ROC curve, and the best model was selected for each ADMET property. The best classification models for different ADMET properties were LightGBM (Caco-2), XGBoost (CYP3A4 and hERG), Naive Bayes (HOB), and XGBoost (MN).</p>
</list-item>
<list-item>
<p>4. ADMET Property Prediction: The selected best models were used to predict the ADMET properties of 50 compounds.</p>
</list-item>
</list>
</p>
<sec id="s4-4-1">
<title>4.4.1 Recursive feature elimination (RFE)</title>
<p>RFE is a feature selection algorithm. Its core idea is to train a model repeatedly and, after each training cycle, eliminate the least important feature according to the importance scores the model assigns. For a dataset with <italic>n</italic> features, this process is repeated until the desired number of features remains.</p>
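<p>The elimination loop can be sketched in Python as follows. To keep the example self-contained, the importance score is the absolute Pearson correlation between each feature and the target; this is a hypothetical stand-in for the importances of a trained RandomForest, and all function names are illustrative.</p>
<preformat>
```python
import math


def abs_correlation_importance(columns, target):
    """Stand-in importance: |Pearson r| between each feature and the target."""
    m = len(target)
    t_mean = sum(target) / m
    t_dev = [t - t_mean for t in target]
    scores = {}
    for name, values in columns.items():
        v_mean = sum(values) / m
        v_dev = [v - v_mean for v in values]
        cov = sum(a * b for a, b in zip(v_dev, t_dev))
        norm = math.sqrt(sum(a * a for a in v_dev) * sum(b * b for b in t_dev))
        scores[name] = abs(cov) / norm if norm > 0 else 0.0
    return scores


def recursive_feature_elimination(columns, target, n_keep,
                                  importance_fn=abs_correlation_importance):
    """Drop the least important feature until only n_keep features remain."""
    kept = dict(columns)
    while len(kept) > n_keep:
        # Re-score on the remaining features (a real model would be retrained here).
        scores = importance_fn(kept, target)
        weakest = min(scores, key=scores.get)
        del kept[weakest]
    return list(kept)


if __name__ == "__main__":
    features = {"a": [1, 2, 3, 4], "b": [1, 0, 1, 0], "c": [4, 3, 2, 2]}
    print(recursive_feature_elimination(features, [1, 2, 3, 4], n_keep=2))
```
</preformat>
<p>With a model-based importance such as that of a RandomForest, the scores change after every elimination because the model is retrained on the reduced feature set, which is what distinguishes RFE from a one-off ranking.</p>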
</sec>
<sec id="s4-4-2">
<title>4.4.2 Classification prediction models</title>
<sec id="s4-4-2-1">
<title>4.4.2.1 Logistic Regression</title>
<p>Logistic regression is a linear model used for binary classification problems. It converts a linear combination of the features into a probability by applying the sigmoid function, as shown in <xref ref-type="disp-formula" rid="e13">Equation 13</xref>:<disp-formula id="e13">
<mml:math id="m86">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>X</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>T</mml:mi>
</mml:msup>
<mml:mi>X</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(13)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf72">
<mml:math id="m87">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the feature vector, <inline-formula id="inf73">
<mml:math id="m88">
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the regression coefficient vector, and <inline-formula id="inf74">
<mml:math id="m89">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the bias term.</p>
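<p>Equation 13 can be evaluated directly, as in the following minimal Python sketch. The function name and the example coefficients are hypothetical; the two branches are algebraically identical and are used only to avoid floating-point overflow for large negative inputs.</p>
<preformat>
```python
import math


def predict_proba(x, beta, beta0):
    """P(y = 1 | x) for feature vector x, coefficients beta and bias beta0."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that keeps the exponent negative when z is negative.
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    print(predict_proba([1.0, 2.0], beta=[0.5, -0.25], beta0=0.0))
```
</preformat>
<p>A compound is assigned to the positive class when this probability exceeds a chosen threshold, commonly 0.5.</p>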
</sec>
<sec id="s4-4-2-2">
<title>4.4.2.2 Naive Bayes</title>
<p>The Naive Bayes classifier is a simple classifier based on Bayes&#x2019; theorem that assumes the features are conditionally independent given the class, as shown in <xref ref-type="disp-formula" rid="e14">Equation 14</xref>:<disp-formula id="e14">
<mml:math id="m90">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>X</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:msubsup>
<mml:mo>&#x220f;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(14)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf75">
<mml:math id="m91">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the class label, <inline-formula id="inf76">
<mml:math id="m92">
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the feature vector, and <inline-formula id="inf77">
<mml:math id="m93">
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the <inline-formula id="inf78">
<mml:math id="m94">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th feature.</p>
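<p>A small numerical illustration of Equation 14 is given below. The class labels, priors, and per-feature likelihoods are hypothetical values chosen for the example; the denominator P(X) is obtained by summing the numerator over all classes.</p>
<preformat>
```python
def naive_bayes_posterior(priors, likelihoods):
    """priors: {label: P(y)}; likelihoods: {label: [P(x_i | y) per feature]}."""
    joint = {}
    for label, prior in priors.items():
        p = prior
        for lik in likelihoods[label]:
            p *= lik  # independence assumption: multiply per-feature terms
        joint[label] = p
    evidence = sum(joint.values())  # P(X), summed over all classes
    return {label: p / evidence for label, p in joint.items()}


if __name__ == "__main__":
    priors = {"positive": 0.5, "negative": 0.5}
    likelihoods = {"positive": [0.8, 0.5], "negative": [0.2, 0.5]}
    print(naive_bayes_posterior(priors, likelihoods))
```
</preformat>
<p>The predicted class is the label with the largest posterior probability.</p>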
</sec>
<sec id="s4-4-2-3">
<title>4.4.2.3 Linear discriminant analysis (LDA)</title>
<p>LDA is a technique used for dimensionality reduction and classification. It seeks the projection direction that maximizes between-class scatter while minimizing within-class scatter; that is, it finds the optimal linear transformation by maximizing the ratio of between-class scatter to within-class scatter, as shown in <xref ref-type="disp-formula" rid="e15">Equation 15</xref>:<disp-formula id="e15">
<mml:math id="m95">
<mml:mrow>
<mml:mi>J</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>B</mml:mi>
</mml:msub>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>W</mml:mi>
</mml:msub>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(15)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf79">
<mml:math id="m96">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">S</mml:mi>
<mml:mi mathvariant="normal">B</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the between-class scatter matrix, <inline-formula id="inf80">
<mml:math id="m97">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">S</mml:mi>
<mml:mi mathvariant="normal">W</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the within-class scatter matrix, and <inline-formula id="inf81">
<mml:math id="m98">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the projection vector.</p>
</sec>
<sec id="s4-4-2-4">
<title>4.4.2.4 Adaptive boosting (AdaBoost)</title>
<p>AdaBoost is an ensemble learning method that iteratively trains a series of weak classifiers (e.g., decision stumps), with each classifier giving more weight to the samples misclassified by its predecessors. The final classification result is a weighted vote of all weak classifiers, as shown in <xref ref-type="disp-formula" rid="e16">Equation 16</xref>:<disp-formula id="e16">
<mml:math id="m99">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(16)</label>
</disp-formula>where <inline-formula id="inf82">
<mml:math id="m100">
<mml:mrow>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the <inline-formula id="inf83">
<mml:math id="m101">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th weak classifier, and <inline-formula id="inf84">
<mml:math id="m102">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is its weight.</p>
</sec>
<sec id="s4-4-2-5">
<title>4.4.2.5 Gradient boosting</title>
<p>Gradient boosting is an ensemble learning method that builds decision trees sequentially, where each tree attempts to correct the errors of the previous trees. The model&#x2019;s final prediction is the weighted sum of the predictions of all trees. The update at stage <italic>m</italic> is shown in <xref ref-type="disp-formula" rid="e17">Equation 17</xref>:<disp-formula id="e17">
<mml:math id="m103">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b7;</mml:mi>
<mml:mo>&#xb7;</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(17)</label>
</disp-formula>where <inline-formula id="inf85">
<mml:math id="m104">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the prediction from the first <inline-formula id="inf86">
<mml:math id="m105">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> trees, <inline-formula id="inf87">
<mml:math id="m106">
<mml:mrow>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the <inline-formula id="inf88">
<mml:math id="m107">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th tree, and <inline-formula id="inf89">
<mml:math id="m108">
<mml:mrow>
<mml:mi>&#x3b7;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the learning rate.</p>
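<p>The additive update in Equation 17 can be illustrated with a minimal sketch. It assumes a one-dimensional input with at least two distinct values, squared-error loss, and one-split regression stumps as the weak learners. The function names fit_stump and gradient_boost and the default values of n_trees and eta are illustrative assumptions, not the configuration used in this study.</p>
<preformat preformat-type="code"><![CDATA[
```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump h(x) to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, eta=0.1):
    """Build a predictor via F_m(x) = F_{m-1}(x) + eta * h_m(x)."""
    base = sum(ys) / len(ys)  # F_0: constant initial prediction
    stumps = []

    def predict(x):
        return base + eta * sum(h(x) for h in stumps)

    for _ in range(n_trees):
        # For squared loss, the negative gradient is the plain residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict
```
]]></preformat>
<p>For example, gradient_boost([1, 2, 3, 4], [0, 0, 1, 1]) returns a predictor whose outputs approach 0 for x &#x2264; 2 and 1 for x &#x3e; 2 as trees are added. A smaller learning rate slows this convergence, which is the usual trade-off against overfitting.</p>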
</sec>
<sec id="s4-4-2-6">
<title>4.4.2.6 MLP</title>
<p>A Multilayer Perceptron (MLP) is a feedforward neural network consisting of an input layer, one or more hidden layers, and an output layer. Each layer comprises multiple neurons that apply a nonlinear transformation through an activation function (such as ReLU or Sigmoid), as shown in <xref ref-type="disp-formula" rid="e18">Equation 18</xref>:<disp-formula id="e18">
<mml:math id="m109">
<mml:mrow>
<mml:msup>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(18)</label>
</disp-formula>
</p>
<p>Where <inline-formula id="inf90">
<mml:math id="m110">
<mml:mrow>
<mml:msup>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is the activation vector of the <inline-formula id="inf91">
<mml:math id="m111">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th layer, <inline-formula id="inf92">
<mml:math id="m112">
<mml:mrow>
<mml:msup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is the weight matrix of the <inline-formula id="inf93">
<mml:math id="m113">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th layer, <inline-formula id="inf94">
<mml:math id="m114">
<mml:mrow>
<mml:msup>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is the bias term, and <inline-formula id="inf95">
<mml:math id="m115">
<mml:mrow>
<mml:mi>&#x3c3;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the activation function.</p>
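<p>The layer-wise computation in Equation 18 can be sketched in a few lines. The example below is a minimal forward pass only, with no training. The function names and the representation of weight matrices as nested lists are illustrative assumptions.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def relu(z):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in z]


def sigmoid(z):
    """Sigmoid activation applied element-wise."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def layer_forward(W, a_prev, b, activation):
    """Compute a^(l) = sigma(W^(l) a^(l-1) + b^(l)) for a single layer."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + bias
         for row, bias in zip(W, b)]
    return activation(z)


def mlp_forward(x, layers):
    """Apply each (W, b, activation) layer in turn, starting from a^(0) = x."""
    a = x
    for W, b, activation in layers:
        a = layer_forward(W, a, b, activation)
    return a
```
]]></preformat>
<p>Each entry of layers corresponds to one value of the layer index in Equation 18, so a network with one hidden layer is described by two (W, b, activation) tuples.</p>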
</sec>
</sec>
<sec id="s4-4-3">
<title>4.4.3 Model evaluation metrics</title>
<p>To select the most effective models, this study utilizes the following classification algorithm evaluation metrics to assess the performance of each model. Let us define:</p>
<p>True Positives (TP): the number of samples correctly predicted as class 1 (predicted as 1 and actually 1).</p>
<p>False Positives (FP): the number of samples incorrectly predicted as class 1 (predicted as 1 but actually 0).</p>
<p>False Negatives (FN): the number of samples incorrectly predicted as class 0 (predicted as 0 but actually 1).</p>
<p>True Negatives (TN): the number of samples correctly predicted as class 0 (predicted as 0 and actually 0).</p>
<sec id="s4-4-3-1">
<title>4.4.3.1 F1 score</title>
<p>The F1 score combines precision and recall into a single measure, defined as their harmonic mean, as shown in <xref ref-type="disp-formula" rid="e19">Equation 19</xref>:<disp-formula id="e19">
<mml:math id="m116">
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>1</mml:mn>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mfrac>
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
<mml:mo>&#xd7;</mml:mo>
<mml:mtext>Recall</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
<mml:mo>&#x2b;</mml:mo>
<mml:mtext>Recall</mml:mtext>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(19)</label>
</disp-formula>
</p>
<p>Where, <inline-formula id="inf96">
<mml:math id="m117">
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf97">
<mml:math id="m118">
<mml:mrow>
<mml:mtext>Recall</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>. In model evaluation, a higher F1 score indicates better performance.</p>
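<p>As a worked illustration of Equation 19, the following sketch counts the four confusion-matrix entries from binary labels and derives the F1 score from them. The function names, and the choice to return 0.0 when there are no true positives, are illustrative assumptions.</p>
<preformat preformat-type="code"><![CDATA[
```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """F1 = 2 * Precision * Recall / (Precision + Recall); 0.0 if TP is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```
]]></preformat>
<p>With true labels [1, 1, 1, 0, 0, 0, 1, 0] and predictions [1, 1, 0, 1, 0, 0, 1, 0], the counts are TP &#x3d; 3, FP &#x3d; 1, FN &#x3d; 1, and TN &#x3d; 3, so precision and recall are both 0.75 and the F1 score is 0.75.</p>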
</sec>
<sec id="s4-4-3-2">
<title>4.4.3.2 ROC curve</title>
<p>The ROC (Receiver Operating Characteristic) curve is a graphical tool for evaluating binary classifiers. It plots the true positive rate against the false positive rate, and each point on the curve corresponds to a specific decision threshold. The classifier assigns a score to each sample; if the score exceeds the threshold, the sample is classified as positive, and otherwise it is classified as negative. The closer the ROC curve lies to the upper-left corner of the plot, the better the classification performance of the model.</p>
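<p>The threshold sweep behind an ROC curve can be sketched as follows. This minimal example assumes that both classes are present and treats a score equal to the threshold as positive, which is a common convention rather than a detail reported here. The function names roc_points and auc are illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        # Assumption: score >= threshold counts as a positive prediction.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```
]]></preformat>
<p>For labels [0, 0, 1, 1] and scores [0.1, 0.4, 0.35, 0.8], the curve passes through (0, 0), (0, 0.5), (0.5, 0.5), (0.5, 1), and (1, 1), giving an area under the curve of 0.75. A perfect classifier would reach the point (0, 1) and an area of 1.</p>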
</sec>
</sec>
<sec id="s4-4-4">
<title>4.4.4 Model solving</title>
<sec id="s4-4-4-1">
<title>4.4.4.1 Data preprocessing</title>
<p>First, starting from the molecular descriptors that remained after single-valued variables were removed in Problem 1, the Recursive Feature Elimination (RFE) algorithm was used to select 25 feature variables for each ADMET property. The features selected for each property are listed in <xref ref-type="table" rid="T4">Tables 4</xref>&#x2013;<xref ref-type="table" rid="T8">8</xref>.</p>
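<p>The general idea of RFE can be sketched as follows. The base estimator used for ranking is not stated here, so this example assumes a logistic regression fitted by batch gradient descent, ranks features by the absolute value of their weights, removes one feature per round, and assumes the features are on comparable scales. All function names and hyperparameters are illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Fit logistic-regression weights by batch gradient descent (with bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += err / n
            for j, xj in enumerate(row):
                grad_w[j] += err * xj / n
        b -= lr * grad_b
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return w


def rfe(X, y, n_select):
    """Refit and drop the feature with the smallest |weight| until n_select remain."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_select:
        X_sub = [[row[j] for j in remaining] for row in X]
        weights = fit_logistic(X_sub, y)
        weakest = min(range(len(remaining)), key=lambda k: abs(weights[k]))
        del remaining[weakest]
    return remaining
```
]]></preformat>
<p>The function returns the column indices of the retained features. In the setting described above, n_select would be 25 and the procedure would be run once per ADMET property.</p>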
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Selected features for Caco-2.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">No.</th>
<th align="center">Molecular descriptor</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">1</td>
<td align="center">BCUTc-1h</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">SP-1</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">SP-2</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">ECCEN</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">SHBd</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">SHother</td>
</tr>
<tr>
<td align="center">7</td>
<td align="center">SsCH3</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">SaaO</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">minHBa</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">minwHBa</td>
</tr>
<tr>
<td align="center">11</td>
<td align="center">minaaO</td>
</tr>
<tr>
<td align="center">12</td>
<td align="center">maxaaO</td>
</tr>
<tr>
<td align="center">13</td>
<td align="center">ETA_Alpha</td>
</tr>
<tr>
<td align="center">14</td>
<td align="center">ETA_Beta_s</td>
</tr>
<tr>
<td align="center">15</td>
<td align="center">ETA_Eta_R_L</td>
</tr>
<tr>
<td align="center">16</td>
<td align="center">FMF</td>
</tr>
<tr>
<td align="center">17</td>
<td align="center">MDEC-23</td>
</tr>
<tr>
<td align="center">18</td>
<td align="center">MLFER_S</td>
</tr>
<tr>
<td align="center">19</td>
<td align="center">MLFER_L</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">TopoPSA</td>
</tr>
<tr>
<td align="center">21</td>
<td align="center">MW</td>
</tr>
<tr>
<td align="center">22</td>
<td align="center">WTPT-1</td>
</tr>
<tr>
<td align="center">23</td>
<td align="center">WTPT-3</td>
</tr>
<tr>
<td align="center">24</td>
<td align="center">WTPT-4</td>
</tr>
<tr>
<td align="center">25</td>
<td align="center">WPATH</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Selected features for CYP3A4.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">No.</th>
<th align="center">Molecular descriptor</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">1</td>
<td align="center">ATSc1</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">bpol</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">VCH-6</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">SP-4</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">SP-7</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">VP-2</td>
</tr>
<tr>
<td align="center">7</td>
<td align="center">VP-4</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">VP-7</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">SHaaCH</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">ETA_dEpsilon_D</td>
</tr>
<tr>
<td align="center">11</td>
<td align="center">ETA_Eta</td>
</tr>
<tr>
<td align="center">12</td>
<td align="center">WTPT-1</td>
</tr>
<tr>
<td align="center">13</td>
<td align="center">Zagreb</td>
</tr>
<tr>
<td align="center">14</td>
<td align="center">ATSc2</td>
</tr>
<tr>
<td align="center">15</td>
<td align="center">SCH-7</td>
</tr>
<tr>
<td align="center">16</td>
<td align="center">SP-3</td>
</tr>
<tr>
<td align="center">17</td>
<td align="center">SP-5</td>
</tr>
<tr>
<td align="center">18</td>
<td align="center">VP-1</td>
</tr>
<tr>
<td align="center">19</td>
<td align="center">VP-3</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">VP-5</td>
</tr>
<tr>
<td align="center">21</td>
<td align="center">SHBd</td>
</tr>
<tr>
<td align="center">22</td>
<td align="center">minHBa</td>
</tr>
<tr>
<td align="center">23</td>
<td align="center">ETA_Beta_s</td>
</tr>
<tr>
<td align="center">24</td>
<td align="center">ETA_Eta_L</td>
</tr>
<tr>
<td align="center">25</td>
<td align="center">WTPT-3</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T6" position="float">
<label>TABLE 6</label>
<caption>
<p>Selected features for hERG.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">No.</th>
<th align="center">Molecular descriptor</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">1</td>
<td align="center">ATSc2</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">bpol</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">VP-0</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">CrippenMR</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">SHBint8</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">SsOH</td>
</tr>
<tr>
<td align="center">7</td>
<td align="center">maxHBd</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">maxaaCH</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">LipoaffinityIndex</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">ETA_EtaP_F</td>
</tr>
<tr>
<td align="center">11</td>
<td align="center">Kier2</td>
</tr>
<tr>
<td align="center">12</td>
<td align="center">McGowan_Volume</td>
</tr>
<tr>
<td align="center">13</td>
<td align="center">WPATH</td>
</tr>
<tr>
<td align="center">14</td>
<td align="center">BCUTc-1l</td>
</tr>
<tr>
<td align="center">15</td>
<td align="center">SP-1</td>
</tr>
<tr>
<td align="center">16</td>
<td align="center">VP-1</td>
</tr>
<tr>
<td align="center">17</td>
<td align="center">ECCEN</td>
</tr>
<tr>
<td align="center">18</td>
<td align="center">SHother</td>
</tr>
<tr>
<td align="center">19</td>
<td align="center">minaasC</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">maxHsOH</td>
</tr>
<tr>
<td align="center">21</td>
<td align="center">hmin</td>
</tr>
<tr>
<td align="center">22</td>
<td align="center">ETA_EtaP</td>
</tr>
<tr>
<td align="center">23</td>
<td align="center">Kier1</td>
</tr>
<tr>
<td align="center">24</td>
<td align="center">Kier3</td>
</tr>
<tr>
<td align="center">25</td>
<td align="center">MDEO-11</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T7" position="float">
<label>TABLE 7</label>
<caption>
<p>Selected features for HOB.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">No.</th>
<th align="center">Molecular descriptor</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">1</td>
<td align="center">ATSc2</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">BCUTp-1l</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">VP-3</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">VP-6</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">SHsOH</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">SdO</td>
</tr>
<tr>
<td align="center">7</td>
<td align="center">minsOH</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">maxsOH</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">hmin</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">ETA_BetaP_s</td>
</tr>
<tr>
<td align="center">11</td>
<td align="center">ETA_EtaP_F_L</td>
</tr>
<tr>
<td align="center">12</td>
<td align="center">MLFER_A</td>
</tr>
<tr>
<td align="center">13</td>
<td align="center">WTPT-4</td>
</tr>
<tr>
<td align="center">14</td>
<td align="center">BCUTc-1l</td>
</tr>
<tr>
<td align="center">15</td>
<td align="center">SC-5</td>
</tr>
<tr>
<td align="center">16</td>
<td align="center">VP-5</td>
</tr>
<tr>
<td align="center">17</td>
<td align="center">VP-7</td>
</tr>
<tr>
<td align="center">18</td>
<td align="center">SsOH</td>
</tr>
<tr>
<td align="center">19</td>
<td align="center">minHBa</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">maxHsOH</td>
</tr>
<tr>
<td align="center">21</td>
<td align="center">maxdO</td>
</tr>
<tr>
<td align="center">22</td>
<td align="center">ETA_Shape_P</td>
</tr>
<tr>
<td align="center">23</td>
<td align="center">ETA_EtaP_L</td>
</tr>
<tr>
<td align="center">24</td>
<td align="center">Kier3</td>
</tr>
<tr>
<td align="center">25</td>
<td align="center">MLFER_BO</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T8" position="float">
<label>TABLE 8</label>
<caption>
<p>Selected features for MN.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">No.</th>
<th align="center">Molecular descriptor</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">1</td>
<td align="center">nN</td>
</tr>
<tr>
<td align="center">2</td>
<td align="center">VPC-5</td>
</tr>
<tr>
<td align="center">3</td>
<td align="center">SssCH2</td>
</tr>
<tr>
<td align="center">4</td>
<td align="center">minHBa</td>
</tr>
<tr>
<td align="center">5</td>
<td align="center">maxsCH3</td>
</tr>
<tr>
<td align="center">6</td>
<td align="center">ETA_Epsilon_1</td>
</tr>
<tr>
<td align="center">7</td>
<td align="center">ETA_dEpsilon_A</td>
</tr>
<tr>
<td align="center">8</td>
<td align="center">ETA_BetaP</td>
</tr>
<tr>
<td align="center">9</td>
<td align="center">ETA_EtaP_B_RC</td>
</tr>
<tr>
<td align="center">10</td>
<td align="center">nHBAcc_Lipinski</td>
</tr>
<tr>
<td align="center">11</td>
<td align="center">MLFER_E</td>
</tr>
<tr>
<td align="center">12</td>
<td align="center">WTPT-3</td>
</tr>
<tr>
<td align="center">13</td>
<td align="center">WTPT-5</td>
</tr>
<tr>
<td align="center">14</td>
<td align="center">SCH-7</td>
</tr>
<tr>
<td align="center">15</td>
<td align="center">nssCH2</td>
</tr>
<tr>
<td align="center">16</td>
<td align="center">SssO</td>
</tr>
<tr>
<td align="center">17</td>
<td align="center">mindssC</td>
</tr>
<tr>
<td align="center">18</td>
<td align="center">maxsssCH</td>
</tr>
<tr>
<td align="center">19</td>
<td align="center">ETA_Epsilon_4</td>
</tr>
<tr>
<td align="center">20</td>
<td align="center">ETA_dEpsilon_C</td>
</tr>
<tr>
<td align="center">21</td>
<td align="center">ETA_BetaP_s</td>
</tr>
<tr>
<td align="center">22</td>
<td align="center">FMF</td>
</tr>
<tr>
<td align="center">23</td>
<td align="center">MLFER_S</td>
</tr>
<tr>
<td align="center">24</td>
<td align="center">TopoPSA</td>
</tr>
<tr>
<td align="center">25</td>
<td align="center">WTPT-4</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-4-4-2">
<title>4.4.4.2 Results of the model in ADMET property prediction</title>
<p>Subsequently, eleven machine learning models were used to classify each of the five ADMET properties individually. The F1 scores of each model&#x2019;s predictions are shown in <xref ref-type="fig" rid="F6">Figure 6</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Comparison of F1 scores for 11 classification models in ADMET property prediction.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g006.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="F6">Figure 6</xref> displays the F1 scores of the 11 classification models for the five ADMET properties: Caco-2, CYP3A4, hERG, HOB, and MN. The scores are plotted as line charts, and each target property is marked with a different symbol.</p>
<p>Application Results of Different Models in ADMET Property Prediction:</p>
<p>In the confusion matrices shown in the following figures, the labels have the following meanings:<list list-type="simple">
<list-item>
<p>1. True 0: Samples whose actual value is 0 (for Caco-2, poor intestinal absorption).</p>
</list-item>
<list-item>
<p>2. True 1: Samples whose actual value is 1 (for Caco-2, good intestinal absorption).</p>
</list-item>
<list-item>
<p>3. Predicted 0: Samples predicted as 0 by the model.</p>
</list-item>
<list-item>
<p>4. Predicted 1: Samples predicted as 1 by the model.</p>
</list-item>
</list>
</p>
<p>The best-performing model for Caco-2 prediction is LightGBM, with an F1 score of <bold>0.8905</bold>. The ROC curve and confusion matrix are shown in <xref ref-type="fig" rid="F7">Figure 7</xref>.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>LightGBM prediction for Caco-2.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g007.tif"/>
</fig>
<p>The best-performing model for CYP3A4 prediction is XGBoost, with an F1 score of <bold>0.9733</bold>. The ROC curve and confusion matrix are shown in <xref ref-type="fig" rid="F8">Figure 8</xref>.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>XGBoost prediction for CYP3A4.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g008.tif"/>
</fig>
<p>The best-performing model for hERG prediction is XGBoost, with an F1 score of <bold>0.9138</bold>. The ROC curve and confusion matrix are shown in <xref ref-type="fig" rid="F9">Figure 9</xref>.</p>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>XGBoost prediction for hERG.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g009.tif"/>
</fig>
<p>The best-performing model for HOB prediction is Naive Bayes, with an F1 score of <bold>0.6824</bold>. The ROC curve and confusion matrix are shown in <xref ref-type="fig" rid="F10">Figure 10</xref>.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>Naive Bayes prediction for HOB.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g010.tif"/>
</fig>
<p>The best-performing model for MN prediction is XGBoost, with an F1 score of <bold>0.9695</bold>. The ROC curve and confusion matrix are shown in <xref ref-type="fig" rid="F11">Figure 11</xref>. In this figure, the area under the ROC curve (AUC) is 0.99, indicating that the model performs exceptionally well in the MN prediction task.</p>
<fig id="F11" position="float">
<label>FIGURE 11</label>
<caption>
<p>XGBoost prediction for MN.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g011.tif"/>
</fig>
<p>Finally, we used the best-performing models to predict the ADMET properties of 50 compounds, and the final results were entered into &#x201c;ADMET_test.csv.&#x201d;</p>
</sec>
</sec>
</sec>
<sec id="s4-5">
<title>4.5 Multi-objective optimization</title>
<p>
<list list-type="simple">
<list-item>
<p>1. Single-Objective Optimization: Establish a single-objective optimization model with the goal of enhancing the biological activity (pIC50 value) of the compounds while ensuring that at least three ADMET properties perform well.</p>
</list-item>
<list-item>
<p>2. Particle Swarm Optimization (PSO): Utilize the PSO algorithm to globally optimize 106 important features, recording the optimal solution in each iteration, and ultimately finding the value range that provides the best performance in both biological activity and ADMET properties.</p>
</list-item>
<list-item>
<p>3. Final Results: Apply the optimized compound features to 50 test compounds, outputting their optimal predicted values.</p>
</list-item>
</list>
</p>
<sec id="s4-5-1">
<title>4.5.1 Constrained optimization</title>
<p>A constrained optimization problem (COP) involves optimizing an objective function under specific constraints. In this case, we can establish a constrained optimization model to solve the problem.</p>
<sec id="s4-5-1-1">
<title>4.5.1.1 Decision variables</title>
<p>The model established for this problem has a total of 106 molecular descriptors that affect the biological activity and ADMET properties of the compounds. These comprise the 20 descriptors affecting biological activity identified in the first question and the 25 descriptors selected for each of the five ADMET properties in the third question (20 &#x2b; 5 &#xd7; 25 &#x3d; 145 in total), of which 39 are duplicates, leaving 145 &#x2212; 39 &#x3d; 106 distinct descriptors.</p>
<p>The decision variable <inline-formula id="inf98">
<mml:math id="m119">
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is denoted as: <inline-formula id="inf99">
<mml:math id="m120">
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced open="[" close="]" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mn>106</mml:mn>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
</sec>
<sec id="s4-5-1-2">
<title>4.5.1.2 Objective function and constraints</title>
<p>The objective function and constraints are given in <xref ref-type="disp-formula" rid="e20">Equation 20</xref>:<disp-formula id="e20">
<mml:math id="m121">
<mml:mrow>
<mml:mtable columnalign="right">
<mml:mtr>
<mml:mtd/>
<mml:mtd>
<mml:mrow>
<mml:mi>max</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext>&#x2002;</mml:mtext>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mn>50</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd/>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="normal">s</mml:mi>
<mml:mo>.</mml:mo>
<mml:mi mathvariant="normal">t</mml:mi>
<mml:mo>.</mml:mo>
<mml:mtext>&#x2003;Reward</mml:mtext>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2265;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd/>
<mml:mtd>
<mml:mrow>
<mml:mtext>&#x2003;</mml:mtext>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2264;</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2264;</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd/>
<mml:mtd>
<mml:mrow>
<mml:mtext>&#x2003;</mml:mtext>
<mml:mi>x</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="double-struck">R</mml:mi>
<mml:mi>n</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<label>(20)</label>
</disp-formula>
</p>
<p>Where: <inline-formula id="inf100">
<mml:math id="m122">
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> represents the biological activity prediction function for the compound. <inline-formula id="inf101">
<mml:math id="m123">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">g</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mi mathvariant="normal">x</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf102">
<mml:math id="m124">
<mml:mrow>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> represent the classification models for the ADMET properties affecting the compound.</p>
<p>The reward function <inline-formula id="inf103">
<mml:math id="m125">
<mml:mrow>
<mml:mtext>Reward</mml:mtext>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">g</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is given by: <inline-formula id="inf104">
<mml:math id="m126">
<mml:mrow>
<mml:mtext>Reward</mml:mtext>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>4</mml:mn>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>5</mml:mn>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. Here, <inline-formula id="inf105">
<mml:math id="m127">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the Caco-2 classification model, <inline-formula id="inf106">
<mml:math id="m128">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the CYP3A4 classification model, <inline-formula id="inf107">
<mml:math id="m129">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">g</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the hERG classification model, <inline-formula id="inf108">
<mml:math id="m130">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">g</mml:mi>
<mml:mn>4</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the HOB classification model, <inline-formula id="inf109">
<mml:math id="m131">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">g</mml:mi>
<mml:mn>5</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the MN classification model. The most favorable combination is Caco-2 &#x3d; 1, CYP3A4 &#x3d; 1, hERG &#x3d; 0, HOB &#x3d; 1, and MN &#x3d; 0, for which the reward function reaches its maximum value, Reward &#x3d; 5.</p>
<p>The requirement is met as long as the Reward function value is greater than or equal to 3.</p>
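<p>The reward function and the feasibility check it defines can be written out directly. The sketch below assumes each classifier output is a binary class label (0 or 1). The function names reward and is_feasible are illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
def reward(g1, g2, g3, g4, g5):
    """Reward = g1 + g2 + (1 - g3) + g4 + (1 - g5) for binary predictions.

    g1..g5 are the predicted classes (0 or 1) for Caco-2, CYP3A4, hERG,
    HOB and MN. hERG and MN are inverted because class 0 is the
    favorable outcome for those two properties.
    """
    return g1 + g2 + (1 - g3) + g4 + (1 - g5)


def is_feasible(g1, g2, g3, g4, g5, minimum=3):
    """Check the constraint Reward >= 3 (at least three favorable properties)."""
    return reward(g1, g2, g3, g4, g5) >= minimum
```
]]></preformat>
<p>For instance, the predictions (1, 1, 1, 0, 0) give a reward of 1 &#x2b; 1 &#x2b; 0 &#x2b; 0 &#x2b; 1 &#x3d; 3, which satisfies the constraint, whereas (1, 0, 0, 0, 1) gives a reward of 2 and is rejected.</p>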
</sec>
</sec>
<sec id="s4-5-2">
<title>4.5.2 Particle swarm optimization algorithm for finding optimal solutions</title>
<p>Particle Swarm Optimization (PSO) is inspired by the foraging behavior of bird flocks. Particles are designed to simulate birds: each particle represents a feasible solution to the optimization problem and possesses three attributes, namely velocity, position, and fitness value. Each particle independently searches for the best solution in the search space, known as its personal best, and shares it with all particles in the swarm. The best of these personal bests is taken as the current global best solution for the entire swarm. All particles then adjust their velocities and positions based on this global best and their own personal bests until a solution that meets the stopping criteria is found.</p>
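<p>The search procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the inertia weight w, the acceleration coefficients c1 and c2, the velocity clamp of 20% of each variable&#x2019;s range, the swarm size, the iteration count, and the random seed are illustrative choices, and the fitness function is a placeholder for the prediction models of this study.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random


def pso(fitness, lower, upper, n_particles=20, n_iter=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over the box [lower, upper] with a basic PSO loop."""
    rng = random.Random(seed)
    dim = len(lower)
    # Assumed velocity clamp: 20% of each variable's range.
    v_max = [0.2 * (u - l) for l, u in zip(lower, upper)]
    pos = [[rng.uniform(l, u) for l, u in zip(lower, upper)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # personal best positions
    pbest_val = [fitness(p) for p in pos]    # personal best fitness values
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best of the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max[d], min(v_max[d], v))
                # Keep the particle inside the search box.
                pos[i][d] = max(lower[d], min(upper[d], pos[i][d] + vel[i][d]))
            value = fitness(pos[i])
            if value < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], value
                if value < gbest_val:
                    gbest, gbest_val = pos[i][:], value
    return gbest, gbest_val
```
]]></preformat>
<p>Calling pso(lambda x: sum(v * v for v in x), [-5.0, -5.0], [5.0, 5.0]) searches for the minimum of a simple quadratic test function. Because the loop minimizes, a quantity to be maximized can be passed in negated, and a constraint can be handled, for instance, by adding a penalty to the fitness of infeasible positions. Both are common conventions rather than details reported here.</p>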
<p>Assume a swarm of <inline-formula id="inf110">
<mml:math id="m132">
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> particles in a <inline-formula id="inf111">
<mml:math id="m133">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-dimensional target search space. The properties of the <inline-formula id="inf112">
<mml:math id="m134">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th particle at time <inline-formula id="inf113">
<mml:math id="m135">
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> consist of two vectors:<list list-type="simple">
<list-item>
<p>1. Velocity: <inline-formula id="inf114">
<mml:math id="m136">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold">v</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf115">
<mml:math id="m137">
<mml:mrow>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mfenced open="[" close="]" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>min</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>max</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf116">
<mml:math id="m138">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>min</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf117">
<mml:math id="m139">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>max</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are the lower and upper limits imposed on every velocity component, respectively.</p>
</list-item>
<list-item>
<p>2. Position: <inline-formula id="inf118">
<mml:math id="m140">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf119">
<mml:math id="m141">
<mml:mrow>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mfenced open="[" close="]" separators="&#x7c;">
<mml:mrow>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mi>d</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mi>d</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf120">
<mml:math id="m142">
<mml:mrow>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mi>d</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf121">
<mml:math id="m143">
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mi>d</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are the lower and upper bounds of the search space in dimension <italic>d</italic>.</p>
</list-item>
</list>
</p>
<p>In each iteration, two optimal positions are recorded.<list list-type="simple">
<list-item>
<p>1. Individual optimal position: <inline-formula id="inf122">
<mml:math id="m144">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>2. Global optimal position: <inline-formula id="inf123">
<mml:math id="m145">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>; where <inline-formula id="inf124">
<mml:math id="m146">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>M</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</list-item>
</list>
</p>
<p>Based on these definitions, the velocity and position of each particle at time <inline-formula id="inf125">
<mml:math id="m147">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> are updated according to <xref ref-type="disp-formula" rid="e21">Equations 21</xref>, <xref ref-type="disp-formula" rid="e22">22</xref>:<disp-formula id="e21">
<mml:math id="m148">
<mml:mrow>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")" separators="&#x7c;">
<mml:mrow>
<mml:msubsup>
<mml:mi>p</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(21)</label>
</disp-formula>
<disp-formula id="e22">
<mml:math id="m149">
<mml:mrow>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
<label>(22)</label>
</disp-formula>
</p>
<p>Here, <inline-formula id="inf126">
<mml:math id="m150">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf127">
<mml:math id="m151">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are random numbers in the range (0,1), and <inline-formula id="inf128">
<mml:math id="m152">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf129">
<mml:math id="m153">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are the learning factors that weight the pull toward the personal best and the global best, respectively.</p>
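<p>As an illustration, the update in <xref ref-type="disp-formula" rid="e21">Equations 21</xref>, <xref ref-type="disp-formula" rid="e22">22</xref> can be sketched for a single particle as follows. This is a minimal Python sketch rather than the implementation used in this study: the function name <monospace>pso_step</monospace>, its argument names, and the clamping of velocity and position to their ranges are assumptions made for the example.</p>
<preformat><![CDATA[
```python
import random


def pso_step(x, v, p_best, g_best, c1, c2, v_min, v_max, lower, upper, rng):
    """Apply Equations 21 and 22 once to a single particle.

    x, v, p_best, g_best, lower and upper are lists of equal length D.
    Returns the new (position, velocity) pair.
    """
    # r1 and r2 are drawn once per particle and reused for every dimension.
    r1 = rng.random()
    r2 = rng.random()
    new_x, new_v = [], []
    for d in range(len(x)):
        # Equation 21: previous velocity + pull to personal best + pull to global best.
        vd = v[d] + c1 * r1 * (p_best[d] - x[d]) + c2 * r2 * (g_best[d] - x[d])
        # Assumption: out-of-range velocities are clamped to [v_min, v_max].
        vd = max(v_min, min(v_max, vd))
        # Equation 22, followed by clamping the position to [l_d, u_d] (assumption).
        xd = max(lower[d], min(upper[d], x[d] + vd))
        new_v.append(vd)
        new_x.append(xd)
    return new_x, new_v


rng = random.Random(0)  # fixed seed so the example is repeatable
x_new, v_new = pso_step(
    x=[0.0, 0.0], v=[0.0, 0.0],
    p_best=[1.0, 1.0], g_best=[2.0, -2.0],
    c1=0.5, c2=0.5, v_min=-1.0, v_max=1.0,
    lower=[-5.0, -5.0], upper=[5.0, 5.0], rng=rng,
)
print(x_new, v_new)
```
]]></preformat>
<p>Clamping is only one way to keep a particle inside the stated ranges; the update equations themselves do not depend on that choice.</p>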
</sec>
<sec id="s4-5-3">
<title>4.5.3 Model solving</title>
<p>The selected feature variables comprise the 20 variables most highly correlated with biological activity, identified in the first question, and the 25 variables most highly correlated with each of the five ADMET properties, identified in the third question. Of these 145 candidates (20 &#x2b; 5 &#xd7; 25), 39 are duplicates, leaving a total of 106 distinct feature variables.</p>
<p>After repeated trials and adjustments of the PSO settings, we selected the following parameters: inertia weight <italic>w</italic> &#x3d; 0.8 (the factor applied to the previous velocity in the velocity update), cognitive coefficient <inline-formula id="inf130">
<mml:math id="m154">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, and social coefficient <inline-formula id="inf131">
<mml:math id="m155">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. The convergence process is illustrated in <xref ref-type="fig" rid="F12">Figure 12</xref>.</p>
<fig id="F12" position="float">
<label>FIGURE 12</label>
<caption>
<p>Optimal values obtained using the particle swarm optimization algorithm.</p>
</caption>
<graphic xlink:href="fgene-16-1523015-g012.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="F12">Figure 12</xref>, the X-axis shows the PSO iteration number, from 0 to 80, and the Y-axis shows the global best objective function value after each iteration. The blue curve traces this value: it drops rapidly from its initial level during the early iterations and then flattens as it approaches the final converged value.</p>
<p>The curve indicates that the PSO algorithm converges within the 80 iterations shown: the objective function value decreases from an initially high level and then stabilizes, which suggests that the search has settled on a solution.</p>
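<p>A convergence history of this kind can be reproduced in outline with the following Python sketch. The values <italic>w</italic> &#x3d; 0.8, <italic>c</italic><sub>1</sub> &#x3d; <italic>c</italic><sub>2</sub> &#x3d; 0.5 and the 80 iterations are taken from the text; everything else is an assumption made for illustration, including the function name <monospace>pso_minimize</monospace>, the swarm size of 30, the velocity limit of 20% of each range, the random seed, and the toy objective <monospace>sphere</monospace>, which stands in for the objective function of this study.</p>
<preformat><![CDATA[
```python
import random


def pso_minimize(objective, lower, upper, n_particles=30, n_iter=80,
                 w=0.8, c1=0.5, c2=0.5, seed=0):
    """Minimize `objective` over the box [lower, upper].

    Returns (best_position, best_value, history), where history[k] is the
    global best objective value after k iterations.
    """
    rng = random.Random(seed)
    dim = len(lower)
    # Assumed velocity limit: 20% of the search range in each dimension.
    v_max = [0.2 * (upper[d] - lower[d]) for d in range(dim)]
    xs = [[rng.uniform(lower[d], upper[d]) for d in range(dim)]
          for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    p_best = [x[:] for x in xs]
    p_val = [objective(x) for x in xs]
    g = min(range(n_particles), key=lambda i: p_val[i])
    g_best, g_val = p_best[g][:], p_val[g]
    history = [g_val]
    for _ in range(n_iter):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            for d in range(dim):
                # Velocity update with the inertia weight w on the old velocity.
                vd = (w * vs[i][d]
                      + c1 * r1 * (p_best[i][d] - xs[i][d])
                      + c2 * r2 * (g_best[d] - xs[i][d]))
                vs[i][d] = max(-v_max[d], min(v_max[d], vd))
                xs[i][d] = max(lower[d], min(upper[d], xs[i][d] + vs[i][d]))
            f = objective(xs[i])
            if f < p_val[i]:
                p_best[i], p_val[i] = xs[i][:], f
                # The global best is refreshed as soon as a particle improves it.
                if f < g_val:
                    g_best, g_val = xs[i][:], f
        history.append(g_val)
    return g_best, g_val, history


def sphere(x):
    """Toy objective (sum of squares) used only for this illustration."""
    return sum(xd * xd for xd in x)


best_x, best_f, history = pso_minimize(sphere, [-5.0] * 4, [5.0] * 4)
print(len(history), history[0], history[-1])
```
]]></preformat>
<p>Because the global best is replaced only by a better value, the recorded history can never increase, which is the monotone shape expected of a curve such as the one in <xref ref-type="fig" rid="F12">Figure 12</xref>.</p>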
<p>The optimal value ranges for some molecular descriptors are shown in <xref ref-type="table" rid="T9">Table 9</xref>. The complete results are available in the attached document &#x201c;results.csv.&#x201d;</p>
<table-wrap id="T9" position="float">
<label>TABLE 9</label>
<caption>
<p>Optimal value ranges for molecular descriptors.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Molecular descriptors</th>
<th align="center">Optimal value ranges</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">ATSc1</td>
<td align="center">(0.03, 1.89)</td>
</tr>
<tr>
<td align="center">ATSc3</td>
<td align="center">(&#x2212;0.37, &#x2212;0.16)</td>
</tr>
<tr>
<td align="center">BCUTc-1l</td>
<td align="center">(&#x2212;0.32, &#x2212;0.19)</td>
</tr>
<tr>
<td align="center">ATSc2</td>
<td align="center">(&#x2212;2.38, &#x2212;1.00)</td>
</tr>
<tr>
<td align="center">BCUTc-1h</td>
<td align="center">(0.07, 0.33)</td>
</tr>
<tr>
<td align="center">BCUTp-1h</td>
<td align="center">(7.97, 16.75)</td>
</tr>
<tr>
<td align="center">BCUTp-1l</td>
<td align="center">(3.01, 7.00)</td>
</tr>
<tr>
<td align="center">CrippenMR</td>
<td align="center">(56.15, 400.61)</td>
</tr>
<tr>
<td align="center">C3SP2</td>
<td align="center">(0.00, 9.30)</td>
</tr>
<tr>
<td align="center">ECCEN</td>
<td align="center">(196.00, 1294.89)</td>
</tr>
</tbody>
</table>
</table-wrap>
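<p>The ranges in <xref ref-type="table" rid="T9">Table 9</xref> can be used to screen a candidate compound, as in the following Python sketch. The ranges are copied from the table; the function name <monospace>out_of_range</monospace>, the treatment of the interval endpoints as inclusive, and the candidate values are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
# Descriptor ranges copied from Table 9: name -> (lower, upper).
TABLE_9_RANGES = {
    "ATSc1": (0.03, 1.89),
    "ATSc3": (-0.37, -0.16),
    "BCUTc-1l": (-0.32, -0.19),
    "ATSc2": (-2.38, -1.00),
    "BCUTc-1h": (0.07, 0.33),
    "BCUTp-1h": (7.97, 16.75),
    "BCUTp-1l": (3.01, 7.00),
    "CrippenMR": (56.15, 400.61),
    "C3SP2": (0.00, 9.30),
    "ECCEN": (196.00, 1294.89),
}


def out_of_range(descriptors, ranges=TABLE_9_RANGES):
    """Return {name: value} for every supplied descriptor outside its range.

    Descriptors that are not supplied are skipped; endpoints count as inside
    (an assumption, since the table does not state whether they are included).
    """
    violations = {}
    for name, (low, high) in ranges.items():
        if name in descriptors and not (low <= descriptors[name] <= high):
            violations[name] = descriptors[name]
    return violations


# Made-up descriptor values for a hypothetical candidate compound.
candidate = {"ATSc1": 0.50, "ATSc2": -0.40, "CrippenMR": 120.0}
print(out_of_range(candidate))  # prints {'ATSc2': -0.4}
```
]]></preformat>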
</sec>
</sec>
</sec>
<sec sec-type="results" id="s5">
<title>5 Results</title>
<p>This study proposes a machine learning-based optimization model for anti-breast cancer candidate drugs that targets both the biological activity of compounds and their ADMET (absorption, distribution, metabolism, excretion, toxicity) properties. Feature selection on the data set of 1,974 compounds retained 20 molecular descriptors highly correlated with biological activity, and the QSAR (Quantitative Structure-Activity Relationship) model was built on these descriptors. <xref ref-type="table" rid="T10">Table 10</xref> compares the regression and classification models, which were evaluated on their ability to predict biological activity (pIC50 values) and ADMET properties. The metrics are R<sup>2</sup> for the regression task, and F1 score and accuracy for the classification tasks. The stacking ensemble model performed best in predicting biological activity, with an R<sup>2</sup> of 0.743. For ADMET property prediction, the best-scoring model differed by property: LightGBM for Caco-2, XGBoost for CYP3A4, hERG, and MN, and Naive Bayes for HOB.</p>
<table-wrap id="T10" position="float">
<label>TABLE 10</label>
<caption>
<p>Comparison of model performance.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Model</th>
<th align="center">Task</th>
<th align="center">R<sup>2</sup>/F1 score</th>
<th align="center">Accuracy/AUC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">LightGBM</td>
<td align="center">Biological Activity</td>
<td align="center">0.737</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="center">RandomForest</td>
<td align="center">Biological Activity</td>
<td align="center">0.736</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="center">XGBoost</td>
<td align="center">Biological Activity</td>
<td align="center">0.711</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="center">Stacking Ensemble</td>
<td align="center">Biological Activity</td>
<td align="center">
<bold>0.743</bold>
</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="center">LightGBM</td>
<td align="center">Caco-2 Prediction</td>
<td align="center">&#x2014;</td>
<td align="center">
<bold>0.8905</bold>
</td>
</tr>
<tr>
<td align="center">XGBoost</td>
<td align="center">CYP3A4 Prediction</td>
<td align="center">&#x2014;</td>
<td align="center">
<bold>0.9733</bold>
</td>
</tr>
<tr>
<td align="center">XGBoost</td>
<td align="center">hERG Prediction</td>
<td align="center">&#x2014;</td>
<td align="center">
<bold>0.9138</bold>
</td>
</tr>
<tr>
<td align="center">Naive Bayes</td>
<td align="center">HOB Prediction</td>
<td align="center">&#x2014;</td>
<td align="center">
<bold>0.6824</bold>
</td>
</tr>
<tr>
<td align="center">XGBoost</td>
<td align="center">MN Prediction</td>
<td align="center">&#x2014;</td>
<td align="center">
<bold>0.9695</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>With an R<sup>2</sup> of 0.743, the stacking ensemble model outperformed the individual LightGBM (0.737), RandomForest (0.736), and XGBoost (0.711) regressors in predicting biological activity. For ADMET property prediction, XGBoost gave the highest scores for CYP3A4 (0.9733) and MN (0.9695), whereas HOB was the hardest property to predict: Naive Bayes, the best model for it, reached only 0.6824. The Particle Swarm Optimization (PSO) algorithm was then applied to optimize biological activity and ADMET properties jointly. The optimized compounds met the pre-defined combination of ADMET properties and exhibited good predicted biological activity. For the 50 optimized test compounds, the predictions were favorable for both biological activity and ADMET properties, which supports the usefulness of this model in the development of anti-breast cancer drugs.</p>
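<p>For reference, the three metrics named in this section (R<sup>2</sup>, F1 score, and accuracy) follow their standard definitions, which the Python sketch below spells out. The function names and the small arrays of true and predicted values are invented for illustration and are not data from this study.</p>
<preformat><![CDATA[
```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy values: a regression target and a binary class label.
activity_true = [6.0, 7.0, 8.0, 9.0]
activity_pred = [6.5, 6.5, 8.5, 8.5]
labels_true = [1, 1, 1, 0, 0, 0, 1, 0]
labels_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(f"R2 = {r2_score(activity_true, activity_pred):.3f}")
print(f"accuracy = {accuracy(labels_true, labels_pred):.3f}")
print(f"F1 = {f1_score(labels_true, labels_pred):.3f}")
```
]]></preformat>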
</sec>
<sec sec-type="discussion" id="s6">
<title>6 Discussion</title>
<p>The proposed machine learning-based optimization model improves the predicted biological activity of candidate compounds while optimizing their ADMET properties. Several directions remain open for future research and practical application.</p>
<sec id="s6-1">
<title>6.1 Future research directions</title>
<p>With the continuous development of drug discovery and optimization, this study opens several potential avenues for future progress:</p>
<sec id="s6-1-1">
<title>6.1.1 Incorporating more data</title>
<p>While this study primarily relies on molecular descriptors and biological activity data, future research could consider incorporating more diverse datasets, such as gene expression profiles, protein-ligand interactions, and <italic>in vivo</italic> pharmacokinetic data. These additional data could improve the robustness of the model and enhance the generalizability of predictions.</p>
</sec>
<sec id="s6-1-2">
<title>6.1.2 Exploring other optimization algorithms</title>
<p>Although Particle Swarm Optimization (PSO) has shown effective results in multi-objective optimization, exploring other optimization algorithms such as Genetic Algorithms (GA), Differential Evolution (DE), or multi-objective versions of Reinforcement Learning could potentially extend the model&#x2019;s applicability to drug screening and optimization for other diseases.</p>
</sec>
<sec id="s6-1-3">
<title>6.1.3 Applying the model to other cancer types</title>
<p>While this study focuses on breast cancer, the machine learning-based optimization approach can be extended to other types of cancer. Future research can incorporate biomarkers and therapeutic targets specific to different diseases and apply the model to various cancer targets, such as ovarian cancer, lung cancer, or prostate cancer. This would broaden the scope and applicability of the model, making it a valuable tool in the global fight against cancer.</p>
</sec>
</sec>
<sec id="s6-2">
<title>6.2 Practical applications of the model</title>
<p>The model proposed in this study not only provides theoretical insights but also has great potential in the practical application of drug development and personalized medicine:</p>
<sec id="s6-2-1">
<title>6.2.1 Early drug discovery screening</title>
<p>The multi-objective optimization model can be applied in the early stages of drug discovery to screen large compound libraries. By predicting both biological activity and ADMET properties simultaneously, the model can help researchers identify promising lead compounds with favorable characteristics, reducing experimental screening time and costs. This can accelerate the identification of promising drug candidates, especially in cancer treatment.</p>
</sec>
<sec id="s6-2-2">
<title>6.2.2 Personalized cancer therapy</title>
<p>In the context of precision medicine, the model can be used to optimize drugs based on individual patients&#x2019; genomic profiles and tumor characteristics. By predicting how specific compounds interact with a patient&#x2019;s unique molecular features, this approach can contribute to the development of more effective and personalized treatment plans, ultimately improving patient outcomes and reducing side effects.</p>
</sec>
<sec id="s6-2-3">
<title>6.2.3 Optimizing existing drugs</title>
<p>The model can also be applied to optimize existing anti-cancer drugs that are already in clinical use. By fine-tuning their biological activity and ADMET properties, the model can suggest modifications or derivatives of these drugs to overcome existing limitations such as drug resistance, toxicity, or poor bioavailability. This can enhance the therapeutic effectiveness of existing drugs and provide new treatment options for patients.</p>
</sec>
<sec id="s6-2-4">
<title>6.2.4 Integration into drug discovery platforms</title>
<p>In industrial settings, the model can be integrated into drug discovery platforms as a valuable decision-support tool. Pharmaceutical companies can use the model to guide their drug development strategies, especially during the preclinical phase. The ability to predict the combined impact of biological activity and ADMET properties on the success of drug candidates will be a key asset in determining which compounds should proceed to further testing and clinical development.</p>
</sec>
</sec>
</sec>
<sec sec-type="conclusion" id="s7">
<title>7 Conclusion</title>
<p>This study proposes an optimization model for anti-breast cancer candidate drugs based on machine learning and particle swarm optimization, achieving significant results in enhancing the biological activity and ADMET properties of candidate compounds. Through grey relational analysis, Spearman correlation analysis, and SHAP value screening from the random forest model, 20 molecular descriptors most influential to biological activity were successfully selected. A multi-model fusion technique was applied to improve the accuracy of biological activity predictions. The use of efficient classification models in ADMET property prediction further ensures the superior pharmacokinetic performance of candidate drugs. The successful application of the particle swarm optimization algorithm in multi-objective optimization tasks demonstrates its potential in drug design.</p>
<p>The model proposed in this study provides a novel and efficient solution for the field of drug design and development, accelerating the development process of new anti-breast cancer drugs and offering theoretical foundations and technical support for future multi-objective drug optimization. Future research will focus on validation and optimization on large-scale datasets, integrating laboratory data to further improve the performance of machine learning models, thereby achieving a closed-loop development process from computational prediction to experimental validation.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s8">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="sec" rid="s14">Supplementary Material</xref>; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec sec-type="author-contributions" id="s9">
<title>Author contributions</title>
<p>ZD: Data curation, Methodology, Writing &#x2013; original draft, Writing &#x2013; review and editing. HC: Funding acquisition, Resources, Writing &#x2013; review and editing. YY: Data curation, Formal Analysis, Methodology, Visualization, Writing &#x2013; original draft. HH: Data curation, Resources, Writing &#x2013; review and editing.</p>
</sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>The author(s) declare that no financial support was received for the research and/or publication of this article.</p>
</sec>
<sec sec-type="COI-statement" id="s11">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s12">
<title>Generative AI statement</title>
<p>The author(s) declare that no Generative AI was used in the creation of this manuscript.</p>
</sec>
<sec sec-type="disclaimer" id="s13">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s14">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fgene.2025.1523015/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fgene.2025.1523015/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="Presentation1.zip" id="SM1" mimetype="application/zip" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="DataSheet1.zip" id="SM2" mimetype="application/zip" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="DataSheet2.zip" id="SM3" mimetype="application/zip" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ahmad</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Khan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Serdaro&#x11f;lu</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Physicochemical properties, drug likeness, ADMET, DFT studies, and <italic>in vitro</italic> antioxidant activity of oxindole derivatives</article-title>. <source>Comput. Biol. Chem.</source> <volume>104</volume>, <fpage>107861</fpage>. <pub-id pub-id-type="doi">10.1016/j.compbiolchem.2023.107861</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Atallah</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wagener</surname>
<given-names>K. B.</given-names>
</name>
<name>
<surname>Schulz</surname>
<given-names>M. D.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>ADMET: the future revealed</article-title>. <source>Macromolecules</source> <volume>46</volume> (<issue>12</issue>), <fpage>4735</fpage>&#x2013;<lpage>4741</lpage>. <pub-id pub-id-type="doi">10.1021/ma400067b</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Belachew</surname>
<given-names>E. B.</given-names>
</name>
<name>
<surname>Sewasew</surname>
<given-names>D. T.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Molecular Mechanisms of Endocrine Resistance in Estrogen-Positive Breast Cancer</article-title>. <source>Front. Endocrinol.</source> <volume>12</volume>, <fpage>599586</fpage>. <pub-id pub-id-type="doi">10.3389/fendo.2021.599586</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Caron</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Nohria</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Cardiac toxicity from breast cancer treatment: can we avoid this?</article-title>. <source>Curr. Oncol. Rep.</source> <volume>20</volume>, <fpage>61</fpage>. <pub-id pub-id-type="doi">10.1007/s11912-018-0710-1</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>C. H.</given-names>
</name>
<name>
<surname>Tanaka</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Kotera</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Funatsu</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Comparison and improvement of the predictability and interpretability with ensemble learning models in QSPR applications</article-title>. <source>J. Cheminform.</source> <volume>12</volume>, <fpage>19</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-020-0417-9</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Guestrin</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>XGBoost: a scalable tree boosting system</article-title>,&#x201d; in <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>, <fpage>785</fpage>&#x2013;<lpage>794</lpage>. <pub-id pub-id-type="doi">10.1145/2939672.2939785</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cherkasov</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Muratov</surname>
<given-names>E. N.</given-names>
</name>
<name>
<surname>Fourches</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Varnek</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Baskin</surname>
<given-names>I. I.</given-names>
</name>
<name>
<surname>Cronin</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2014</year>). <article-title>QSAR modeling: where have you been? Where are you going to?</article-title>. <source>J. Med. Chem.</source> <volume>57</volume> (<issue>12</issue>), <fpage>4977</fpage>&#x2013;<lpage>5010</lpage>. <pub-id pub-id-type="doi">10.1021/jm4004285</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Deb</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Pratap</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Agarwal</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Meyarivan</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2002</year>). <article-title>A fast and elitist multiobjective genetic algorithm: NSGA-II</article-title>. <source>IEEE Trans. Evol. Comput.</source> <volume>6</volume> (<issue>2</issue>), <fpage>182</fpage>&#x2013;<lpage>197</lpage>. <pub-id pub-id-type="doi">10.1109/4235.996017</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Er-rajy</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>El Fadili</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hadni</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Mrabti</surname>
<given-names>N. N.</given-names>
</name>
<name>
<surname>Zarougui</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Elhallaoui</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>2D-QSAR modeling, drug-likeness studies, ADMET prediction, and molecular docking for anti-lung cancer activity of 3-substituted-5-(phenylamino) indolone derivatives</article-title>. <source>Struct. Chem.</source> <volume>33</volume>, <fpage>973</fpage>&#x2013;<lpage>986</lpage>. <pub-id pub-id-type="doi">10.1007/s11224-022-01913-3</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ferreira</surname>
<given-names>L. L.</given-names>
</name>
<name>
<surname>Andricopulo</surname>
<given-names>A. D.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>ADMET modeling approaches in drug discovery</article-title>. <source>Drug Discov. Today</source> <volume>24</volume> (<issue>5</issue>), <fpage>1157</fpage>&#x2013;<lpage>1165</lpage>. <pub-id pub-id-type="doi">10.1016/j.drudis.2019.03.015</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Giaquinto</surname>
<given-names>A. N.</given-names>
</name>
<name>
<surname>Sung</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>K. D.</given-names>
</name>
<name>
<surname>Kramer</surname>
<given-names>J. L.</given-names>
</name>
<name>
<surname>Newman</surname>
<given-names>L. A.</given-names>
</name>
<name>
<surname>Minihan</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Breast cancer statistics, 2022</article-title>. <source>CA Cancer J. Clin.</source> <volume>72</volume> (<issue>6</issue>), <fpage>524</fpage>&#x2013;<lpage>541</lpage>. <pub-id pub-id-type="doi">10.3322/caac.21754</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hong</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Breast cancer: an up&#x2010;to&#x2010;date review and future perspectives</article-title>. <source>Cancer Commun.</source> <volume>42</volume> (<issue>10</issue>), <fpage>913</fpage>&#x2013;<lpage>936</lpage>. <pub-id pub-id-type="doi">10.1002/cac2.12358</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Roohani</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Leskovec</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Therapeutics data commons: machine learning datasets and tasks for drug discovery and development</article-title>. <source>arXiv Prepr.</source> <comment>Available online at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2102.09548">https://arxiv.org/abs/2102.09548</ext-link>.</comment>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jim&#xe9;nez-Luna</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Grisoni</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Drug discovery with explainable artificial intelligence</article-title>. <source>Nat. Mach. Intell.</source> <volume>2</volume> (<issue>10</issue>), <fpage>573</fpage>&#x2013;<lpage>584</lpage>. <pub-id pub-id-type="doi">10.1038/s42256-020-00236-4</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jim&#xe9;nez-Luna</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Grisoni</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Artificial intelligence in drug discovery: recent advances and future perspectives</article-title>. <source>Expert Opin. Drug Discov.</source> <volume>16</volume> (<issue>9</issue>), <fpage>949</fpage>&#x2013;<lpage>959</lpage>. <pub-id pub-id-type="doi">10.1080/17460441.2021.1909567</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Komura</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Watanabe</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Mizuguchi</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>The trends and future prospective of <italic>in silico</italic> models from the viewpoint of ADME evaluation in drug discovery</article-title>. <source>Pharmaceutics</source> <volume>15</volume> (<issue>11</issue>), <fpage>2619</fpage>. <pub-id pub-id-type="doi">10.3390/pharmaceutics15112619</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Larroquette</surname>
<given-names>C. A.</given-names>
</name>
<name>
<surname>Hortobagyi</surname>
<given-names>G. N.</given-names>
</name>
<name>
<surname>Buzdar</surname>
<given-names>A. U.</given-names>
</name>
<name>
<surname>Holmes</surname>
<given-names>F. A.</given-names>
</name>
</person-group> (<year>1986</year>). <article-title>Subclinical hepatic toxicity during combination chemotherapy for breast cancer</article-title>. <source>JAMA</source> <volume>256</volume> (<issue>21</issue>), <fpage>2988</fpage>&#x2013;<lpage>2990</lpage>. <pub-id pub-id-type="doi">10.1001/jama.1986.03380210084030</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lei</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Hou</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>ADMET evaluation in drug discovery: 15. Accurate prediction of rat oral acute toxicity using relevance vector machine and consensus modeling</article-title>. <source>J. Cheminform.</source> <volume>8</volume>, <fpage>6</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-016-0117-7</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zuo</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Data-driven prediction of building energy consumption using an adaptive multi-model fusion approach</article-title>. <source>Appl. Soft Comput.</source> <volume>129</volume>, <fpage>109616</fpage>. <pub-id pub-id-type="doi">10.1016/j.asoc.2022.109616</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Xiu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Qiang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Latent chemical space searching for plug-in multi-objective molecule generation</article-title>. <comment>arXiv preprint arXiv:2404.06691</comment>. <pub-id pub-id-type="doi">10.48550/arXiv.2404.06691</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>van Vlijmen</surname>
<given-names>H. W. T.</given-names>
</name>
<name>
<surname>Emmerich</surname>
<given-names>M. T. M.</given-names>
</name>
<name>
<surname>IJzerman</surname>
<given-names>A. P.</given-names>
</name>
<name>
<surname>van Westen</surname>
<given-names>G. J. P.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>DrugEx v2: <italic>de novo</italic> design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology</article-title>. <source>J. Cheminform.</source> <volume>13</volume>, <fpage>85</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-021-00561-9</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lumachi</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Luisetto</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Basso</surname>
<given-names>S. M. M.</given-names>
</name>
<name>
<surname>Basso</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Brunello</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Camozzi</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Endocrine therapy of breast cancer</article-title>. <source>Curr. Med. Chem.</source> <volume>18</volume> (<issue>4</issue>), <fpage>513</fpage>&#x2013;<lpage>522</lpage>. <pub-id pub-id-type="doi">10.2174/092986711794480177</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luukkonen</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>van den Maagdenberg</surname>
<given-names>H. W.</given-names>
</name>
<name>
<surname>Emmerich</surname>
<given-names>M. T.</given-names>
</name>
<name>
<surname>van Westen</surname>
<given-names>G. J.</given-names>
</name>
</person-group> (<year>2023a</year>). <article-title>Artificial intelligence in multi-objective drug design</article-title>. <source>Curr. Opin. Struct. Biol.</source> <volume>79</volume>, <fpage>102537</fpage>. <pub-id pub-id-type="doi">10.1016/j.sbi.2023.102537</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mak</surname>
<given-names>K. K.</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>Y. H.</given-names>
</name>
<name>
<surname>Pichika</surname>
<given-names>M. R.</given-names>
</name>
</person-group> (<year>2023</year>). &#x201c;<article-title>Artificial intelligence in drug discovery and development</article-title>,&#x201d; in <source>Drug discovery and evaluation: safety and pharmacokinetic assays</source>. Editors <person-group person-group-type="editor">
<name>
<surname>Hock</surname>
<given-names>F. J.</given-names>
</name>
<name>
<surname>Pugsley</surname>
<given-names>M. K.</given-names>
</name>
</person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>). <pub-id pub-id-type="doi">10.1007/978-3-030-73317-9_92-1</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marra</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Trapani</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Viale</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Criscitiello</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Curigliano</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Practical classification of triple-negative breast cancer: intratumoral heterogeneity, mechanisms of drug resistance, and novel therapies</article-title>. <source>npj Breast Cancer</source> <volume>6</volume>, <fpage>54</fpage>. <pub-id pub-id-type="doi">10.1038/s41523-020-00197-2</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Merk</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Friedrich</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Grisoni</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title><italic>De novo</italic> design of bioactive small molecules by artificial intelligence</article-title>. <source>Mol. Inf.</source> <volume>37</volume> (<issue>1-2</issue>), <fpage>1700153</fpage>. <pub-id pub-id-type="doi">10.1002/minf.201700153</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Poli</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Kennedy</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Blackwell</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Particle swarm optimization</article-title>. <source>Swarm Intell.</source> <volume>1</volume>, <fpage>33</fpage>&#x2013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1007/s11721-007-0002-0</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rodrigues</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Accessing new chemical entities through generative artificial intelligence</article-title>. <source>Nat. Rev. Drug Discov.</source> <volume>21</volume> (<issue>3</issue>), <fpage>175</fpage>&#x2013;<lpage>176</lpage>. <pub-id pub-id-type="doi">10.1038/d41573-022-00012-5</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schneider</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Walters</surname>
<given-names>W. P.</given-names>
</name>
<name>
<surname>Plowright</surname>
<given-names>A. T.</given-names>
</name>
<name>
<surname>Sieroka</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Listgarten</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Goodnow</surname>
<given-names>R. A.</given-names>
<suffix>Jr</suffix>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Rethinking drug design in the artificial intelligence era</article-title>. <source>Nat. Rev. Drug Discov.</source> <volume>19</volume> (<issue>5</issue>), <fpage>353</fpage>&#x2013;<lpage>364</lpage>. <pub-id pub-id-type="doi">10.1038/s41573-019-0050-3</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shou</surname>
<given-names>W. Z.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Current status and future directions of high-throughput ADME screening in drug discovery</article-title>. <source>J. Pharm. Analysis</source> <volume>10</volume> (<issue>3</issue>), <fpage>201</fpage>&#x2013;<lpage>208</lpage>. <pub-id pub-id-type="doi">10.1016/j.jpha.2020.05.004</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stokes</surname>
<given-names>J. M.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Swanson</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Jin</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Cubillos-Ruiz</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Donghia</surname>
<given-names>N. M.</given-names>
</name>
<etal/>
</person-group> (<year>2020a</year>). <article-title>A deep learning approach to antibiotic discovery</article-title>. <source>Cell</source> <volume>180</volume> (<issue>4</issue>), <fpage>688</fpage>&#x2013;<lpage>702.e13</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2020.01.021</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sung</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ferlay</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Siegel</surname>
<given-names>R. L.</given-names>
</name>
<name>
<surname>Laversanne</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Soerjomataram</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Jemal</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries</article-title>. <source>CA A Cancer J. Clin.</source> <volume>71</volume> (<issue>3</issue>), <fpage>209</fpage>&#x2013;<lpage>249</lpage>. <pub-id pub-id-type="doi">10.3322/caac.21660</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vamathevan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Clark</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Czodrowski</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Dunham</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Ferran</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Applications of machine learning in drug discovery and development</article-title>. <source>Nat. Rev. Drug Discov.</source> <volume>18</volume> (<issue>6</issue>), <fpage>463</fpage>&#x2013;<lpage>477</lpage>. <pub-id pub-id-type="doi">10.1038/s41573-019-0024-5</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Waks</surname>
<given-names>A. G.</given-names>
</name>
<name>
<surname>Winer</surname>
<given-names>E. P.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Breast cancer treatment: a review</article-title>. <source>JAMA</source> <volume>321</volume> (<issue>3</issue>), <fpage>288</fpage>&#x2013;<lpage>300</lpage>. <pub-id pub-id-type="doi">10.1001/jama.2018.19323</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Particle swarm optimization algorithm: an overview</article-title>. <source>Soft Comput.</source> <volume>22</volume> (<issue>2</issue>), <fpage>387</fpage>&#x2013;<lpage>408</lpage>. <pub-id pub-id-type="doi">10.1007/s00500-016-2474-6</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Pei</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lai</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Deep learning for drug-induced liver injury</article-title>. <source>J. Chem. Inf. Model.</source> <volume>55</volume> (<issue>10</issue>), <fpage>2085</fpage>&#x2013;<lpage>2093</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.5b00238</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhavoronkov</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Ivanenkov</surname>
<given-names>Y. A.</given-names>
</name>
<name>
<surname>Aliper</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Veselov</surname>
<given-names>M. S.</given-names>
</name>
<name>
<surname>Aladinskiy</surname>
<given-names>V. A.</given-names>
</name>
<name>
<surname>Aladinskaya</surname>
<given-names>A. V.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Deep learning enables rapid identification of potent DDR1 kinase inhibitors</article-title>. <source>Nat. Biotechnol.</source> <volume>37</volume> (<issue>9</issue>), <fpage>1038</fpage>&#x2013;<lpage>1040</lpage>. <pub-id pub-id-type="doi">10.1038/s41587-019-0224-x</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zitnik</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Leskovec</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Goldenberg</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hoffman</surname>
<given-names>M. M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Machine learning for integrating data in biology and medicine: principles, practice, and opportunities</article-title>. <source>Inf. Fusion</source> <volume>50</volume>, <fpage>71</fpage>&#x2013;<lpage>91</lpage>. <pub-id pub-id-type="doi">10.1016/j.inffus.2018.09.012</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>