<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="review-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Bioinform.</journal-id>
<journal-title>Frontiers in Bioinformatics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Bioinform.</abbrev-journal-title>
<issn pub-type="epub">2673-7647</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1520382</article-id>
<article-id pub-id-type="doi">10.3389/fbinf.2025.1520382</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Bioinformatics</subject>
<subj-group>
<subject>Mini Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Machine learning approaches for predicting protein-ligand binding sites from sequence data</article-title>
<alt-title alt-title-type="left-running-head">Vural and Jololian</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fbinf.2025.1520382">10.3389/fbinf.2025.1520382</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Vural</surname>
<given-names>Orhun</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2883514/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/Writing - review &#x26; editing/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Jololian</surname>
<given-names>Leon</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2092142/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/Writing - review &#x26; editing/"/>
</contrib>
</contrib-group>
<aff>
<institution>Department of Electrical and Computer Engineering</institution>, <institution>The University of Alabama at Birmingham</institution>, <addr-line>Birmingham</addr-line>, <addr-line>AL</addr-line>, <country>United States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/335670/overview">Wen Wei</ext-link>, Arizona State University, United States</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1017704/overview">Kumar Yugandhar</ext-link>, Cornell University, United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2328294/overview">Minh Nguyen</ext-link>, Bioinformatics Institute (A&#x2217;STAR), Singapore</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Orhun Vural, <email>orhun@uab.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>03</day>
<month>02</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>5</volume>
<elocation-id>1520382</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>10</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>10</day>
<month>01</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2025 Vural and Jololian.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Vural and Jololian</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Proteins, composed of amino acids, are crucial for a wide range of biological functions. Among their various interaction sites, protein-ligand binding sites are essential for molecular interactions and biochemical reactions because they enable proteins to bind other molecules. Accurate prediction of these binding sites is pivotal in computational drug discovery, helping to identify therapeutic targets and facilitate treatment development. Machine learning has made significant contributions to this field by improving the prediction of protein-ligand interactions. This paper reviews studies that use machine learning to predict protein-ligand binding sites from sequence data, focusing on recent advancements. The review examines various embedding methods and machine learning architectures, addressing current challenges and ongoing debates in the field. Additionally, research gaps in the existing literature are highlighted, and potential future directions for advancing the field are discussed. This study provides an overview of sequence-based approaches for predicting protein-ligand binding sites, offering insights into the current state of research and future possibilities.</p>
</abstract>
<kwd-group>
<kwd>protein-ligand binding sites</kwd>
<kwd>computational drug discovery</kwd>
<kwd>sequence-based methods</kwd>
<kwd>deep learning</kwd>
<kwd>binding prediction</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Protein Bioinformatics</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Protein-ligand binding sites are specific regions on proteins where various ligands, including small organic molecules, peptides, nucleotides, and proteins, can bind (<xref ref-type="bibr" rid="B93">Zhao et al., 2020</xref>). Although experimental laboratory methods identify these regions with the highest accuracy, they are generally costly and time-consuming (<xref ref-type="bibr" rid="B63">Sadybekov and Katritch, 2023</xref>). Therefore, computational approaches to drug discovery have become increasingly important. These computational methods reduce costs and speed up the identification and optimization of potential drug candidates (<xref ref-type="bibr" rid="B26">Gupta et al., 2021</xref>). Predicting protein-ligand binding sites is a critical component of computational drug discovery, essential for pinpointing viable drug targets and advancing the development of new therapeutics (<xref ref-type="bibr" rid="B69">Stank et al., 2016</xref>). Recent advancements in machine learning have significantly improved this field by introducing sophisticated computational techniques to analyze the complex interactions between proteins and ligands (<xref ref-type="bibr" rid="B83">Xia et al., 2024</xref>). While traditional methods based on geometry, energy, or templates have been successful, deep learning has recently been reported to outperform them (<xref ref-type="bibr" rid="B23">Gagliardi et al., 2022</xref>). Deep learning models can learn complex patterns directly from raw data and often generalize better across diverse datasets. Computational methods for protein-ligand binding site prediction fall into two main categories based on input type: structure-based and sequence-based (<xref ref-type="bibr" rid="B24">Gamouh et al., 2023</xref>; <xref ref-type="bibr" rid="B33">Hosseini et al., 2024</xref>).</p>
<p>Structure-based methods in computational drug discovery (SBDD) utilize detailed knowledge of the spatial information of proteins and integrate chemical properties using methods such as voxel-grid techniques (<xref ref-type="bibr" rid="B72">Sunseri and Koes, 2020</xref>). <xref ref-type="fig" rid="F1">Figure 1</xref> presents a 3D view of the 6Y3C protein and its associated ligands (<xref ref-type="bibr" rid="B54">Miciaccia et al., 2021</xref>). The number of protein-ligand binding sites on a protein can vary widely, depending on the specific protein and its function. In <xref ref-type="fig" rid="F1">Figure 1A</xref>, the regions highlighted in yellow, blue, and purple represent protein-ligand binding sites. <xref ref-type="fig" rid="F1">Figure 1B</xref> focuses on one of the binding sites shown in <xref ref-type="fig" rid="F1">Figure 1A</xref>, offering a closer view of how the ligand interacts with the binding pocket. <xref ref-type="fig" rid="F1">Figure 1C</xref> highlights the specific interactions between the ligand and the surrounding amino acid residues. In recent years, deep learning techniques used to identify these regions have often approached the problem as either image segmentation or object detection within structure-based frameworks. For instance, studies like RefinePocket (<xref ref-type="bibr" rid="B51">Liu et al., 2023</xref>), Kalasanty (<xref ref-type="bibr" rid="B70">Stepniewska-Dziubinska et al., 2020</xref>), PointSite (<xref ref-type="bibr" rid="B85">Yan et al., 2022</xref>), and DeepPocket (<xref ref-type="bibr" rid="B3">Aggarwal et al., 2021</xref>) use image segmentation techniques for binding site prediction, while RecurPocket (<xref ref-type="bibr" rid="B49">Li et al., 2022</xref>) and FRSite (<xref ref-type="bibr" rid="B35">Jiang et al., 2019</xref>) employ object detection techniques. 
Structure-based approaches depend on high-resolution 3D protein structures from X-ray crystallography or NMR spectroscopy (<xref ref-type="bibr" rid="B53">Maveyraud and Mourey, 2020</xref>). These methods face challenges such as reliance on accurate structures, static views of dynamic proteins, and high time and cost demands. AlphaFold (<xref ref-type="bibr" rid="B1">Abramson et al., 2024</xref>) has revolutionized the prediction of 3D protein structures, significantly reducing reliance on experimental methods. However, drug discovery still primarily depends on 1D amino acid sequence data for critical tasks. Advancing approaches like AlphaFold requires a deeper understanding of the 1D sequence data used as input. This topic is further explored in the Analysis and Discussion section.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>
<bold>(A)</bold> Three binding site regions of the 6Y3C protein in blue, yellow, and purple. <bold>(B)</bold> Close-up of a binding site with its ligand. <bold>(C)</bold> Ligand (orange) binding to a site. Generated with PyMOL (<xref ref-type="bibr" rid="B66">Schrodinger, 2015</xref>).</p>
</caption>
<graphic xlink:href="fbinf-05-1520382-g001.tif"/>
</fig>
<p>Sequence-based methods utilize one-dimensional (1D) amino acid sequence data as input. The 1D sequence is a direct representation of the protein&#x2019;s genetic blueprint and is experimentally measurable with high reliability (<xref ref-type="bibr" rid="B4">Alfaro et al., 2021</xref>). Sequence-based methods are less computationally intensive, do not require high-resolution structural data, and can be applied to a wider variety of proteins, including those for which structural information is unavailable. There are many more known protein sequences than experimentally determined structures (<xref ref-type="bibr" rid="B12">Chelur and Priyakumar, 2022</xref>). The general workflow of sequence-based binding site identification takes a protein sequence as input and proceeds through feature extraction, model training, prediction, and evaluation. The first step, feature extraction, is challenging due to the complexity and diversity of proteins. It converts linear sequence data into numerical vectors that represent the protein&#x2019;s functional and structural characteristics. Effective feature extraction is critical because the quality of the numerical representation directly impacts the performance of subsequent machine learning models. Feature extraction techniques include binary representation, which encodes the presence or absence of specific amino acids; physicochemical representation, which considers the chemical and physical properties of amino acids; evolution-based representation, which leverages evolutionary information from multiple sequence alignments; and structure or machine learning-based representations, which use structural data or advanced algorithms to infer relevant features (<xref ref-type="bibr" rid="B36">Jing et al., 2019</xref>). Once the protein sequence is converted into a numerical format, it can be used to train machine learning models on datasets with known binding sites. 
These datasets provide the necessary ground truth for model training and validation. The datasets most frequently employed in the literature are sc-PDB (<xref ref-type="bibr" rid="B17">Desaphy et al., 2015</xref>), COACH420 (<xref ref-type="bibr" rid="B44">Kriv&#xe1;k and Hoksza, 2018</xref>), HOLO4k (<xref ref-type="bibr" rid="B44">Kriv&#xe1;k and Hoksza, 2018</xref>), PDBbind (<xref ref-type="bibr" rid="B52">Liu et al., 2015</xref>), CSAR NRC-HiQ (<xref ref-type="bibr" rid="B19">Dunbar et al., 2013</xref>), UniProt (<xref ref-type="bibr" rid="B75">UniProt Consortium, 2015</xref>), Pfam (<xref ref-type="bibr" rid="B22">Finn et al., 2014</xref>), BioLip (<xref ref-type="bibr" rid="B88">Zhang et al., 2024</xref>), and PiSite (<xref ref-type="bibr" rid="B30">Higurashi et al., 2009</xref>). Each dataset has its unique characteristics and specific applications, contributing to the robustness and generalizability of the trained models. For instance, COACH420, derived from the COACH (<xref ref-type="bibr" rid="B87">Yang et al., 2013b</xref>) test set, is a widely recognized benchmark dataset that includes 420 protein-ligand complexes, each consisting of a single-chain protein bound to a small-molecule ligand. HOLO4k is a larger and more challenging dataset with 4,009 protein-ligand complexes; it includes multi-chain structures and therefore covers a wider range of protein binding scenarios.</p>
<p>In this paper, we focus on sequence-based protein-ligand binding site prediction studies that employ machine learning techniques. <xref ref-type="table" rid="T1">Table 1</xref> summarizes these studies by their feature extraction techniques and machine learning models. The Analysis and Discussion section provides a detailed evaluation of the machine learning models listed in <xref ref-type="table" rid="T1">Table 1</xref>, highlighting the strengths, limitations, and research gaps of sequence-based approaches. Additionally, potential future directions are outlined in the Future Directions section.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Sequence-based machine learning models for predicting protein-ligand binding sites.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Model</th>
<th align="center">Feature extraction methods</th>
<th align="center">Machine learning model<xref ref-type="table-fn" rid="Tfn1">
<sup>a</sup>
</xref>
</th>
<th align="center">Dataset</th>
<th align="center">Evaluation metric</th>
<th align="center">Reported score<xref ref-type="table-fn" rid="Tfn2">
<sup>b</sup>
</xref>
</th>
<th align="center">Year</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">SCRIBER (<xref ref-type="bibr" rid="B90">Zhang and Kurgan, 2019</xref>)</td>
<td align="center">ASAquick, HHblits, ANCHOR, PSIPRED, AAindex</td>
<td align="center">Logistic Regression</td>
<td align="center">BioLip, UniProt, Pfam</td>
<td align="center">MCC</td>
<td align="center">0.230</td>
<td align="center">2019</td>
</tr>
<tr>
<td align="center">DeepCSeqSite (<xref ref-type="bibr" rid="B16">Cui et al., 2019</xref>)</td>
<td align="center">PSSpred, ANGLOR, Jensen-Shannon divergence (JSD), Relative entropy</td>
<td align="center">Deep Convolutional Neural Network</td>
<td align="center">BioLip</td>
<td align="center">MCC</td>
<td align="center">0.496</td>
<td align="center">2019</td>
</tr>
<tr>
<td align="center">DELIA (<xref ref-type="bibr" rid="B82">Xia et al., 2020</xref>)</td>
<td align="center">PSI-BLAST, HHblits, SCRATCH-1D, S-SITE</td>
<td align="center">ResNet &#x2b; BiLSTM</td>
<td align="center">BioLip, ATPBind</td>
<td align="center">MCC</td>
<td align="center">0.469</td>
<td align="center">2020</td>
</tr>
<tr>
<td align="center">HoTS (<xref ref-type="bibr" rid="B48">Lee and Nam, 2022</xref>)</td>
<td align="center">1D-CNN, hierarchical recurrent neural network</td>
<td align="center">CNN &#x2b; Transformers</td>
<td align="center">sc-PDB, PDBbind, COACH420, HOLO4k</td>
<td align="center">Top-n success rate (%)</td>
<td align="center">66.3 &#xb1; 0.9</td>
<td align="center">2022</td>
</tr>
<tr>
<td align="center">Birds (<xref ref-type="bibr" rid="B12">Chelur and Priyakumar, 2022</xref>)</td>
<td align="center">DeepMSA, PSIPRED, SOLVPRED</td>
<td align="center">ResNet</td>
<td align="center">sc-PDB</td>
<td align="center">MCC</td>
<td align="center">0.568</td>
<td align="center">2022</td>
</tr>
<tr>
<td align="center">T5 GAT Ensemble (<xref ref-type="bibr" rid="B24">Gamouh et al., 2023</xref>)</td>
<td align="center">ProtT5</td>
<td align="center">Graph Neural Network &#x2b; Attention</td>
<td align="center">BioLip, RCSB</td>
<td align="center">MCC</td>
<td align="center">0.592</td>
<td align="center">2023</td>
</tr>
<tr>
<td align="center">LaMPSite (<xref ref-type="bibr" rid="B92">Zhang and Xie, 2023</xref>)</td>
<td align="center">ESM-2, RDKit</td>
<td align="center">Pooling &#x2b; Clustering</td>
<td align="center">sc-PDB, COACH420</td>
<td align="center">Top-n success rate (%)</td>
<td align="center">66.02</td>
<td align="center">2023</td>
</tr>
<tr>
<td align="center">Pseq2Sites (<xref ref-type="bibr" rid="B68">Seo et al., 2024</xref>)</td>
<td align="center">ProtTrans</td>
<td align="center">CNN &#x2b; Attention</td>
<td align="center">COACH420, HOLO4k, CSAR</td>
<td align="center">Top-n success rate (%)</td>
<td align="center">96.8</td>
<td align="center">2024</td>
</tr>
<tr>
<td align="center">Seq-InSite (<xref ref-type="bibr" rid="B33">Hosseini et al., 2024</xref>)</td>
<td align="center">ProtT5, MSA</td>
<td align="center">MLP &#x2b; LSTM</td>
<td align="center">PiSite</td>
<td align="center">MCC</td>
<td align="center">0.462</td>
<td align="center">2024</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="Tfn1">
<label>
<sup>a</sup>
</label>
<p>The Machine Learning Model column catalogs foundational models that constitute the core framework of the research presented, although the architecture of these studies may incorporate additional models.</p>
</fn>
<fn id="Tfn2">
<label>
<sup>b</sup>
</label>
<p>The reported results are taken from the original publications. Direct comparisons between these values may not be valid because of differences in methodologies, preprocessing steps, and test datasets. Where separate results were provided for each ligand type, their average is reported.</p>
</fn>
</table-wrap-foot>
</table-wrap>
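<p>Several entries in <xref ref-type="table" rid="T1">Table 1</xref> report the Matthews correlation coefficient (MCC), which is well suited to binding-site prediction because binding residues are typically far rarer than non-binding residues. The following minimal sketch computes MCC from per-residue binary labels. The function name and the convention of returning 0 when the denominator vanishes are illustrative assumptions and are not taken from any of the reviewed studies.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary per-residue labels (1 = binding, 0 = non-binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```
]]></preformat>
<p>MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, so a model that labels every residue as non-binding scores 0 even though its raw accuracy may be high.</p>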
</sec>
<sec id="s2">
<title>2 Sequence-based computational methods</title>
<p>Proteins are composed of a set of amino acids, each represented by a unique symbol (e.g., &#x201c;A&#x201d; for Alanine, &#x201c;G&#x201d; for Glycine). Similar to human language, which consists of sequences of words that convey meaning, protein sequences are structured in specific patterns that hold significant biological information. To analyze these sequences, feature engineering techniques are employed to derive meaningful attributes from the data. Machine learning models are then trained on these features to predict protein-ligand interactions or other relevant biological properties.</p>
<sec id="s2-1">
<title>2.1 Feature engineering</title>
<p>Sequence-based methods leverage sequence data to capture biochemical and biophysical properties without direct 3D structural information. Several review papers provide detailed overviews of embedding approaches for protein sequences (<xref ref-type="bibr" rid="B36">Jing et al., 2019</xref>; <xref ref-type="bibr" rid="B34">Ibtehaz and Kihara, 2023</xref>; <xref ref-type="bibr" rid="B78">Villegas-Morcillo et al., 2022</xref>; <xref ref-type="bibr" rid="B91">Zhang and Liu, 2019</xref>; <xref ref-type="bibr" rid="B32">Hoksza and Gamouh, 2022</xref>; <xref ref-type="bibr" rid="B74">Tran et al., 2023</xref>). Embedding methods have been categorized in various ways across different studies. <xref ref-type="bibr" rid="B36">Jing et al. (2019)</xref> classified these methods into five categories based on their information sources and methodologies: binary encoding, physicochemical property encoding, evolution-based encoding, structure-based encoding, and machine learning-based encoding. In this review, we group embedding methods into two categories: traditional embedding methods and machine learning-based embedding methods.</p>
<p>Transformer-based models (<xref ref-type="bibr" rid="B76">Vaswani et al., 2017</xref>) have gained popularity for applying linguistic analogies to protein sequences. For example, ProtTrans (<xref ref-type="bibr" rid="B20">Elnaggar et al., 2021</xref>), ESM-1b (<xref ref-type="bibr" rid="B62">Rives et al., 2021</xref>), and ESM-MSA (<xref ref-type="bibr" rid="B60">Rao et al., 2021</xref>) are transformer-based protein language models used for feature extraction. ProtTrans includes models like ProtBert and ProtT5, leveraging the transformer architecture to process large-scale protein datasets and produce sequence embeddings. ProtBert has 420 million parameters and was trained on 2 billion protein sequences. ESM-1b employs a transformer-based architecture to generate embeddings for protein sequences and has been trained on 250 million protein sequences. ESM-MSA is another protein language model that uses multiple sequence alignments (MSAs) from UniRef50 (<xref ref-type="bibr" rid="B73">Suzek et al., 2007</xref>) as input, interleaving row and column attention. It is trained on 26 million MSAs. Other popular advanced embedding methods for protein sequences are ProtVec (<xref ref-type="bibr" rid="B6">Asgari and Mofrad, 2015</xref>), SeqVec (<xref ref-type="bibr" rid="B29">Heinzinger et al., 2019</xref>), and UniRep (<xref ref-type="bibr" rid="B5">Alley et al., 2019</xref>). ProtVec uses the skip-gram-based Word2Vec model (<xref ref-type="bibr" rid="B55">Mikolov et al., 2013</xref>) to treat amino acid k-mers like words. It is trained on a corpus of 546,790 sequences obtained from Swiss-Prot (<xref ref-type="bibr" rid="B7">Boutet et al., 2007</xref>). SeqVec uses the Embeddings from Language Models (ELMo) (<xref ref-type="bibr" rid="B64">Sarzynska-Wawer et al., 2021</xref>) approach, which generates context-aware embeddings by considering the surrounding amino acids in a sequence. 
UniRep, based on a multiplicative Long Short-Term Memory (mLSTM) model (<xref ref-type="bibr" rid="B43">Krause et al., 2016</xref>), captures essential biochemical properties by predicting the next amino acid in a sequence and is trained on approximately 24 million protein sequences from UniRef50.</p>
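<p>The idea of treating amino acid k-mers like words can be sketched as a simple tokenization step. The example below splits a sequence into shifted lists of non-overlapping k-mers, which could then serve as sentences for a skip-gram model. The function name, the choice of k = 3, and the use of k shifted reading frames are assumptions made for illustration and should be checked against the original ProtVec publication.</p>
<preformat preformat-type="code"><![CDATA[
```python
def kmer_sentences(sequence, k=3):
    """Split a protein sequence into k shifted lists of non-overlapping k-mers."""
    sentences = []
    for offset in range(k):
        # Step through the sequence k residues at a time, starting at `offset`;
        # trailing residues that do not fill a whole k-mer are dropped.
        words = [sequence[i:i + k]
                 for i in range(offset, len(sequence) - k + 1, k)]
        sentences.append(words)
    return sentences


if __name__ == "__main__":
    for words in kmer_sentences("MKTAYIAK"):
        print(words)
```
]]></preformat>
<p>Each list of k-mers plays the role of a sentence, so any word-embedding trainer that accepts tokenized sentences could in principle be applied to the output.</p>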
<p>In addition to these protein language models, various other methods can be employed to create feature maps from protein sequences. These include 1D-CNN-derived features, relative solvent accessibility (RSA), position-specific score matrices (PSSM), secondary structure (SS), token embeddings, segment embeddings, one-hot encoding, conservation scores (CS), amino acid composition (AAC), and physicochemical properties, among others (<xref ref-type="bibr" rid="B46">Laine et al., 2021</xref>; <xref ref-type="bibr" rid="B25">Guo et al., 2021</xref>; <xref ref-type="bibr" rid="B58">Raj and Chandra, 2024</xref>). Many specialized tools and software have been developed to calculate these features, enabling the generation of comprehensive feature maps from protein sequences.</p>
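<p>Two of the simplest features named above, one-hot encoding and amino acid composition (AAC), can be computed directly from the sequence string. The sketch below is illustrative only: the alphabet ordering, the function names, and the choice to encode non-standard residues as all-zero rows are assumptions rather than conventions taken from the reviewed tools.</p>
<preformat preformat-type="code"><![CDATA[
```python
# The 20 standard amino acids; the ordering is an arbitrary assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-dimensional binary row per residue."""
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:  # unknown residues (e.g. 'X') stay all-zero
            row[idx] = 1
        rows.append(row)
    return rows


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    total = len(sequence)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [sequence.count(aa) / total for aa in AMINO_ACIDS]
```
]]></preformat>
<p>One-hot encoding yields a per-residue feature matrix, whereas AAC yields a single fixed-length vector per protein, so the two are typically used at different levels of a prediction pipeline.</p>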
</sec>
<sec id="s2-2">
<title>2.2 Methodological approaches</title>
<p>
<xref ref-type="table" rid="T1">Table 1</xref> lists studies that focus on sequence-based protein binding site prediction. In this section, we provide an overview of each model included in <xref ref-type="table" rid="T1">Table 1</xref>, highlighting the feature extraction techniques employed and the specific machine learning algorithms applied.</p>
<p>SCRIBER (<xref ref-type="bibr" rid="B90">Zhang and Kurgan, 2019</xref>) converts input protein sequences into profiles representing structural, evolutionary, and physicochemical properties. These profiles include relative solvent accessibility (RSA) values predicted by ASAquick (<xref ref-type="bibr" rid="B21">Faraggi et al., 2014</xref>), which estimates the accessible surface area (ASA) of each residue from sequence-derived features alone, without relying on 3D protein structures. Other features include evolutionary conservation values from HHblits (<xref ref-type="bibr" rid="B61">Remmert et al., 2012</xref>), relative amino acid propensity (RAAP) scores, protein-binding disorder from ANCHOR (<xref ref-type="bibr" rid="B18">Doszt&#xe1;nyi et al., 2009</xref>), secondary structure from the sequence-based tool PSIPRED (<xref ref-type="bibr" rid="B9">Buchan et al., 2013</xref>), and physicochemical properties (charge, hydrophobicity, and polarity) from the AAindex resource (<xref ref-type="bibr" rid="B39">Kawashima et al., 2007</xref>). SCRIBER employs a logistic regression model (<xref ref-type="bibr" rid="B15">Cramer, 2002</xref>) to predict protein-binding residues. SCRIBER processes a protein in approximately 45 s, faster than PSI-BLAST, which takes 194 s, and PSI-BLAST combined with SANN (<xref ref-type="bibr" rid="B38">Joo et al., 2012</xref>), which requires 246 s.</p>
<p>DeepCSeqSite (<xref ref-type="bibr" rid="B16">Cui et al., 2019</xref>) leverages a deep convolutional neural network together with the position-specific score matrix (PSSM), relative solvent accessibility (RSA), and secondary structure (SS) predicted by PSSpred (<xref ref-type="bibr" rid="B84">Yan et al., 2013</xref>). RSA, a numeric value (often between 0 and 1), indicates how much of a residue&#x2019;s surface is solvent-exposed versus buried. PSSpred uses neural networks to predict secondary structure elements, such as alpha-helices, beta-sheets, and coils, directly from sequence data. These elements, combined with positional embeddings, are used to build a detailed feature map from the protein sequence. To further enhance prediction accuracy, additional features are incorporated: conservation scores (calculated via Jensen-Shannon divergence (JSD) and relative entropy), residue type, and dihedral angles predicted by ANGLOR (<xref ref-type="bibr" rid="B81">Wu and Zhang, 2008</xref>).</p>
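<p>A conservation score based on Jensen-Shannon divergence can be illustrated by comparing the amino acid distribution of each alignment column with a background distribution. The sketch below is a simplified version under stated assumptions: it uses a uniform background, a small pseudocount, and no sequence weighting or gap penalty, whereas published conservation tools generally use empirical backgrounds and additional corrections. All function names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies in one alignment column (gaps are ignored)."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits (relative entropy of p w.r.t. q)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence; lies between 0 and 1 with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def conservation_scores(alignment):
    """Score each column of an alignment given as equal-length strings."""
    # Assumption: uniform background instead of an empirical one.
    background = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    return [js_divergence(column_distribution(column), background)
            for column in zip(*alignment)]
```
]]></preformat>
<p>Columns dominated by a single residue diverge most from the background and therefore receive the highest scores, while highly variable columns score lower.</p>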
<p>DELIA (<xref ref-type="bibr" rid="B82">Xia et al., 2020</xref>) predicts protein&#x2013;ligand binding residues using a hybrid model of convolutional neural networks (CNNs) (<xref ref-type="bibr" rid="B47">LeCun and Bengio, 1995</xref>) and bidirectional long short-term memory networks (BiLSTMs) (<xref ref-type="bibr" rid="B67">Schuster and Paliwal, 1997</xref>). It processes both 1D sequence feature vectors and 2D distance matrices to analyze amino acid sequences alongside protein spatial structures. Its sequence-based features comprise PSSMs from PSI-BLAST and additional evolutionary data from HHblits, secondary structure and solvent accessibility predicted from the amino acid sequence by SCRATCH-1D (<xref ref-type="bibr" rid="B13">Cheng et al., 2005</xref>), and binding propensities from S-SITE (<xref ref-type="bibr" rid="B86">Yang et al., 2013a</xref>).</p>
<p>HoTS (<xref ref-type="bibr" rid="B48">Lee and Nam, 2022</xref>) employs a hierarchical recurrent neural network and 1D-CNN for protein sequence embedding to predict binding regions and drug&#x2013;target interactions. HoTS combines CNN and transformer-based models, using CNN layers to identify sequential motifs and transformers to model interdependencies. Fully connected layers then predict the binding regions.</p>
<p>Birds (<xref ref-type="bibr" rid="B12">Chelur and Priyakumar, 2022</xref>) utilizes a ResNet (<xref ref-type="bibr" rid="B27">He et al., 2016</xref>) architecture to predict a protein&#x2019;s binding site based on the protein&#x2019;s sequence information. This study employs a variety of techniques to extract information from protein sequences and construct a feature map, including token, positional, and segment embeddings, as well as multiple sequence alignments (MSAs) from DeepMSA (<xref ref-type="bibr" rid="B89">Zhang et al., 2020</xref>). From these MSAs, the position-specific score matrix (PSSM), secondary structure (SS), and information content (IC) are derived. Additionally, the relative solvent accessibility (RSA) of each amino acid is determined by SOLVPRED from MetaPSICOV 2.0 (<xref ref-type="bibr" rid="B37">Jones et al., 2015</xref>).</p>
<p>T5 GAT Ensemble (<xref ref-type="bibr" rid="B24">Gamouh et al., 2023</xref>) predicts protein-ligand binding sites with a hybrid approach combining sequence and structure data. The approach uses a protein language model (pLM) for sequence analysis and Graph Neural Networks (GNNs) (<xref ref-type="bibr" rid="B65">Scarselli et al., 2008</xref>) for structural insights, with ProtT5-XL-UniRef50 (<xref ref-type="bibr" rid="B20">Elnaggar et al., 2021</xref>) generating the amino acid sequence embeddings. These embeddings serve as node features in the protein graph, which is constructed with the Python Deep Graph Library (DGL) (<xref ref-type="bibr" rid="B80">Wang et al., 2019</xref>). In this graph, nodes represent individual residues, and edges encode spatial proximity between residues. To determine the most suitable architecture, the authors tested two well-known GNN designs: the Graph Convolutional Network (GCN) (<xref ref-type="bibr" rid="B41">Kipf and Welling, 2016</xref>) and the Graph Attention Network (GAT) (<xref ref-type="bibr" rid="B77">Veli&#x10d;kovi&#x107; et al., 2017</xref>).</p>
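<p>A residue-level protein graph of this kind can be sketched by connecting residues whose representative atoms lie within a distance cutoff. In the example below, the use of one 3D coordinate per residue, the 8.0 angstrom cutoff, and the function name are assumptions chosen for illustration; they are not the settings reported by the T5 GAT Ensemble authors.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def build_residue_graph(coords, cutoff=8.0):
    """Return undirected edges (i, j) for residues within `cutoff` angstroms.

    `coords` holds one (x, y, z) tuple per residue, e.g. the C-alpha atom.
    The cutoff value is a hypothetical default, not a published setting.
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                edges.append((i, j))
    return edges


if __name__ == "__main__":
    toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0),
                  (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
    print(build_residue_graph(toy_coords))
```
]]></preformat>
<p>The resulting edge list, together with per-residue language-model embeddings as node features, is the kind of input a graph library would need to assemble the protein graph.</p>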
<p>LaMPSite (<xref ref-type="bibr" rid="B92">Zhang and Xie, 2023</xref>) predicts ligand binding sites using protein sequences and ligand molecular graphs. This approach incorporates residue-level embeddings from the ESM-2 protein language model (<xref ref-type="bibr" rid="B50">Lin et al., 2023</xref>) for proteins and atom-level embeddings from a graph neural network for ligands. Additionally, LaMPSite employs a pooling module to aggregate interaction embeddings, simplifying them to generate a residue-specific score. Then it clusters residues using the protein contact map, ranking these clusters to pinpoint binding sites. Current clustering and filtering processes typically yield one binding site per prediction, which may limit the identification of multiple or cryptic binding sites.</p>
<p>Pseq2Sites (<xref ref-type="bibr" rid="B68">Seo et al., 2024</xref>) uses ProtTrans, a transformer-based model, to extract amino acid-level embeddings from the protein sequence. 1D-CNNs then extract local features from the resulting embedding sequence, and position-based attention mechanisms capture long-distance contextual information.</p>
<p>Seq-InSite (<xref ref-type="bibr" rid="B33">Hosseini et al., 2024</xref>) utilizes ProtT5 and MSA-transformer embeddings to predict protein interaction sites from sequence data. Its architecture employs ensemble learning techniques, integrating a Multi-Layer Perceptron (MLP) and a Long Short-Term Memory (LSTM) (<xref ref-type="bibr" rid="B31">Hochreiter and Schmidhuber, 1997</xref>) network. Seq-InSite predicts a broad range of protein interaction sites, including protein-ligand binding sites.</p>
<p>Overall, accurate prediction of protein-ligand binding sites is a crucial step in the drug discovery pipeline. Beyond theoretical predictions, these methods provide actionable insights that support drug target identification, lead optimization, and ligand design. Once protein binding sites are identified, these predictions lead to a variety of applications, including virtual screening (<xref ref-type="bibr" rid="B40">Kimber et al., 2021</xref>), studying off-target effects (<xref ref-type="bibr" rid="B59">Rao et al., 2023</xref>), predicting druggability scores (<xref ref-type="bibr" rid="B57">Raies et al., 2022</xref>), protein function prediction (<xref ref-type="bibr" rid="B45">Kulmanov and Hoehndorf, 2020</xref>), assessing mutation impacts (<xref ref-type="bibr" rid="B71">Sun et al., 2021</xref>), and pose prediction (<xref ref-type="bibr" rid="B79">Wang et al., 2022</xref>), among others.</p>
</sec>
</sec>
<sec id="s3">
<title>3 Analysis and discussion</title>
<p>This section discusses four main topics: advancements in extracting features from protein sequences, the limitations of sequence-based methods with an analysis of the approaches listed in <xref ref-type="table" rid="T1">Table 1</xref>, the advantages of hybrid methods that combine sequence- and structure-based techniques, and a review of the datasets used for testing, as well as tools like AlphaFold that are employed for protein folding predictions. Each topic highlights critical aspects of the methodologies and their contributions to improving protein-ligand binding site predictions.</p>
<p>The models in <xref ref-type="table" rid="T1">Table 1</xref> demonstrate a broad range of feature extraction techniques, spanning traditional evolution- and structure-based encodings to advanced protein language models (pLMs). 1D-CNNs are effective at extracting local motifs from protein sequences but may lose global context when motifs are spread across non-consecutive regions (<xref ref-type="bibr" rid="B48">Lee and Nam, 2022</xref>). PSSMs, a cornerstone of traditional methods, remain critical for capturing evolutionary information, with their removal causing significant performance drops (<xref ref-type="bibr" rid="B12">Chelur and Priyakumar, 2022</xref>). Relative solvent accessibility (RSA) and secondary structure elements add structural insights, but their impact on performance is less pronounced than that of embeddings or PSSMs (<xref ref-type="bibr" rid="B12">Chelur and Priyakumar, 2022</xref>). Secondary structure features and predicted dihedral angles provide structural context, with dihedral angles offering more fine-grained information; however, these features may also introduce noise (<xref ref-type="bibr" rid="B16">Cui et al., 2019</xref>). Protein language models, such as ProtT5-XL, offer significant advantages in processing speed, generating embeddings for a human protein in as little as 0.12 s (<xref ref-type="bibr" rid="B20">Elnaggar et al., 2021</xref>). This efficiency is essential when analyzing extensive datasets with millions of sequences, allowing for high accuracy without reliance on traditional, computationally intensive evolutionary steps. ProtT5-XL embeddings, for example, deliver high accuracy and rich information, outperforming alternatives such as MSA-transformer embeddings in predictive tasks (<xref ref-type="bibr" rid="B33">Hosseini et al., 2024</xref>). However, pLMs perform best on well-represented proteins and tend to be less effective for proteins that are rare or underrepresented in their training datasets, so predicting binding sites for rare or novel proteins remains challenging. As shown in <xref ref-type="table" rid="T1">Table 1</xref>, T5 GAT Ensemble, LaMPSite, Pseq2Sites, and Seq-InSite, which use pLM-based feature extraction, demonstrate promising results compared with the other studies listed in <xref ref-type="table" rid="T1">Table 1</xref>, which rely on traditional feature extraction methods.</p>
<p>One key advantage of sequence-based methods is their computational efficiency. For instance, on the well-known COACH420 dataset, sequence-based protein-ligand binding site prediction methods achieved much shorter execution times: Pseq2Sites completed predictions in 1.07 s, Birds in 3.97 s, DeepCSeqSite in 11.13 s, and HoTS in 51.84 s. In contrast, structure-based methods were considerably slower, with DeepPocket taking 894.28 s, DeepSurf 2436.76 s, and P2Rank 914.61 s (<xref ref-type="bibr" rid="B68">Seo et al., 2024</xref>). Although sequence-based methods are computationally efficient, they lack the spatial context needed to identify complex binding interactions, such as those involving residues across multiple protein chains. By analyzing each chain individually and then combining the results, traditional sequence-based methods often miss critical relationships, limiting their accuracy in predicting binding sites. The studies in <xref ref-type="table" rid="T1">Table 1</xref> highlight distinct characteristics of various models. For instance, SCRIBER incorporates over 1,000 input features and relies on feature elimination techniques to manage complexity, though it remains susceptible to overfitting. SCRIBER reported a Matthews correlation coefficient (MCC) (<xref ref-type="bibr" rid="B14">Chicco and Jurman, 2020</xref>) of 0.23. DELIA, on the other hand, is tailored for specific ligand types, which enhances predictive accuracy for those interactions but limits its applicability to general protein-ligand binding site prediction. DELIA achieved an average MCC of 0.469 across five different ligand types. Attention-based models like HoTS and Pseq2Sites excel at capturing both local interactions and long-range dependencies within sequences, making them effective for understanding complex sequence patterns. 
In its own evaluation, Pseq2Sites achieved a 96.8% success rate on the COACH test dataset, calculated as the number of correctly identified pockets divided by the total number of pockets. The same study reported success rates of 14.3% for HoTS and 70% for Birds, allowing comparison within a single evaluation framework. Seq-InSite achieved an MCC of 0.462 on the Dset_448 dataset, which focuses on ligands that are not proteins. However, sequence-based models still struggle to fully capture inter-chain interactions, which are critical for predicting functional binding sites in multimeric proteins. Sequence-based approaches are also generally less effective in identifying allosteric binding sites, which are often located far from the active site and can be missed without considering the protein&#x2019;s full 3D structure (<xref ref-type="bibr" rid="B83">Xia et al., 2024</xref>).</p>
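<p>The two evaluation measures quoted in this paragraph can be written out as a short sketch. The MCC follows its standard confusion-matrix definition, and the success rate follows the description given above (correctly identified pockets divided by total pockets). The function names and the toy counts are hypothetical and are not values from any of the reviewed studies; returning 0.0 when the MCC denominator vanishes is a common convention that is assumed here.</p>
<preformat><![CDATA[
```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from residue-level counts of
    true/false positives and negatives (binding vs. non-binding)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a row or column of the
        # confusion matrix is empty and the MCC is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


def success_rate(n_correct_pockets, n_total_pockets):
    """Fraction of pockets that were correctly identified."""
    return n_correct_pockets / n_total_pockets


# Hypothetical counts, chosen only to exercise the functions.
perfect_mcc = matthews_corrcoef(tp=10, tn=10, fp=0, fn=0)
inverted_mcc = matthews_corrcoef(tp=0, tn=0, fp=10, fn=10)
toy_rate = success_rate(3, 4)
```
]]></preformat>
<p>Because binding residues are usually a small minority of all residues, MCC is often preferred over plain accuracy: a model that labels every residue as non-binding scores a high accuracy but an MCC of zero under the convention above.</p>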
<p>Hybrid approaches, which integrate both sequence-based and structural features, have emerged as powerful strategies for enhancing prediction accuracy. The T5 GAT Ensemble is one such model: while its sequence-based MLP model achieves an MCC of 0.54, the hybrid model improves this to 0.59 by incorporating structural features. Similarly, DELIA, tested on five ligand types, demonstrated that the hybrid architecture outperformed sequence-based models in MCC for all ligand types. Another method, LaMPSite, predicts ligand binding sites by utilizing both protein sequences and ligand molecular graphs. Its ablation study indicates a decrease in accuracy when the interaction module, which combines protein and ligand information, is omitted. LaMPSite reports a success rate of 66.02% in terms of DCA (Distance Cutoff Accuracy).</p>
<p>The choice of datasets in protein-ligand binding site prediction plays a crucial role in developing and evaluating computational models. Fair testing requires guarding against data leakage, particularly similarity between training and test datasets. For instance, LaMPSite excludes scPDB structures with more than 50% sequence identity or 0.9 ligand similarity and removes proteins from COACH420. Pseq2Sites takes additional steps by using unseen test datasets and filtering proteins with &#x2264;40% structural similarity for unbiased evaluation. Studies like HoTS further promote fair analysis by reporting results at various similarity thresholds.</p>
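<p>The kind of similarity filtering described above can be sketched as follows. This is a simplified illustration, not the procedure of any reviewed study: the function name <italic>filter_training_set</italic>, the entry identifiers, and the identity values are hypothetical, and the pairwise sequence identities are assumed to have been computed beforehand with an alignment tool. The 0.5 threshold mirrors the 50% sequence identity figure mentioned in the text.</p>
<preformat><![CDATA[
```python
def filter_training_set(train_ids, test_ids, identity, max_identity=0.5):
    """Keep training entries whose sequence identity to every test entry
    is at most `max_identity`.

    `identity` is any callable (train_id, test_id) -> fraction in [0, 1];
    it is assumed to wrap precomputed alignment results.
    """
    kept = []
    for train_id in train_ids:
        if all(identity(train_id, test_id) <= max_identity for test_id in test_ids):
            kept.append(train_id)
    return kept


# Hypothetical precomputed identities between training and test proteins.
toy_identity = {
    ("trainA", "test1"): 0.30, ("trainA", "test2"): 0.45,
    ("trainB", "test1"): 0.82, ("trainB", "test2"): 0.20,
    ("trainC", "test1"): 0.50, ("trainC", "test2"): 0.10,
}

toy_kept = filter_training_set(
    ["trainA", "trainB", "trainC"],
    ["test1", "test2"],
    lambda train_id, test_id: toy_identity[(train_id, test_id)],
)
```
]]></preformat>
<p>Here the hypothetical entry trainB is dropped because it exceeds the threshold against one test protein. Reporting results at several values of <italic>max_identity</italic> corresponds to the practice of evaluating at various similarity thresholds.</p>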
<p>Although protein structure prediction software such as AlphaFold can facilitate hybrid approaches, certain limitations persist. AlphaFold2 (AF2) relies on patterns extracted from known protein folds rather than on an understanding of the physical and chemical basis of proteins (<xref ref-type="bibr" rid="B2">Agarwal and McShan, 2024</xref>). The experimentally determined 3D structural dataset is limited to fewer than 300,000 structures, compared to the billions of protein sequences available in public repositories. AlphaFold 3 (AF3) builds on the Evoformer architecture from AF2, incorporating a diffusion network that iteratively refines a cloud of atoms to generate highly accurate protein structures. AF3 can predict heme-binding sites; however, its reliance on structurally similar proteins in its training data limits its effectiveness for less-represented or novel protein sequences (<xref ref-type="bibr" rid="B42">Kondo and Takano, 2024</xref>). AF3 struggles to accurately predict ligand-binding poses, particularly for complex ligands such as peptides, ions, and non-standard molecules (<xref ref-type="bibr" rid="B28">He et al., 2024</xref>). Additionally, the lack of support for user-defined ligands and a broader range of ligand types further restricts AF3&#x2019;s applicability in practical drug discovery efforts. Single changes in the sequence (e.g., point mutations) can significantly alter a protein&#x2019;s function or cause misfolding, yet AF2 is not trained to predict the effects of mutations on protein structure or stability (<xref ref-type="bibr" rid="B2">Agarwal and McShan, 2024</xref>; <xref ref-type="bibr" rid="B56">Pak et al., 2023</xref>). AF3 has limitations in stereochemistry, hallucinations, dynamic behavior, and accuracy for specific targets (<xref ref-type="bibr" rid="B1">Abramson et al., 2024</xref>). The Predicted Local Distance Difference Test (pLDDT) serves as a confidence metric in AF2 and AF3 for evaluating the reliability of protein structure predictions. 
However, high pLDDT values or low Predicted Aligned Error (PAE) scores do not necessarily ensure alignment with experimental structures (<xref ref-type="bibr" rid="B11">Carugo, 2023</xref>; <xref ref-type="bibr" rid="B10">Buel and Walters, 2022</xref>).</p>
<p>Overall, this discussion highlights the strengths and limitations of both 3D (structure-based) and 1D (sequence-based) approaches and suggests that hybrid methodologies represent a promising direction for future research.</p>
</sec>
<sec id="s4">
<title>4 Future directions</title>
<p>Future advancements in protein binding site prediction are likely to focus on integrating sequence-based and structure-based data to improve model accuracy, particularly for complex binding sites that depend on 3D spatial context. Hybrid models that combine these two types of data show promise in addressing limitations of sequence-only methods, such as identifying distant allosteric sites or inter-chain interactions. Another promising direction involves the development of transformer-based models specifically tailored for protein-ligand interactions, utilizing advanced embeddings to capture intricate sequence patterns and dependencies. Recently, GPT-based (<xref ref-type="bibr" rid="B8">Brown et al., 2020</xref>) studies have emerged in protein engineering, harnessing protein sequence data and the capabilities of large language models (LLMs) rooted in natural language processing (NLP). These advancements emphasize the need for a deeper understanding of protein sequence data, better representations of it, and deep learning architectures designed to exploit those representations. Reviews such as this one, which focus on sequence-based methods, may contribute to the development of these tools. To further advance the field, it will be critical to enhance the adaptability of pLMs for underrepresented or rare proteins. This could be achieved by expanding training datasets or developing adaptive embedding methods. Additionally, collaboration across computational, experimental, and industrial fields will be essential for validating and refining these models. Such efforts aim to improve generalizability and optimize predictive tools for specific therapeutic targets, ultimately accelerating advancements in computational drug discovery.</p>
</sec>
<sec sec-type="conclusion" id="s5">
<title>5 Conclusion</title>
<p>The prediction of protein-ligand binding sites is crucial for advancing drug discovery and development, as it enables the identification of potential drug targets and the design of more effective therapeutics. Accurate prediction methods can significantly streamline the drug discovery process, reducing the time and cost associated with experimental validation. This review surveys sequence-based machine learning approaches for predicting protein-ligand binding sites in computational drug discovery. It examines the models with a focus on their embedding methods and deep learning architectures, and discusses the challenges and future directions associated with sequence-based methods. It is intended to serve as a guide to sequence-based prediction of protein-ligand binding sites, bringing the existing literature together in a single paper.</p>
</sec>
</body>
<back>
<sec sec-type="author-contributions" id="s6">
<title>Author contributions</title>
<p>OV: Conceptualization, Data curation, Formal Analysis, Investigation, Resources, Writing&#x2013;original draft, Writing&#x2013;review and editing. LJ: Project administration, Supervision, Validation, Writing&#x2013;original draft, Writing&#x2013;review and editing.</p>
</sec>
<sec sec-type="funding-information" id="s7">
<title>Funding</title>
<p>The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.</p>
</sec>
<ack>
<p>The authors would like to sincerely thank Dr. Lurong Pan for her valuable guidance and support throughout this study. We also extend our gratitude to Recep Tayyip Erdogan University.</p>
</ack>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s9">
<title>Generative AI statement</title>
<p>The authors declare that no Generative AI was used in the creation of this manuscript.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abramson</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Adler</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dunger</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Evans</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Green</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Pritzel</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2024</year>). <article-title>Accurate structure prediction of biomolecular interactions with AlphaFold 3</article-title>. <source>Nature</source> <volume>630</volume>, <fpage>493</fpage>&#x2013;<lpage>500</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-024-07487-w</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Agarwal</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>McShan</surname>
<given-names>A. C.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>The power and pitfalls of AlphaFold2 for structure prediction beyond rigid globular proteins</article-title>. <source>Nat. Chem. Biol.</source> <volume>20</volume> (<issue>8</issue>), <fpage>950</fpage>&#x2013;<lpage>959</lpage>. <pub-id pub-id-type="doi">10.1038/s41589-024-01638-w</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Aggarwal</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Gupta</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Chelur</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Jawahar</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Priyakumar</surname>
<given-names>U. D.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>DeepPocket: ligand binding site detection and segmentation using 3D convolutional neural networks</article-title>. <source>J. Chem. Inf. Model.</source> <volume>62</volume> (<issue>21</issue>), <fpage>5069</fpage>&#x2013;<lpage>5079</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.1c00799</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alfaro</surname>
<given-names>J. A.</given-names>
</name>
<name>
<surname>Bohl&#xe4;nder</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Filius</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Howard</surname>
<given-names>C. J.</given-names>
</name>
<name>
<surname>Van Kooten</surname>
<given-names>X. F.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>The emerging landscape of single-molecule protein sequencing technologies</article-title>. <source>Nat. methods</source> <volume>18</volume> (<issue>6</issue>), <fpage>604</fpage>&#x2013;<lpage>617</lpage>. <pub-id pub-id-type="doi">10.1038/s41592-021-01143-1</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alley</surname>
<given-names>E. C.</given-names>
</name>
<name>
<surname>Khimulya</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Biswas</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>AlQuraishi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>G. M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Unified rational protein engineering with sequence-based deep representation learning</article-title>. <source>Nat. methods</source> <volume>16</volume> (<issue>12</issue>), <fpage>1315</fpage>&#x2013;<lpage>1322</lpage>. <pub-id pub-id-type="doi">10.1038/s41592-019-0598-1</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Asgari</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Mofrad</surname>
<given-names>M. R.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Continuous distributed representation of biological sequences for deep proteomics and genomics</article-title>. <source>PloS one</source> <volume>10</volume> (<issue>11</issue>), <fpage>e0141287</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0141287</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Boutet</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Lieberherr</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Tognolli</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bairoch</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2007</year>). &#x201c;<article-title>UniProtKB/Swiss-Prot: the manually annotated section of the UniProt KnowledgeBase</article-title>,&#x201d; in <source>Plant bioinformatics: methods and protocols</source> (<publisher-name>Springer</publisher-name>), <fpage>89</fpage>&#x2013;<lpage>112</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brown</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Mann</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Ryder</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Subbiah</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kaplan</surname>
<given-names>J. D.</given-names>
</name>
<name>
<surname>Dhariwal</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Language models are few-shot learners</article-title>. <source>Adv. neural Inf. Process. Syst.</source> <volume>33</volume>, <fpage>1877</fpage>&#x2013;<lpage>1901</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.2005.14165</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buchan</surname>
<given-names>D. W.</given-names>
</name>
<name>
<surname>Minneci</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Nugent</surname>
<given-names>T. C.</given-names>
</name>
<name>
<surname>Bryson</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>D. T.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Scalable web services for the PSIPRED protein analysis workbench</article-title>. <source>Nucleic acids Res.</source> <volume>41</volume> (<issue>W1</issue>), <fpage>W349</fpage>&#x2013;<lpage>W357</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkt381</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Buel</surname>
<given-names>G. R.</given-names>
</name>
<name>
<surname>Walters</surname>
<given-names>K. J.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Can AlphaFold2 predict the impact of missense mutations on structure?</article-title> <source>Nat. Struct. and Mol. Biol.</source> <volume>29</volume> (<issue>1</issue>), <fpage>1</fpage>&#x2013;<lpage>2</lpage>. <pub-id pub-id-type="doi">10.1038/s41594-021-00714-2</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Carugo</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>pLDDT values in AlphaFold2 protein models are unrelated to globular protein local flexibility</article-title>. <source>Crystals</source> <volume>13</volume> (<issue>11</issue>), <fpage>1560</fpage>. <pub-id pub-id-type="doi">10.3390/cryst13111560</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chelur</surname>
<given-names>V. R.</given-names>
</name>
<name>
<surname>Priyakumar</surname>
<given-names>U. D.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Birds-binding residue detection from protein sequences using deep resnets</article-title>. <source>J. Chem. Inf. Model.</source> <volume>62</volume> (<issue>8</issue>), <fpage>1809</fpage>&#x2013;<lpage>1818</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.1c00972</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Randall</surname>
<given-names>A. Z.</given-names>
</name>
<name>
<surname>Sweredoski</surname>
<given-names>M. J.</given-names>
</name>
<name>
<surname>Baldi</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>SCRATCH: a protein structure and structural feature prediction server</article-title>. <source>Nucleic acids Res.</source> <volume>33</volume> (<issue>Suppl. 2</issue>), <fpage>W72</fpage>&#x2013;<lpage>W76</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gki396</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chicco</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Jurman</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation</article-title>. <source>BMC genomics</source> <volume>21</volume>, <fpage>6</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1186/s12864-019-6413-7</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cramer</surname>
<given-names>J. S.</given-names>
</name>
</person-group> (<year>2002</year>). <source>The origins of logistic regression, Tinbergen Institute working paper, no. 2002-119/4</source>. <pub-id pub-id-type="doi">10.2139/ssrn.360300</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Predicting protein-ligand binding residues with deep convolutional neural networks</article-title>. <source>BMC Bioinforma.</source> <volume>20</volume>, <fpage>93</fpage>&#x2013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1186/s12859-019-2672-1</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Desaphy</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bret</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Rognan</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Kellenberger</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>sc-PDB: a 3D-database of ligandable binding sites&#x2014;10 years on</article-title>. <source>Nucleic acids Res.</source> <volume>43</volume> (<issue>D1</issue>), <fpage>D399</fpage>&#x2013;<lpage>D404</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gku928</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Doszt&#xe1;nyi</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>M&#xe9;sz&#xe1;ros</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Simon</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>ANCHOR: web server for predicting protein binding regions in disordered proteins</article-title>. <source>Bioinformatics</source> <volume>25</volume> (<issue>20</issue>), <fpage>2745</fpage>&#x2013;<lpage>2746</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btp518</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dunbar</surname>
<given-names>J. B.</given-names>
<suffix>Jr</suffix>
</name>
<name>
<surname>Smith</surname>
<given-names>R. D.</given-names>
</name>
<name>
<surname>Damm-Ganamet</surname>
<given-names>K. L.</given-names>
</name>
<name>
<surname>Ahmed</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Esposito</surname>
<given-names>E. X.</given-names>
</name>
<name>
<surname>Delproposto</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2013</year>). <article-title>CSAR data set release 2012: ligands, affinities, complexes, and docking decoys</article-title>. <source>J. Chem. Inf. Model.</source> <volume>53</volume> (<issue>8</issue>), <fpage>1842</fpage>&#x2013;<lpage>1852</lpage>. <pub-id pub-id-type="doi">10.1021/ci4000486</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Elnaggar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Heinzinger</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Dallago</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Rehawi</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>ProtTrans: toward understanding the language of life through self-supervised learning</article-title>. <source>IEEE Trans. pattern analysis Mach. Intell.</source> <volume>44</volume> (<issue>10</issue>), <fpage>7112</fpage>&#x2013;<lpage>7127</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2021.3095381</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Faraggi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Kloczkowski</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Accurate single&#x2010;sequence prediction of solvent accessible surface area using local and global features</article-title>. <source>Proteins Struct. Funct. Bioinforma.</source> <volume>82</volume> (<issue>11</issue>), <fpage>3170</fpage>&#x2013;<lpage>3176</lpage>. <pub-id pub-id-type="doi">10.1002/prot.24682</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Finn</surname>
<given-names>R. D.</given-names>
</name>
<name>
<surname>Bateman</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Clements</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Coggill</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Eberhardt</surname>
<given-names>R. Y.</given-names>
</name>
<name>
<surname>Eddy</surname>
<given-names>S. R.</given-names>
</name>
<etal/>
</person-group> (<year>2014</year>). <article-title>Pfam: the protein families database</article-title>. <source>Nucleic acids Res.</source> <volume>42</volume> (<issue>D1</issue>), <fpage>D222</fpage>&#x2013;<lpage>D230</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkt1223</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gagliardi</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Raffo</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Fugacci</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Biasotti</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Rocchia</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>SHREC 2022: protein&#x2013;ligand binding site recognition</article-title>. <source>Comput. Graph.</source> <volume>107</volume>, <fpage>20</fpage>&#x2013;<lpage>31</lpage>. <pub-id pub-id-type="doi">10.1016/j.cag.2022.07.005</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gamouh</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Hoksza</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Novotny</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2023</year>). <source>Hybrid protein-ligand binding residue prediction with protein language models: does the structure matter?</source> <comment>bioRxiv 2023.08.11.553028</comment>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Comprehensive study on enhancing low-quality position-specific scoring matrix with deep learning for accurate protein structure property prediction: using bagging multiple sequence alignment learning</article-title>. <source>J. Comput. Biol.</source> <volume>28</volume> (<issue>4</issue>), <fpage>346</fpage>&#x2013;<lpage>361</lpage>. <pub-id pub-id-type="doi">10.1089/cmb.2020.0416</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gupta</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Srivastava</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Sahu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Tiwari</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ambasta</surname>
<given-names>R. K.</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Artificial intelligence to deep learning: machine intelligence approach for drug discovery</article-title>. <source>Mol. Divers.</source> <volume>25</volume>, <fpage>1315</fpage>&#x2013;<lpage>1360</lpage>. <pub-id pub-id-type="doi">10.1007/s11030-021-10217-3</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Deep residual learning for image recognition</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, <conf-loc>Las Vegas, NV, USA</conf-loc>, <conf-date>June 27 2016&#x2013;June 30 2016</conf-date>, <fpage>770</fpage>&#x2013;<lpage>778</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>X. H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J. R.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>S. Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>H. E.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>AlphaFold3 versus experimental structures: assessment of the accuracy in ligand-bound G protein-coupled receptors</article-title>. <source>Acta Pharmacol. Sin</source>. <pub-id pub-id-type="doi">10.1038/s41401-024-01429-y</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heinzinger</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Elnaggar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Dallago</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Nechaev</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Matthes</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Modeling aspects of the language of life through transfer-learning protein sequences</article-title>. <source>BMC Bioinforma.</source> <volume>20</volume>, <fpage>723</fpage>. <pub-id pub-id-type="doi">10.1186/s12859-019-3220-8</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Higurashi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ishida</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kinoshita</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>PiSite: a database of protein interaction sites using multiple binding states in the PDB</article-title>. <source>Nucleic Acids Res.</source> <volume>37</volume> (<issue>Suppl. 1</issue>), <fpage>D360</fpage>&#x2013;<lpage>D364</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkn659</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hochreiter</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Schmidhuber</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>1997</year>). <article-title>Long short-term memory</article-title>. <source>Neural Comput.</source> <volume>9</volume> (<issue>8</issue>), <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Hoksza</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Gamouh</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Exploration of protein sequence embeddings for protein-ligand binding site detection</article-title>,&#x201d; in <conf-name>2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</conf-name>, <conf-loc>Las Vegas, NV, USA</conf-loc>, <conf-date>06-08 December 2022</conf-date> (<publisher-name>IEEE</publisher-name>), <fpage>3356</fpage>&#x2013;<lpage>3361</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hosseini</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Golding</surname>
<given-names>G. B.</given-names>
</name>
<name>
<surname>Ilie</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Seq-InSite: sequence supersedes structure for protein interaction site prediction</article-title>. <source>Bioinformatics</source> <volume>40</volume> (<issue>1</issue>), <fpage>btad738</fpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btad738</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ibtehaz</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Kihara</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2023</year>). &#x201c;<article-title>Application of sequence embedding in protein sequence-based predictions</article-title>,&#x201d; in <source>Machine learning in bioinformatics of protein sequences: algorithms, databases and resources for modern protein bioinformatics</source> (<publisher-name>World Scientific</publisher-name>), <fpage>31</fpage>&#x2013;<lpage>55</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>FRSite: protein drug binding site prediction based on faster R&#x2013;CNN</article-title>. <source>J. Mol. Graph. Model.</source> <volume>93</volume>, <fpage>107454</fpage>. <pub-id pub-id-type="doi">10.1016/j.jmgm.2019.107454</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jing</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Amino acid encoding methods for protein sequences: a comprehensive review and assessment</article-title>. <source>IEEE/ACM Trans. Comput. Biol. Bioinforma.</source> <volume>17</volume> (<issue>6</issue>), <fpage>1918</fpage>&#x2013;<lpage>1931</lpage>. <pub-id pub-id-type="doi">10.1109/tcbb.2019.2911677</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jones</surname>
<given-names>D. T.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kosciolek</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tetchner</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins</article-title>. <source>Bioinformatics</source> <volume>31</volume> (<issue>7</issue>), <fpage>999</fpage>&#x2013;<lpage>1006</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btu791</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Joo</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>SANN: solvent accessibility prediction of proteins by nearest neighbor method</article-title>. <source>Proteins Struct. Funct. Bioinforma.</source> <volume>80</volume> (<issue>7</issue>), <fpage>1791</fpage>&#x2013;<lpage>1797</lpage>. <pub-id pub-id-type="doi">10.1002/prot.24074</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kawashima</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Pokarowski</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Pokarowska</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kolinski</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Katayama</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kanehisa</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>AAindex: amino acid index database, progress report 2008</article-title>. <source>Nucleic Acids Res.</source> <volume>36</volume> (<issue>Suppl. 1</issue>), <fpage>D202</fpage>&#x2013;<lpage>D205</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkm998</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kimber</surname>
<given-names>T. B.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Volkamer</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Deep learning in virtual screening: recent applications and developments</article-title>. <source>Int. J. Mol. Sci.</source> <volume>22</volume> (<issue>9</issue>), <fpage>4435</fpage>. <pub-id pub-id-type="doi">10.3390/ijms22094435</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kipf</surname>
<given-names>T. N.</given-names>
</name>
<name>
<surname>Welling</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). <source>Semi-supervised classification with graph convolutional networks</source>. <comment>arXiv preprint arXiv:1609.02907</comment>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kondo</surname>
<given-names>H. X.</given-names>
</name>
<name>
<surname>Takano</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Structure comparison of heme-binding sites in heme protein predicted by AlphaFold3 and AlphaFold2</article-title>. <source>Chem. Lett.</source> <volume>53</volume> (<issue>8</issue>), <fpage>upae148</fpage>. <pub-id pub-id-type="doi">10.1093/chemle/upae148</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Krause</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Murray</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Renals</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2016</year>). <source>Multiplicative LSTM for sequence modelling</source>. <comment>arXiv preprint arXiv:1609.07959</comment>.</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kriv&#xe1;k</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Hoksza</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure</article-title>. <source>J. Cheminformatics</source> <volume>10</volume>, <fpage>39</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-018-0285-8</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kulmanov</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hoehndorf</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>DeepGOPlus: improved protein function prediction from sequence</article-title>. <source>Bioinformatics</source> <volume>36</volume> (<issue>2</issue>), <fpage>422</fpage>&#x2013;<lpage>429</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btz595</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Laine</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Eismann</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Elofsson</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Grudinin</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Protein sequence&#x2010;to&#x2010;structure learning: is this the end (&#x2010;to&#x2010;end revolution)?</article-title> <source>Proteins Struct. Funct. Bioinforma.</source> <volume>89</volume> (<issue>12</issue>), <fpage>1770</fpage>&#x2013;<lpage>1786</lpage>. <pub-id pub-id-type="doi">10.1002/prot.26235</pub-id>
</citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>LeCun</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>1995</year>). <article-title>Convolutional networks for images, speech, and time series</article-title>. <source>The handbook of brain theory and neural networks</source>.</citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Nam</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Sequence-based prediction of protein binding regions and drug&#x2013;target interactions</article-title>. <source>J. Cheminformatics</source> <volume>14</volume> (<issue>1</issue>), <fpage>5</fpage>. <pub-id pub-id-type="doi">10.1186/s13321-022-00584-w</pub-id>
</citation>
</ref>
<ref id="B49">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Tu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>RecurPocket: recurrent Lmser network with gating mechanism for protein binding site detection</article-title>,&#x201d; in <conf-name>2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</conf-name>, <conf-loc>Las Vegas, NV, USA</conf-loc>, <conf-date>06-08 December 2022</conf-date> (<publisher-name>IEEE</publisher-name>), <fpage>334</fpage>&#x2013;<lpage>339</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Akin</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Rao</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Hie</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>W.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Evolutionary-scale prediction of atomic-level protein structure with a language model</article-title>. <source>Science</source> <volume>379</volume> (<issue>6637</issue>), <fpage>1123</fpage>&#x2013;<lpage>1130</lpage>. <pub-id pub-id-type="doi">10.1126/science.ade2574</pub-id>
</citation>
</ref>
<ref id="B51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Tu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>RefinePocket: an attention-enhanced and mask-guided deep learning approach for protein binding site prediction</article-title>. <source>IEEE/ACM Trans. Comput. Biol. Bioinforma.</source> <volume>20</volume>, <fpage>3314</fpage>&#x2013;<lpage>3321</lpage>. <pub-id pub-id-type="doi">10.1109/tcbb.2023.3265640</pub-id>
</citation>
</ref>
<ref id="B52">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Z.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>PDB-wide collection of binding data: current status of the PDBbind database</article-title>. <source>Bioinformatics</source> <volume>31</volume> (<issue>3</issue>), <fpage>405</fpage>&#x2013;<lpage>412</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btu626</pub-id>
</citation>
</ref>
<ref id="B53">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Maveyraud</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Mourey</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Protein X-ray crystallography and drug discovery</article-title>. <source>Molecules</source> <volume>25</volume> (<issue>5</issue>), <fpage>1030</fpage>. <pub-id pub-id-type="doi">10.3390/molecules25051030</pub-id>
</citation>
</ref>
<ref id="B54">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Miciaccia</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Belviso</surname>
<given-names>B. D.</given-names>
</name>
<name>
<surname>Iaselli</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cingolani</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Ferorelli</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Cappellari</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Three-dimensional structure of human cyclooxygenase (hCOX)-1</article-title>. <source>Sci. Rep.</source> <volume>11</volume> (<issue>1</issue>), <fpage>4312</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-021-83438-z</pub-id>
</citation>
</ref>
<ref id="B55">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mikolov</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Corrado</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dean</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2013</year>). <source>Efficient estimation of word representations in vector space</source>. <comment>
<italic>arXiv preprint arXiv:1301.3781</italic>
</comment>.</citation>
</ref>
<ref id="B56">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pak</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Markhieva</surname>
<given-names>K. A.</given-names>
</name>
<name>
<surname>Novikova</surname>
<given-names>M. S.</given-names>
</name>
<name>
<surname>Petrov</surname>
<given-names>D. S.</given-names>
</name>
<name>
<surname>Vorobyev</surname>
<given-names>I. S.</given-names>
</name>
<name>
<surname>Maksimova</surname>
<given-names>E. S.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Using AlphaFold to predict the impact of single mutations on protein stability and function</article-title>. <source>PLoS One</source> <volume>18</volume> (<issue>3</issue>), <fpage>e0282689</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0282689</pub-id>
</citation>
</ref>
<ref id="B57">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raies</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Tulodziecka</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Stainer</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Middleton</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Dhindsa</surname>
<given-names>R. S.</given-names>
</name>
<name>
<surname>Hill</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>DrugnomeAI is an ensemble machine-learning framework for predicting druggability of candidate drug targets</article-title>. <source>Commun. Biol.</source> <volume>5</volume> (<issue>1</issue>), <fpage>1291</fpage>. <pub-id pub-id-type="doi">10.1038/s42003-022-04245-4</pub-id>
</citation>
</ref>
<ref id="B58">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Raj</surname>
<given-names>S. S.</given-names>
</name>
<name>
<surname>Chandra</surname>
<given-names>S. V.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Significance of sequence features in classification of protein&#x2013;protein interactions using machine learning</article-title>. <source>Protein J.</source> <volume>43</volume> (<issue>1</issue>), <fpage>72</fpage>&#x2013;<lpage>83</lpage>. <pub-id pub-id-type="doi">10.1007/s10930-023-10168-8</pub-id>
</citation>
</ref>
<ref id="B59">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rao</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>McDuffie</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Sachs</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Artificial intelligence/machine learning-driven small molecule repurposing via off-target prediction and transcriptomics</article-title>. <source>Toxics</source> <volume>11</volume> (<issue>10</issue>), <fpage>875</fpage>. <pub-id pub-id-type="doi">10.3390/toxics11100875</pub-id>
</citation>
</ref>
<ref id="B60">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rao</surname>
<given-names>R. M.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Verkuil</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Meier</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Canny</surname>
<given-names>J. F.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). &#x201c;<article-title>MSA transformer</article-title>,&#x201d; in <source>International conference on machine learning</source> (<publisher-name>PMLR</publisher-name>), <fpage>8844</fpage>&#x2013;<lpage>8856</lpage>. <comment>
<ext-link ext-link-type="uri" xlink:href="https://proceedings.mlr.press/v139/rao21a.html">https://proceedings.mlr.press/v139/rao21a.html</ext-link>
</comment>.</citation>
</ref>
<ref id="B61">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Remmert</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Biegert</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hauser</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>S&#xf6;ding</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment</article-title>. <source>Nat. Methods</source> <volume>9</volume> (<issue>2</issue>), <fpage>173</fpage>&#x2013;<lpage>175</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.1818</pub-id>
</citation>
</ref>
<ref id="B62">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rives</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Meier</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Sercu</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Goyal</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences</article-title>. <source>Proc. Natl. Acad. Sci.</source> <volume>118</volume> (<issue>15</issue>), <fpage>e2016239118</fpage>. <pub-id pub-id-type="doi">10.1073/pnas.2016239118</pub-id>
</citation>
</ref>
<ref id="B63">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sadybekov</surname>
<given-names>A. V.</given-names>
</name>
<name>
<surname>Katritch</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Computational approaches streamlining drug discovery</article-title>. <source>Nature</source> <volume>616</volume> (<issue>7958</issue>), <fpage>673</fpage>&#x2013;<lpage>685</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-023-05905-z</pub-id>
</citation>
</ref>
<ref id="B64">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sarzynska-Wawer</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wawer</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pawlak</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Szymanowska</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Stefaniak</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Jarkiewicz</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Detecting formal thought disorder by deep contextualized word representations</article-title>. <source>Psychiatry Res.</source> <volume>304</volume>, <fpage>114135</fpage>. <pub-id pub-id-type="doi">10.1016/j.psychres.2021.114135</pub-id>
</citation>
</ref>
<ref id="B65">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scarselli</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Gori</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Tsoi</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Hagenbuchner</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Monfardini</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>The graph neural network model</article-title>. <source>IEEE Trans. Neural Netw.</source> <volume>20</volume> (<issue>1</issue>), <fpage>61</fpage>&#x2013;<lpage>80</lpage>. <pub-id pub-id-type="doi">10.1109/TNN.2008.2005605</pub-id>
</citation>
</ref>
<ref id="B66">
<citation citation-type="book">
<person-group person-group-type="author">
<collab>Schr&#xf6;dinger, LLC</collab>
</person-group> (<year>2015</year>). <source>The PyMOL molecular graphics system</source>. <comment>Version 1.8</comment>.</citation>
</ref>
<ref id="B67">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schuster</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Paliwal</surname>
<given-names>K. K.</given-names>
</name>
</person-group> (<year>1997</year>). <article-title>Bidirectional recurrent neural networks</article-title>. <source>IEEE Trans. Signal Process.</source> <volume>45</volume> (<issue>11</issue>), <fpage>2673</fpage>&#x2013;<lpage>2681</lpage>. <pub-id pub-id-type="doi">10.1109/78.650093</pub-id>
</citation>
</ref>
<ref id="B68">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Seo</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Pseq2Sites: enhancing protein sequence-based ligand binding-site prediction accuracy via the deep convolutional network and attention mechanism</article-title>. <source>Eng. Appl. Artif. Intell.</source> <volume>127</volume>, <fpage>107257</fpage>. <pub-id pub-id-type="doi">10.1016/j.engappai.2023.107257</pub-id>
</citation>
</ref>
<ref id="B69">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stank</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kokh</surname>
<given-names>D. B.</given-names>
</name>
<name>
<surname>Fuller</surname>
<given-names>J. C.</given-names>
</name>
<name>
<surname>Wade</surname>
<given-names>R. C.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Protein binding pocket dynamics</article-title>. <source>Accounts Chem. Res.</source> <volume>49</volume> (<issue>5</issue>), <fpage>809</fpage>&#x2013;<lpage>815</lpage>. <pub-id pub-id-type="doi">10.1021/acs.accounts.5b00516</pub-id>
</citation>
</ref>
<ref id="B70">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stepniewska-Dziubinska</surname>
<given-names>M. M.</given-names>
</name>
<name>
<surname>Zielenkiewicz</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Siedlecki</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Improving detection of protein-ligand binding sites with 3D segmentation</article-title>. <source>Sci. Rep.</source> <volume>10</volume> (<issue>1</issue>), <fpage>5035</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-020-61860-z</pub-id>
</citation>
</ref>
<ref id="B71">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>PremPLI: a machine learning model for predicting the effects of missense mutations on protein-ligand interactions</article-title>. <source>Commun. Biol.</source> <volume>4</volume> (<issue>1</issue>), <fpage>1311</fpage>. <pub-id pub-id-type="doi">10.1038/s42003-021-02826-3</pub-id>
</citation>
</ref>
<ref id="B72">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sunseri</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Koes</surname>
<given-names>D. R.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Libmolgrid: graphics processing unit accelerated molecular gridding for deep learning applications</article-title>. <source>J. Chem. Inf. Model.</source> <volume>60</volume> (<issue>3</issue>), <fpage>1079</fpage>&#x2013;<lpage>1084</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.9b01145</pub-id>
</citation>
</ref>
<ref id="B73">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Suzek</surname>
<given-names>B. E.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>McGarvey</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Mazumder</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>C. H.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>UniRef: comprehensive and non-redundant UniProt reference clusters</article-title>. <source>Bioinformatics</source> <volume>23</volume> (<issue>10</issue>), <fpage>1282</fpage>&#x2013;<lpage>1288</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btm098</pub-id>
</citation>
</ref>
<ref id="B74">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tran</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Khadkikar</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Porollo</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Survey of protein sequence embedding models</article-title>. <source>Int. J. Mol. Sci.</source> <volume>24</volume> (<issue>4</issue>), <fpage>3775</fpage>. <pub-id pub-id-type="doi">10.3390/ijms24043775</pub-id>
</citation>
</ref>
<ref id="B75">
<citation citation-type="journal">
<collab>UniProt Consortium</collab> (<year>2015</year>). <article-title>UniProt: a hub for protein information</article-title>. <source>Nucleic Acids Res.</source> <volume>43</volume> (<issue>D1</issue>), <fpage>D204</fpage>&#x2013;<lpage>D212</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gku989</pub-id>
</citation>
</ref>
<ref id="B76">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vaswani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shazeer</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Parmar</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Uszkoreit</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Gomez</surname>
<given-names>A. N.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Attention is all you need</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>30</volume>. <pub-id pub-id-type="doi">10.48550/arXiv.1706.03762</pub-id>
</citation>
</ref>
<ref id="B77">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Veli&#x10d;kovi&#x107;</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Cucurull</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Casanova</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Romero</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Li&#xf2;</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Graph attention networks</source>. <comment>arXiv preprint arXiv:1710.10903</comment>.</citation>
</ref>
<ref id="B78">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Villegas-Morcillo</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gomez</surname>
<given-names>A. M.</given-names>
</name>
<name>
<surname>Sanchez</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>An analysis of protein language model embeddings for fold prediction</article-title>. <source>Briefings Bioinforma.</source> <volume>23</volume> (<issue>3</issue>), <fpage>bbac142</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbac142</pub-id>
</citation>
</ref>
<ref id="B79">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>A reinforcement learning approach for protein&#x2013;ligand binding pose prediction</article-title>. <source>BMC Bioinforma.</source> <volume>23</volume> (<issue>1</issue>), <fpage>368</fpage>. <pub-id pub-id-type="doi">10.1186/s12859-022-04912-7</pub-id>
</citation>
</ref>
<ref id="B80">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Gan</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <source>Deep graph library: a graph-centric, highly-performant package for graph neural networks</source>. <comment>arXiv preprint arXiv:1909.01315</comment>.</citation>
</ref>
<ref id="B81">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>ANGLOR: a composite machine-learning algorithm for protein backbone torsion angle prediction</article-title>. <source>PLoS One</source> <volume>3</volume> (<issue>10</issue>), <fpage>e3400</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0003400</pub-id>
</citation>
</ref>
<ref id="B82">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xia</surname>
<given-names>C. Q.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>H.-B.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Protein&#x2013;ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data</article-title>. <source>Bioinformatics</source> <volume>36</volume> (<issue>10</issue>), <fpage>3018</fpage>&#x2013;<lpage>3027</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btaa110</pub-id>
</citation>
</ref>
<ref id="B83">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xia</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>H.-B.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>A comprehensive survey on protein-ligand binding site prediction</article-title>. <source>Curr. Opin. Struct. Biol.</source> <volume>86</volume>, <fpage>102793</fpage>. <pub-id pub-id-type="doi">10.1016/j.sbi.2024.102793</pub-id>
</citation>
</ref>
<ref id="B84">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yan</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Walker</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction</article-title>. <source>Sci. Rep.</source> <volume>3</volume> (<issue>1</issue>), <fpage>2619</fpage>. <pub-id pub-id-type="doi">10.1038/srep02619</pub-id>
</citation>
</ref>
<ref id="B85">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>PointSite: a point cloud segmentation tool for identification of protein ligand binding atoms</article-title>. <source>J. Chem. Inf. Model.</source> <volume>62</volume> (<issue>11</issue>), <fpage>2835</fpage>&#x2013;<lpage>2845</lpage>. <pub-id pub-id-type="doi">10.1021/acs.jcim.1c01512</pub-id>
</citation>
</ref>
<ref id="B86">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Roy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2013a</year>). <article-title>Protein&#x2013;ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment</article-title>. <source>Bioinformatics</source> <volume>29</volume> (<issue>20</issue>), <fpage>2588</fpage>&#x2013;<lpage>2595</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt447</pub-id>
</citation>
</ref>
<ref id="B87">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Roy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2013b</year>). <article-title>Protein&#x2013;ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment</article-title>. <source>Bioinformatics</source> <volume>29</volume> (<issue>20</issue>), <fpage>2588</fpage>&#x2013;<lpage>2595</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btt447</pub-id>
</citation>
</ref>
<ref id="B88">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Freddolino</surname>
<given-names>P. L.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>BioLiP2: an updated structure database for biologically relevant ligand&#x2013;protein interactions</article-title>. <source>Nucleic Acids Res.</source> <volume>52</volume> (<issue>D1</issue>), <fpage>D404</fpage>&#x2013;<lpage>D412</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkad630</pub-id>
</citation>
</ref>
<ref id="B89">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Mortuza</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins</article-title>. <source>Bioinformatics</source> <volume>36</volume> (<issue>7</issue>), <fpage>2105</fpage>&#x2013;<lpage>2112</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btz863</pub-id>
</citation>
</ref>
<ref id="B90">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kurgan</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences</article-title>. <source>Bioinformatics</source> <volume>35</volume> (<issue>14</issue>), <fpage>i343</fpage>&#x2013;<lpage>i353</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btz324</pub-id>
</citation>
</ref>
<ref id="B91">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A review on the recent developments of sequence-based protein feature extraction methods</article-title>. <source>Curr. Bioinforma.</source> <volume>14</volume> (<issue>3</issue>), <fpage>190</fpage>&#x2013;<lpage>199</lpage>. <pub-id pub-id-type="doi">10.2174/1574893614666181212102749</pub-id>
</citation>
</ref>
<ref id="B92">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2023</year>). <source>Protein language model-powered 3D ligand binding site prediction from protein sequence</source>. <comment>arXiv preprint arXiv:2312.03016</comment>.</citation>
</ref>
<ref id="B93">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Exploring the computational methods for protein-ligand binding site prediction</article-title>. <source>Comput. Struct. Biotechnol. J.</source> <volume>18</volume>, <fpage>417</fpage>&#x2013;<lpage>426</lpage>. <pub-id pub-id-type="doi">10.1016/j.csbj.2020.02.008</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>