<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">876721</article-id>
<article-id pub-id-type="doi">10.3389/fgene.2022.876721</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Interpretable Deep Learning Model Reveals Subsequences of Various Functions for Long Non-Coding RNA Identification</article-title>
<alt-title alt-title-type="left-running-head">Lin and Wichadakul</alt-title>
<alt-title alt-title-type="right-running-head">Xlnc1DCNN</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Lin</surname>
<given-names>Rattaphon</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1679768/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Wichadakul</surname>
<given-names>Duangdao</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/994614/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Department of Computer Engineering</institution>, <institution>Faculty of Engineering</institution>, <institution>Chulalongkorn University</institution>, <addr-line>Pathumwan</addr-line>, <country>Thailand</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Center of Excellence in Systems Biology</institution>, <institution>Faculty of Medicine</institution>, <institution>Chulalongkorn University</institution>, <addr-line>Pathumwan</addr-line>, <country>Thailand</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/29084/overview">Sarath Chandra Janga</ext-link>, Indiana University, Purdue University Indianapolis, United States</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1681152/overview">Doaa Salem</ext-link>, Indiana University, Purdue University Indianapolis, United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/706176/overview">Tsukasa Fukunaga</ext-link>, Waseda University, Japan</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Duangdao Wichadakul, <email>duangdao.w@chula.ac.th</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>05</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>876721</elocation-id>
<history>
<date date-type="received">
<day>15</day>
<month>02</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>04</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Lin and Wichadakul.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Lin and Wichadakul</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Long non-coding RNAs (lncRNAs) play crucial roles in many biological processes and are implicated in several diseases. With next-generation sequencing technologies, a substantial number of unannotated transcripts have been discovered. Classifying unannotated transcripts using biological experiments is more time-consuming and expensive than using computational approaches. Several tools are available for identifying lncRNAs. These tools, however, do not explain which of their features contributed to the prediction results. Here, we present Xlnc1DCNN, a tool for distinguishing lncRNAs from protein-coding transcripts (PCTs) using a one-dimensional convolutional neural network with prediction explanations. The evaluation results on the human test set showed that Xlnc1DCNN outperformed other state-of-the-art tools in terms of accuracy and F1-score. The explanation results revealed that lncRNA transcripts were mainly identified as sequences with no conserved regions, short patterns with unknown functions, or only regions of transmembrane helices, while protein-coding transcripts were mostly classified by conserved protein domains or families. The explanation results also pointed to probable annotation inconsistencies among the public databases, as well as lncRNA transcripts that contain protein domains, protein families, or intrinsically disordered regions (IDRs). Xlnc1DCNN is freely available at <ext-link ext-link-type="uri" xlink:href="https://github.com/cucpbioinfo/Xlnc1DCNN">https://github.com/cucpbioinfo/Xlnc1DCNN</ext-link>.</p>
</abstract>
<kwd-group>
<kwd>long non-coding RNA (lncRNA)</kwd>
<kwd>one-dimensional convolutional neural network (1D CNN)</kwd>
<kwd>deep learning</kwd>
<kwd>explainable artificial intelligence (XAI)</kwd>
<kwd>SHAP (SHapley additive exPlanations)</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Long non-coding RNAs (lncRNAs) are RNAs that are not translated into proteins and are longer than 200 nucleotides. lncRNAs play important roles in many critical biological processes, including gene expression, gene regulation, gene silencing, chromatin remodeling, and molecular scaffolding (<xref ref-type="bibr" rid="B28">Rinn and Chang, 2012</xref>; <xref ref-type="bibr" rid="B23">Marchese et al., 2017</xref>; <xref ref-type="bibr" rid="B31">Statello et al., 2021</xref>), and have been implicated in human diseases such as cancers and diabetes (<xref ref-type="bibr" rid="B25">Mor&#xe1;n et al., 2012</xref>; <xref ref-type="bibr" rid="B10">Fang and Fullwood, 2016</xref>; <xref ref-type="bibr" rid="B4">Chan and Tay, 2018</xref>; <xref ref-type="bibr" rid="B15">Jin et al., 2020</xref>). Advances in next-generation sequencing technology, particularly RNA sequencing (RNA-Seq) (<xref ref-type="bibr" rid="B35">Wang et al., 2009</xref>; <xref ref-type="bibr" rid="B30">Stark et al., 2019</xref>), have led to the discovery of numerous unannotated transcripts. However, classifying this large number of unclassified sequences using experimental approaches is time-consuming and expensive. In contrast, computational approaches are faster and more convenient.</p>
<p>Most of the existing computational approaches for classifying lncRNA and protein-coding transcripts use feature extraction methods to obtain training features, e.g., the upgraded version of Coding Potential Calculator (CPC2) (<xref ref-type="bibr" rid="B16">Kang et al., 2017</xref>), CNIT (<xref ref-type="bibr" rid="B12">Guo et al., 2019</xref>), PLEK (<xref ref-type="bibr" rid="B18">Li et al., 2014</xref>), CPAT (<xref ref-type="bibr" rid="B36">Wang et al., 2013</xref>), FEELnc (<xref ref-type="bibr" rid="B37">Wucher et al., 2017</xref>), RNAsamba (<xref ref-type="bibr" rid="B3">Camargo et al., 2020</xref>), LncADeep (<xref ref-type="bibr" rid="B39">Yang et al., 2018</xref>), and lncRNA_Mdeep (<xref ref-type="bibr" rid="B9">Fan et al., 2020</xref>). Most of them use similar features, such as the Fickett score, the hexamer score, and the ORF length, supplemented with additional sequence and structural features. Moreover, none of them explains how the features contribute to the model prediction results.</p>
<p>Deep learning algorithms have become very popular, especially for datasets with a large number of data points and dimensions, because the algorithms learn the features themselves during training. Convolutional neural networks (CNNs), particularly 2D-CNNs, have been widely used for image classification and segmentation applications (<xref ref-type="bibr" rid="B38">Yamashita et al., 2018</xref>) because of their strong capability for extracting features from input data. Recently, many applications such as speech recognition and ECG monitoring (<xref ref-type="bibr" rid="B17">Kiranyaz et al., 2021</xref>) have started using 1D-CNNs instead of traditional machine learning approaches. Applications for detecting irregular heartbeats (<xref ref-type="bibr" rid="B1">Acharya et al., 2017</xref>; <xref ref-type="bibr" rid="B19">Li et al., 2019</xref>; <xref ref-type="bibr" rid="B14">Hsieh et al., 2020</xref>) have shown that even a simple 1D-CNN can achieve high prediction accuracy without explicitly extracting features as inputs for the model.</p>
<p>While most complex black-box models (e.g., boosting tree algorithms, ensemble models, deep neural networks) typically provide better learning performance, they are usually uninterpretable. To understand how a complex model learns to differentiate things, explainable artificial intelligence (XAI), which aims to interpret and explain machine learning or deep learning models, has recently become a popular topic (<xref ref-type="bibr" rid="B32">Tjoa and Guan, 2021</xref>). Explainable AI is essential for users to understand and trust the model prediction results. It can help illustrate what the models perceive and explain how these perceptions map to underlying human knowledge. Two of the favored approaches for obtaining an explanation from a complex black-box model are LIME (<xref ref-type="bibr" rid="B27">Ribeiro et al., 2016</xref>) and SHAP (<xref ref-type="bibr" rid="B21">Lundberg and Lee, 2017</xref>). LIME builds a local surrogate model to explain individual predictions. SHAP (SHapley Additive exPlanations) introduced SHAP values, a unified measure of feature importance, together with methods for estimating them. DeepSHAP (<xref ref-type="bibr" rid="B6">Chen et al., 2019</xref>) builds on the connection between the original SHAP and DeepLIFT (<xref ref-type="bibr" rid="B29">Shrikumar et al., 2017</xref>) to explain deep learning models, and was further refined and extended with relative background distributions and stacks of mixed model types.</p>
<p>Given the remaining ambiguities in classifying lncRNA and mRNA sequences based on training features, together with the promising results of 1D-CNNs in previous applications, in this paper we propose Xlnc1DCNN, a 1D-CNN model for classifying lncRNAs and mRNAs with an explanation. The model uses only nucleotide sequences as training input. On the human test set, Xlnc1DCNN outperformed all other models in terms of accuracy and F1-score. On the cross-species datasets, Xlnc1DCNN also generalized across the tested species. To explain how Xlnc1DCNN distinguished lncRNA from mRNA transcript sequences, we applied DeepSHAP to generate SHAP values representing what the model captured, and visualized the contribution of each nucleotide using in-house Python code. The explanation of true positives (i.e., lncRNA transcript sequences) showed that the model classified a sequence as lncRNA if the sequence did not contain any important regions or contained only an N-terminal signal peptide or transmembrane helices. The explanation of true negatives (i.e., mRNA transcript sequences) showed that the model learned protein domains/families from the input transcript sequences and used them to predict the sequences as mRNAs. The explanation of false positives (i.e., mRNA transcript sequences predicted as lncRNAs) showed that the model either could not capture any important regions representing protein domains/families or found important regions contributing to both lncRNA and mRNA prediction. A few false positive sequences were also found to have inconsistent transcript types among the databases. Lastly, the explanation of false negatives (i.e., lncRNA transcript sequences predicted as mRNAs) showed that the model captured protein domains or families within these lncRNA sequences and, hence, misclassified them as mRNAs.</p>
</sec>
<sec id="s2">
<title>2 Materials and Methods</title>
<sec id="s2-1">
<title>2.1 Data Compilation and Pre-Processing</title>
<p>The human transcript datasets for training the model were obtained from GENCODE (<xref ref-type="bibr" rid="B11">Frankish et al., 2018</xref>) and LNCipedia (<xref ref-type="bibr" rid="B34">Volders et al., 2018</xref>). GENCODE (release 32) contains 48,351 sequences of lncRNA transcripts and 100,291 sequences of protein-coding transcripts (PCTs). For LNCipedia (version 5.2), only high-confidence sequences were selected, which resulted in 107,039 lncRNA transcripts. To remove lncRNA transcript sequences from LNCipedia that duplicate those in GENCODE, we used CD-HIT-EST-2D (<xref ref-type="bibr" rid="B20">Li and Godzik, 2006</xref>) to compare lncRNA sequences between LNCipedia and GENCODE and filtered out the sequences with more than 95% similarity from the LNCipedia dataset. A total of 72,803 lncRNA sequences from LNCipedia remained. We then pre-processed the sequences used for training the Xlnc1DCNN model by discarding the sequences shorter than 200 bases or longer than 3,000 bases. After filtering, one-hot encoding was used to encode the sequences. The total number of remaining sequences after cleansing was 185,031, with 108,578 lncRNAs and 76,453 PCTs (<xref ref-type="table" rid="T1">Table 1</xref>). The lncRNAs and PCTs were set as the positive and negative classes, respectively. The dataset was split with stratification into training (80%) and test (20%) sets.</p>
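<p>The pre-processing steps described above (length filtering, one-hot encoding, and a stratified 80/20 split) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the A/C/G/T channel order, the all-zero encoding of ambiguous bases, and the random seed are all assumptions introduced here.</p>
<preformat>
```python
import random

ALPHABET = "ACGT"  # assumed channel order; the article does not state it


def keep_length(seq, min_len=200, max_len=3000):
    """Keep sequences that are neither shorter than 200 nor longer than 3,000 bases."""
    return max_len >= len(seq) >= min_len


def one_hot(seq):
    """Encode each base as a 4-element vector; other symbols (e.g. N) become all zeros (assumption)."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq.upper()]


def stratified_split(records, test_fraction=0.2, seed=0):
    """Split (sequence, label) pairs so that each label keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_label = {}
    for record in records:
        by_label.setdefault(record[1], []).append(record)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]  # copy so the caller's list is not shuffled
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```
</preformat>
<p>In this sketch, label 1 would stand for lncRNA (positive class) and label 0 for PCT (negative class), mirroring the class assignment described in the text.</p>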
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Summary of datasets from GENCODE and LNCipedia.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Sequence Type</th>
<th align="center">Species</th>
<th align="center">Data Source</th>
<th align="center">Dataset Size</th>
<th align="center">&#x3c;200 bps</th>
<th align="center">&#x3e;3,000 bps</th>
<th align="center">No. of Transcripts after Cleansing</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">mRNA</td>
<td align="center">Human</td>
<td align="center">GENCODE (release 32)</td>
<td align="center">100,291</td>
<td align="char" char=".">374</td>
<td align="center">23,464</td>
<td align="center">76,453</td>
</tr>
<tr>
<td align="left">lncRNA</td>
<td align="center">Human</td>
<td align="center">GENCODE (release 32)</td>
<td align="center">48,351</td>
<td align="char" char=".">291</td>
<td align="center">3,486</td>
<td align="center">44,574</td>
</tr>
<tr>
<td align="left">lncRNA</td>
<td align="center">Human</td>
<td align="center">LNCipedia (version 5.2)</td>
<td align="center">72,803</td>
<td align="char" char=".">0</td>
<td align="center">8,799</td>
<td align="center">64,004</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Cross-species datasets included the mouse dataset obtained from GENCODE (<xref ref-type="bibr" rid="B11">Frankish et al., 2018</xref>) (release M23) and the gorilla, chicken, and cow datasets obtained from Ensembl (<xref ref-type="bibr" rid="B8">Cunningham et al., 2019</xref>) (release 102). We pre-processed the cross-species datasets by discarding the sequences shorter than 200 bases or longer than 3,000 bases. We then randomly selected the mRNA and lncRNA sequences for each species. The test sets for gorilla, chicken, cow, and mouse contained 8,000, 8,000, 11,000, and 32,000 sequences, respectively, each with an equal number of sequences from each class.</p>
</sec>
<sec id="s2-2">
<title>2.2 Model Architecture</title>
<p>In this study, we designed and implemented the Xlnc1DCNN model in Python 3 using TensorFlow on an NVIDIA GeForce GTX 1080 Ti GPU and an Intel Xeon Silver 4112 processor. The resulting model distinguishes lncRNAs from mRNAs (PCTs) and outperformed the existing tools on the human dataset. The model architecture consists of three convolutional layers, each followed by a pooling layer, two fully connected layers, and a Softmax layer. We used ReLU as the activation function for the convolutional and fully connected layers. We also found that adding a dropout layer after each pooling layer made the model perform slightly better.</p>
<p>We used 10% of the data from the training set to optimize the kernel size, dropout rate, stride size, batch size, and learning rate using a grid search. The best kernel size was 57, with a stride size of 1. For almost every kernel size, the model performance started to decrease as the stride size increased. For training, the momentum, learning rate, number of epochs, and batch size were 0.9, 0.01, 120, and 128, respectively, with stochastic gradient descent as the optimizer. The final hyperparameters used in the model architecture are shown in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Hyperparameters of the proposed 1D-CNN architecture.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Layer</th>
<th align="center">Hyperparameter</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Conv 1D</td>
<td align="center">kernel size &#x3d; 57, stride &#x3d; 1</td>
</tr>
<tr>
<td align="left">Max-Pooling</td>
<td align="center">pool size &#x3d; 2</td>
</tr>
<tr>
<td align="left">Dropout</td>
<td align="center">
<italic>p</italic> &#x3d; 0.3</td>
</tr>
<tr>
<td align="left">Conv 1D</td>
<td align="center">kernel size &#x3d; 57, stride &#x3d; 1</td>
</tr>
<tr>
<td align="left">Max-Pooling</td>
<td align="center">pool size &#x3d; 2</td>
</tr>
<tr>
<td align="left">Dropout</td>
<td align="center">
<italic>p</italic> &#x3d; 0.3</td>
</tr>
<tr>
<td align="left">Conv 1D</td>
<td align="center">kernel size &#x3d; 57, stride &#x3d; 1</td>
</tr>
<tr>
<td align="left">Max-Pooling</td>
<td align="center">pool size &#x3d; 2</td>
</tr>
<tr>
<td align="left">Dropout</td>
<td align="center">
<italic>p</italic> &#x3d; 0.3</td>
</tr>
<tr>
<td align="left">Flatten</td>
<td align="center">-</td>
</tr>
<tr>
<td align="left">Dense</td>
<td align="center">256</td>
</tr>
<tr>
<td align="left">Dropout</td>
<td align="center">
<italic>p</italic> &#x3d; 0.5</td>
</tr>
<tr>
<td align="left">Dense</td>
<td align="center">256</td>
</tr>
<tr>
<td align="left">Dropout</td>
<td align="center">
<italic>p</italic> &#x3d; 0.5</td>
</tr>
<tr>
<td align="left">Softmax</td>
<td align="center">2</td>
</tr>
</tbody>
</table>
</table-wrap>
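<p>To make the effect of the Table 2 settings concrete, the following sketch tracks how the sequence length shrinks through the three Conv 1D and Max-Pooling blocks. It is only an arithmetic illustration under assumptions the article does not state: no padding in the convolutions, a pooling stride equal to the pool size, and inputs padded to the 3,000-base maximum length. The function names are hypothetical.</p>
<preformat>
```python
def conv_out_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding (assumed 'valid' padding)."""
    return (length - kernel_size) // stride + 1


def pool_out_len(length, pool_size=2):
    """Output length of non-overlapping max-pooling (stride equal to pool size, assumed)."""
    return length // pool_size


def feature_lengths(input_len=3000, kernel_size=57, n_blocks=3):
    """Sequence length after each Conv 1D + Max-Pooling block listed in Table 2."""
    lengths = []
    current = input_len
    for _ in range(n_blocks):
        current = pool_out_len(conv_out_len(current, kernel_size))
        lengths.append(current)
    return lengths
```
</preformat>
<p>Under these assumptions, a 3,000-base input is reduced to 1,472, 708, and 326 positions after the first, second, and third block, respectively; dropout does not change these lengths.</p>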
</sec>
<sec id="s2-3">
<title>2.3 Model Interpretation</title>
<p>DeepSHAP was used to interpret how the proposed Xlnc1DCNN model classified lncRNAs and mRNAs from the input transcript sequences. As DeepSHAP needs background distributions as references to approximate the Shapley values based on conditional expectations, 175 sequences from each class were randomly selected as the representative background. The total of 350 background sequences was limited by the available GPU.</p>
<p>The output from DeepSHAP is a set of SHAP values representing each nucleotide&#x2019;s contribution to the model prediction. To obtain one SHAP value per nucleotide within a sequence, we summed the SHAP values across the one-hot encoding array at each position. To visualize the SHAP values of an input transcript sequence, we further summed the SHAP values of three consecutive nucleotides, which potentially represent an amino acid, and generated the results in three reading frames. We then plotted a colored line for each representative amino acid. The blue and red colors indicate the contribution of each amino acid toward classifying the sequence as an lncRNA and an mRNA, respectively (<xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>The process to obtain SHAP values for explaining the nucleotide contribution that was captured by the model to differentiate lncRNA from mRNA transcript sequences.</p>
</caption>
<graphic xlink:href="fgene-13-876721-g001.tif"/>
</fig>
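<p>The aggregation of SHAP values described in this section can be sketched as follows. This is an illustrative reconstruction, not the authors' in-house code: it assumes the DeepSHAP output for one sequence is available as a list of per-position rows with four values (one per one-hot channel), and the function names are hypothetical. Incomplete triplets at the end of a frame are dropped here, which is also an assumption.</p>
<preformat>
```python
def nucleotide_scores(shap_matrix):
    """Sum the one-hot channels at each position into a single SHAP value per nucleotide."""
    return [sum(row) for row in shap_matrix]


def codon_scores(per_nucleotide, frame):
    """Sum consecutive triplets starting at offset `frame` (0, 1, or 2); drop incomplete tails."""
    return [
        sum(per_nucleotide[i:i + 3])
        for i in range(frame, len(per_nucleotide) - 2, 3)
    ]


def three_frame_scores(shap_matrix):
    """Return one list of triplet-level SHAP sums for each of the three reading frames."""
    per_nt = nucleotide_scores(shap_matrix)
    return [codon_scores(per_nt, frame) for frame in range(3)]
```
</preformat>
<p>Each returned list could then be drawn as one track, with the sign of each value selecting the color that indicates a contribution toward the lncRNA or the mRNA class.</p>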
</sec>
<sec id="s2-4">
<title>2.4 Evaluation</title>
<sec id="s2-4-1">
<title>2.4.1 Model Evaluation Metrics</title>
<p>To compare the performance of the proposed Xlnc1DCNN model with that of other existing tools, we used the following metrics, which are defined in terms of four counts. True positives (TP) are lncRNA transcript sequences that are predicted as lncRNAs. True negatives (TN) are PCTs that are predicted as PCTs. False positives (FP) are PCTs that are predicted as lncRNAs. False negatives (FN) are lncRNAs that are predicted as PCTs.<disp-formula id="equ1">
<mml:math id="m1">
<mml:mrow>
<mml:mi mathvariant="italic">Accuracy</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">TN</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">TN</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="equ2">
<mml:math id="m2">
<mml:mrow>
<mml:mi mathvariant="italic">Sensitivity</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="equ3">
<mml:math id="m3">
<mml:mrow>
<mml:mi mathvariant="italic">Specificity</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">TN</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">TN</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="equ4">
<mml:math id="m4">
<mml:mrow>
<mml:mi mathvariant="italic">Precision</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="equ5">
<mml:math id="m5">
<mml:mrow>
<mml:mi mathvariant="italic">F</mml:mi>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="italic">Score</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mi mathvariant="italic">precision</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi mathvariant="italic">sensitivity</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">precision</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="italic">sensitivity</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
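<p>As a worked check of the definitions above, the following sketch computes the five metrics directly from the four counts. The function names are hypothetical; the formulas are exactly those given in the equations, with the lncRNA class treated as positive.</p>
<preformat>
```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity, precision, and F1 from confusion counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


def as_percentages(metrics, digits=2):
    """Express each metric as a percentage rounded to `digits` decimal places."""
    return {name: round(100 * value, digits) for name, value in metrics.items()}
```
</preformat>
<p>For example, calling <monospace>as_percentages(classification_metrics(20895, 1204, 14087, 821))</monospace> with the Xlnc1DCNN counts reported in Table 3 gives an accuracy of 94.53, a sensitivity of 96.22, a specificity of 92.13, a precision of 94.55, and an F1-score of 95.38, matching that table.</p>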
</sec>
<sec id="s2-4-2">
<title>2.4.2 Interpretation Evaluation Method</title>
<p>To compare the explanation results of Xlnc1DCNN on the human test set with known biological knowledge, we utilized the available bioinformatics tools/databases such as TMHMM (<xref ref-type="bibr" rid="B41">Krogh et al., 2001</xref>) to identify transmembrane helices, Pfam (<xref ref-type="bibr" rid="B42">Mistry et al., 2020</xref>), and InterPro (<xref ref-type="bibr" rid="B40">Blum et al., 2020</xref>) to identify protein domains or families for all sequences in the test set. From InterPro, we considered InterPro entries, which include InterPro domain, family, homologous superfamily, repeat, and sites (i.e., active site, binding site, conserved site, PTM site). MobiDB (integrated within InterPro) (<xref ref-type="bibr" rid="B26">Piovesan et al., 2021</xref>) was also used to identify intrinsically disordered regions within sequences.</p>
</sec>
</sec>
</sec>
<sec id="s3">
<title>3 Results</title>
<sec id="s3-1">
<title>3.1 Model Evaluation Results</title>
<p>We compared the performance of Xlnc1DCNN with eight existing tools: CPC2, CPAT, CNIT, PLEK, FEELnc, RNAsamba, LncADeep, and lncRNA_Mdeep (<xref ref-type="bibr" rid="B36">Wang et al., 2013</xref>; <xref ref-type="bibr" rid="B18">Li et al., 2014</xref>; <xref ref-type="bibr" rid="B16">Kang et al., 2017</xref>; <xref ref-type="bibr" rid="B37">Wucher et al., 2017</xref>; <xref ref-type="bibr" rid="B39">Yang et al., 2018</xref>; <xref ref-type="bibr" rid="B12">Guo et al., 2019</xref>; <xref ref-type="bibr" rid="B3">Camargo et al., 2020</xref>; <xref ref-type="bibr" rid="B9">Fan et al., 2020</xref>), with the versions listed in <xref ref-type="sec" rid="s11">Supplementary Table S1</xref>. To make the evaluation as fair and unbiased as possible, we retrained CPAT, FEELnc, and RNAsamba, which provide a training option, on our human training dataset, and used the pre-trained models of CPC2, CNIT, and LncADeep, which do not provide a training option. Although PLEK and lncRNA_Mdeep also provide a training option, retraining them was very time-consuming, so we used their default pre-trained models instead.</p>
<sec id="s3-1-1">
<title>3.1.1 Performance Evaluation on the Human Test Set</title>
<p>The results on the human test set (<xref ref-type="table" rid="T3">Table 3</xref>) show that Xlnc1DCNN achieved the highest accuracy (94.53%) and F1-score (95.38%), the second-highest precision (94.55%), slightly lower than that of LncADeep, and the third-highest specificity (92.13%), slightly lower than those of LncADeep and FEELnc. CPC2, CNIT, and CPAT achieved high sensitivity but much lower specificity. FEELnc, RNAsamba, LncADeep, and lncRNA_Mdeep performed well across all metrics but were, overall, still behind Xlnc1DCNN. We then analyzed the classification power of each tool by plotting a receiver operating characteristic (ROC) curve and measuring the area under the curve (AUC), as shown in <xref ref-type="fig" rid="F2">Figure 2A</xref>, where Xlnc1DCNN achieved the highest AUC (0.9825) on the human test set. <xref ref-type="fig" rid="F2">Figure 2B</xref> shows that Xlnc1DCNN also outperformed all tools in every range of sequence lengths of the human test set (<xref ref-type="sec" rid="s11">Supplementary Table S2</xref>).</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Evaluation results of all tools on the human test set.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Model</th>
<th align="center">TP</th>
<th align="center">FP</th>
<th align="center">TN</th>
<th align="center">FN</th>
<th align="center">Accuracy</th>
<th align="center">Sensitivity</th>
<th align="center">Specificity</th>
<th align="center">Precision</th>
<th align="center">F1</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Xlnc1DCNN</td>
<td align="center">20,895</td>
<td align="center">1,204</td>
<td align="center">14,087</td>
<td align="center">821</td>
<td align="char" char=".">
<bold>94.53</bold>
</td>
<td align="char" char=".">96.22</td>
<td align="char" char=".">92.13</td>
<td align="char" char=".">94.55</td>
<td align="char" char=".">
<bold>95.38</bold>
</td>
</tr>
<tr>
<td align="left">CPC2</td>
<td align="center">21,023</td>
<td align="center">6,457</td>
<td align="center">8,834</td>
<td align="center">693</td>
<td align="char" char=".">80.68</td>
<td align="char" char=".">96.81</td>
<td align="char" char=".">57.77</td>
<td align="char" char=".">76.50</td>
<td align="char" char=".">85.47</td>
</tr>
<tr>
<td align="left">CNIT</td>
<td align="center">21,307</td>
<td align="center">3,580</td>
<td align="center">11,711</td>
<td align="center">409</td>
<td align="char" char=".">89.22</td>
<td align="char" char=".">
<bold>98.12</bold>
</td>
<td align="char" char=".">76.59</td>
<td align="char" char=".">85.61</td>
<td align="char" char=".">91.44</td>
</tr>
<tr>
<td align="left">PLEK</td>
<td align="center">20,704</td>
<td align="center">6,665</td>
<td align="center">8,626</td>
<td align="center">1,012</td>
<td align="char" char=".">79.26</td>
<td align="char" char=".">95.34</td>
<td align="char" char=".">56.41</td>
<td align="char" char=".">75.65</td>
<td align="char" char=".">84.36</td>
</tr>
<tr>
<td align="left">CPAT</td>
<td align="center">20,646</td>
<td align="center">2,597</td>
<td align="center">12,694</td>
<td align="center">1,070</td>
<td align="char" char=".">90.09</td>
<td align="char" char=".">95.07</td>
<td align="char" char=".">83.02</td>
<td align="char" char=".">88.83</td>
<td align="char" char=".">91.84</td>
</tr>
<tr>
<td align="left">FEELnc</td>
<td align="center">20,023</td>
<td align="center">1,182</td>
<td align="center">14,109</td>
<td align="center">1,693</td>
<td align="char" char=".">92.23</td>
<td align="char" char=".">92.20</td>
<td align="char" char=".">92.27</td>
<td align="char" char=".">94.43</td>
<td align="char" char=".">93.30</td>
</tr>
<tr>
<td align="left">RNAsamba</td>
<td align="center">20,998</td>
<td align="center">1,795</td>
<td align="center">13,496</td>
<td align="center">718</td>
<td align="char" char=".">93.21</td>
<td align="char" char=".">96.69</td>
<td align="char" char=".">88.26</td>
<td align="char" char=".">92.12</td>
<td align="char" char=".">94.35</td>
</tr>
<tr>
<td align="left">lncRNA_Mdeep</td>
<td align="center">20,813</td>
<td align="center">1,799</td>
<td align="center">13,492</td>
<td align="center">903</td>
<td align="char" char=".">92.70</td>
<td align="char" char=".">95.84</td>
<td align="char" char=".">88.23</td>
<td align="char" char=".">92.04</td>
<td align="char" char=".">93.90</td>
</tr>
<tr>
<td align="left">LncADeep</td>
<td align="center">20,232</td>
<td align="center">1,113</td>
<td align="center">14,178</td>
<td align="center">1,484</td>
<td align="char" char=".">92.98</td>
<td align="char" char=".">93.17</td>
<td align="char" char=".">
<bold>92.72</bold>
</td>
<td align="char" char=".">
<bold>94.79</bold>
</td>
<td align="char" char=".">93.97</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>The bold values indicate the highest value within each column.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>
<bold>(A)</bold> ROC curves of all tools and their AUCs on the human test set. <bold>(B)</bold> Accuracy of all tools across ranges of sequence lengths of the human test set.</p>
</caption>
<graphic xlink:href="fgene-13-876721-g002.tif"/>
</fig>
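<p>For readers unfamiliar with the AUC reported above, the following sketch computes it from predicted scores using its rank interpretation: the probability that a randomly chosen positive (lncRNA) receives a higher score than a randomly chosen negative (PCT), with ties counted as one half. The function name and the example scores in the usage note are hypothetical, and the pairwise loop is chosen for clarity rather than speed; it is not the procedure used to produce Figure 2.</p>
<preformat>
```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```
</preformat>
<p>For instance, with labels <monospace>[1, 1, 0, 0]</monospace> and scores <monospace>[0.9, 0.4, 0.5, 0.1]</monospace>, three of the four positive/negative pairs are ordered correctly, so the function returns 0.75. For test sets of the size used here, a sort-based implementation would be preferable to this quadratic loop.</p>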
</sec>
<sec id="s3-1-2">
<title>3.1.2 Performance Evaluation on Cross-Species Datasets</title>
<p>To evaluate how well Xlnc1DCNN generalizes to other species, we compared the model with the other tools using the mouse, gorilla, chicken, and cow datasets. The evaluation results show that Xlnc1DCNN, which was trained on the human dataset, generalizes to classifying lncRNAs and mRNAs of other species (<xref ref-type="table" rid="T4">Table 4</xref> and <xref ref-type="sec" rid="s11">Supplementary Tables S3&#x2013;S6</xref>). Xlnc1DCNN achieved the highest accuracy on the gorilla dataset, together with RNAsamba, and the second-highest accuracy on the mouse dataset, while LncADeep achieved the highest accuracy on the mouse and cow datasets. <xref ref-type="fig" rid="F3">Figure 3</xref> shows that the ROC curves and AUCs of Xlnc1DCNN are close to those of the other tools on the cross-species datasets. Overall, based on the AUCs, LncADeep had the best generalization performance on the cross-species datasets.</p>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Accuracy (%) of the nine models on cross-species datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Model</th>
<th align="center">Mouse</th>
<th align="center">Gorilla</th>
<th align="center">Chicken</th>
<th align="center">Cow</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Xlnc1DCNN</td>
<td align="char" char=".">92.58</td>
<td align="char" char=".">
<bold>96.06</bold>
</td>
<td align="char" char=".">92.35</td>
<td align="char" char=".">95.92</td>
</tr>
<tr>
<td align="left">CPC2</td>
<td align="char" char=".">80.06</td>
<td align="char" char=".">94.96</td>
<td align="char" char=".">93.51</td>
<td align="char" char=".">94.48</td>
</tr>
<tr>
<td align="left">CNIT</td>
<td align="char" char=".">87.68</td>
<td align="char" char=".">94.00</td>
<td align="char" char=".">92.94</td>
<td align="char" char=".">95.18</td>
</tr>
<tr>
<td align="left">PLEK</td>
<td align="char" char=".">73.62</td>
<td align="char" char=".">89.53</td>
<td align="char" char=".">79.54</td>
<td align="char" char=".">86.22</td>
</tr>
<tr>
<td align="left">CPAT</td>
<td align="char" char=".">89.46</td>
<td align="char" char=".">95.1</td>
<td align="char" char=".">93.70</td>
<td align="char" char=".">95.52</td>
</tr>
<tr>
<td align="left">FEELnc</td>
<td align="char" char=".">90.51</td>
<td align="char" char=".">94.8</td>
<td align="char" char=".">92.75</td>
<td align="char" char=".">93.97</td>
</tr>
<tr>
<td align="left">RNAsamba</td>
<td align="char" char=".">91.91</td>
<td align="char" char=".">
<bold>96.06</bold>
</td>
<td align="char" char=".">
<bold>93.98</bold>
</td>
<td align="char" char=".">96.39</td>
</tr>
<tr>
<td align="left">LncADeep</td>
<td align="char" char=".">
<bold>94.95</bold>
</td>
<td align="char" char=".">96.05</td>
<td align="char" char=".">93.46</td>
<td align="char" char=".">
<bold>96.70</bold>
</td>
</tr>
<tr>
<td align="left">lncRNA_Mdeep</td>
<td align="char" char=".">91.38</td>
<td align="char" char=".">95.58</td>
<td align="char" char=".">92.59</td>
<td align="char" char=".">95.63</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>The bold values indicate the highest value within each column.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Receiver operating characteristic curves and AUCs of nine models on the datasets of <bold>(A)</bold> mouse, <bold>(B)</bold> gorilla, <bold>(C)</bold> cow, and <bold>(D)</bold> chicken.</p>
</caption>
<graphic xlink:href="fgene-13-876721-g003.tif"/>
</fig>
</sec>
</sec>
<sec id="s3-2">
<title>3.2 Model Interpretation Results</title>
<p>As Xlnc1DCNN outperformed other tools on the human test set, we assumed that the 1D-CNN captured patterns within sequences that could be used to distinguish lncRNAs from mRNAs. To explain the model, we used DeepSHAP to describe the contribution of each nucleotide to the prediction results. The explanation output from DeepSHAP consisted of SHAP values for all nucleotides of the entire sequence. This explanation result was then visualized based on the summed SHAP values of every three consecutive nucleotides, with important representative amino acids highlighted in the sequence.</p>
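<p>As a minimal sketch of the summation step described above, the following Python snippet sums per-nucleotide SHAP values over consecutive, non-overlapping windows of three nucleotides. The function name sum_shap_by_triplet, the example values, and the choice to start the windows at the first nucleotide are illustrative assumptions and are not taken from the implementation used in this study.</p>
<preformat>
```python
def sum_shap_by_triplet(shap_values):
    """Sum per-nucleotide SHAP values over non-overlapping windows of three.

    Assumption: windows start at the first nucleotide; a trailing window
    with fewer than three nucleotides is summed as-is.
    """
    sums = []
    for start in range(0, len(shap_values), 3):
        window = shap_values[start:start + 3]
        sums.append(sum(window))
    return sums


# Hypothetical SHAP values for a 7-nt sequence (not real model output).
example = [0.5, -0.25, 0.25, 1.0, 1.0, -0.5, 0.125]
print(sum_shap_by_triplet(example))
```
</preformat>
<p>Each returned value could then be mapped to a color intensity for the corresponding three-nucleotide window; a different reading frame would only require shifting the starting index.</p>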
<p>In the following subsections, we present the explanation results of Xlnc1DCNN focusing on the true positive, true negative, false positive, and false negative sequences predicted by Xlnc1DCNN on the human test set.</p>
<sec id="s3-2-1">
<title>3.2.1 True Positive Sequences</title>
<p>The explanation results of Xlnc1DCNN highlighted, in blue, the important regions that contributed to the correct classification of an input lncRNA transcript sequence as a lncRNA. In <xref ref-type="fig" rid="F4">Figures 4A,B</xref>, the explanation results of ENST00000658844.1 and lnc-REXO4-2:1 suggest that Xlnc1DCNN classified a transcript sequence as a lncRNA if it did not capture any important regions or specific patterns within the sequence. Additional explanation results of the TP sequences are shown in <xref ref-type="sec" rid="s11">Supplementary Figures S1&#x2013;S8</xref>.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Explanation results of Xlnc1DCNN on TP sequences <bold>(A)</bold> ENST00000658844.1, a lncRNA sequence obtained from GENCODE and <bold>(B)</bold> lnc-REXO4-2:1, a lncRNA sequence obtained from LNCipedia.</p>
</caption>
<graphic xlink:href="fgene-13-876721-g004.tif"/>
</fig>
</sec>
<sec id="s3-2-2">
<title>3.2.2 True Negative Sequences</title>
<p>The explanation results of Xlnc1DCNN highlighted the important regions of a protein-coding transcript (i.e., mRNA) as red, as shown in <xref ref-type="fig" rid="F5">Figures 5A&#x2013;C</xref>. <xref ref-type="fig" rid="F5">Figure 5D</xref> shows the transmembrane helix regions of the ENST00000528724.5 transcript predicted by TMHMM, corresponding to the important regions captured by Xlnc1DCNN. The prediction results of TMHMM and the explanation results of Xlnc1DCNN have similar patterns in several other mRNA transcripts within the test set (<xref ref-type="sec" rid="s11">Supplementary Figures S9 and S10</xref>). <xref ref-type="fig" rid="F5">Figure 5E</xref> shows the KRAB box (Kr&#xfc;ppel associated box) identified by Pfam within the transcript ENST00000593088.5, which mostly overlapped with the important region captured by Xlnc1DCNN as shown in <xref ref-type="fig" rid="F5">Figure 5B</xref>. <xref ref-type="fig" rid="F5">Figure 5F</xref> shows the FAM32A family (family with sequence similarity 32 member A) identified by InterPro within the ENST00000589852.5 transcript, which corresponds to the important region of the ENST00000589852.5 identified by Xlnc1DCNN as shown in <xref ref-type="fig" rid="F5">Figure 5C</xref>. This transcript has been linked to an ovarian tumor-associated gene (<xref ref-type="bibr" rid="B5">Chen et al., 2011</xref>). Additional explanation results of the TN sequences are shown in <xref ref-type="sec" rid="s11">Supplementary Figures S11&#x2013;S14</xref>.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Comparison between the explanation results of Xlnc1DCNN on TN sequences <bold>(A)</bold> ENST00000528724.5, <bold>(B)</bold> ENST00000593088.5, and <bold>(C)</bold> ENST00000589852.5 protein-coding transcripts; and <bold>(D)</bold> prediction result of the TMHMM program on ENST00000528724.5, <bold>(E)</bold> KRAB (Kr&#xfc;ppel associated box) domain identified by Pfam within ENST00000593088.5, and <bold>(F)</bold> FAM32A family identified by InterPro within the ENST00000589852.5 transcript.</p>
</caption>
<graphic xlink:href="fgene-13-876721-g005.tif"/>
</fig>
</sec>
<sec id="s3-2-3">
<title>3.2.3 False Positive Sequences</title>
<p>False positive sequences are mRNA transcript sequences that are predicted as lncRNAs. <xref ref-type="fig" rid="F6">Figure 6A</xref> shows the explanation result of ENST00000408930.6, which did not contain any important regions (in red) contributing to a prediction as an mRNA. <xref ref-type="fig" rid="F6">Figures 6B,C</xref> show the Pfam and InterPro results, neither of which identified any protein domains or families within the ENST00000408930.6 protein-coding transcript. While the Ensembl database reports ENST00000408930.6 as a protein-coding transcript of the HEPN1 (ENSG00000221932) gene, the Gene database at NCBI reports HEPN1 as an ncRNA gene (<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/gene/641654">https://www.ncbi.nlm.nih.gov/gene/641654</ext-link>) and the RefSeq database reports NR_170124.1 (ENST00000408930.6) as a long non-coding RNA (<ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/nuccore/NR_170124.1">https://www.ncbi.nlm.nih.gov/nuccore/NR_170124.1</ext-link>). Based on our evaluation, the top five lncRNA identification tools (Xlnc1DCNN, RNAsamba, LncADeep, lncRNA_Mdeep, and FEELnc) all predicted this sequence as a lncRNA. This sequence is an example of inconsistent annotations among public databases that affect model performance and evaluation. Additional explanation results of the FP sequences are shown in <xref ref-type="sec" rid="s11">Supplementary Figures S15&#x2013;S19</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Comparison between <bold>(A)</bold> the explanation result of Xlnc1DCNN on the ENST00000408930.6 protein-coding transcript, predicted as a lncRNA, <bold>(B)</bold> identification result from Pfam, and <bold>(C)</bold> identification result from InterPro.</p>
</caption>
<graphic xlink:href="fgene-13-876721-g006.tif"/>
</fig>
</sec>
<sec id="s3-2-4">
<title>3.2.4 False Negative Sequences</title>
<p>False negative sequences are lncRNA transcript sequences that are predicted as mRNAs. <xref ref-type="fig" rid="F7">Figures 7A,B</xref> show the explanation results of two lncRNAs, lnc-SIGIRR-2:1 and ENST00000616537.4, with important regions that contributed to their wrong prediction as mRNA transcripts. These regions correspond to the Anoctamin and Taxilin families identified by InterPro, as shown in <xref ref-type="fig" rid="F7">Figures 7C,D</xref>. Additional explanation results of the FN sequences are shown in <xref ref-type="sec" rid="s11">Supplementary Figures S20&#x2013;S25</xref>.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Comparison between the explanation result of Xlnc1DCNN on the long non-coding RNA transcripts <bold>(A)</bold> lnc-SIGIRR-2:1 and <bold>(B)</bold> ENST00000616537.4, predicted as mRNAs; <bold>(C)</bold> Anoctamin family within the lnc-SIGIRR-2:1 transcript and <bold>(D)</bold> Taxilin family within the ENST00000616537.4 transcript identified by InterPro.</p>
</caption>
<graphic xlink:href="fgene-13-876721-g007.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4 Discussion</title>
<p>The explanation results of Xlnc1DCNN on the true positive sequences (TPs) show that most of the lncRNAs were found with either no conserved regions or patterns in short regions of unknown function, i.e., the highlighted regions do not correspond to any InterPro entries (<xref ref-type="sec" rid="s11">Supplementary Figures S1, S2</xref>). The important regions of some other lncRNA sequences highlighted transmembrane helices (<xref ref-type="sec" rid="s11">Supplementary Figures S3&#x2013;S5</xref>) or signal peptides (<xref ref-type="sec" rid="s11">Supplementary Figures S6&#x2013;S8</xref>). In recent years, some studies have also found a transmembrane helix inside lncRNAs (<xref ref-type="bibr" rid="B2">Anderson et al., 2015</xref>; <xref ref-type="bibr" rid="B22">Makarewich, 2020</xref>) and hidden peptides encoded within non-coding RNAs (<xref ref-type="bibr" rid="B24">Matsumoto and Nakayama, 2018</xref>). These findings correspond to what Xlnc1DCNN has learned and highlighted via the explanation results as important regions for classifying a sequence as a lncRNA. Out of 20,895&#xa0;TPs, only 1,692 (8.10%) TPs were found with InterPro entries, 9,833 (47.06%) TPs were found with only intrinsically disordered regions (IDRs), and 7,713 (36.91%) TPs were found with transmembrane helices identified by TMHMM without any InterPro entries. Although 8.10% of TPs were found with InterPro entries, the top protein domains and families of the TPs were found in only a few TNs (&#x2264;5) on the test set (<xref ref-type="sec" rid="s11">Supplementary Tables S7, S8</xref>).</p>
<p>On the true negative sequences (TNs), the explanation results of Xlnc1DCNN show that the model could capture the regions representing the protein domains or families in the transcript sequences. Out of 14,087&#xa0;TNs, 13,085 (92.89%), 822 (5.84%), and 289 (2.05%) TNs were found with InterPro entries, with only IDRs, and with transmembrane helices identified by TMHMM without any InterPro entries, respectively. Hence, the model could classify most of the input mRNA sequences correctly as protein-coding transcripts.</p>
<p>The explanation results of false positive sequences (FPs) typically do not contain the important regions (red color) that contribute to a model prediction as mRNA. Out of 1,204 FPs, 500 (41.53%) FPs were found without any InterPro entries, 359 (29.82%) FPs were found with only IDRs, and 161 (13.37%) FPs were found with transmembrane helices without any InterPro entries.</p>
<p>For false negative sequences (FNs), out of a total of 821 FNs, 463 (56.39%) were found with InterPro entries, and the explanation results of these FNs also correspond to these entries, as shown in <xref ref-type="fig" rid="F7">Figure 7</xref> and <xref ref-type="sec" rid="s11">Supplementary Figures S20&#x2013;S25</xref>. In addition, 264 (32.16%) FNs were found with only IDRs, and 104 (12.67%) FNs were found with transmembrane helices without any InterPro entries.</p>
<p>We summarized the TP, TN, FP, and FN sequences of the test set annotated with InterPro entries in <xref ref-type="table" rid="T5">Table 5</xref>. For TPs, most of the sequences were found without InterPro entries, in contrast with TNs. The numbers of TPs annotated with only IDRs or with transmembrane helices also highlight the contributions of these regions to the prediction of sequences as lncRNAs. The 704 out of 1,204 (58.47%) FPs with InterPro entries and the 358 out of 821 (43.61%) FNs without InterPro entries indicate the limitations of Xlnc1DCNN. We then further analyzed the FPs and FNs misclassified by all of the top tools (Xlnc1DCNN, RNAsamba, LncADeep, lncRNA_Mdeep, and FEELnc). The 93 out of 344 (27.03%) FPs with InterPro entries and the 15 out of 105 (14.29%) FNs without InterPro entries that were misclassified by all top tools suggest sequences that are difficult to identify. Finally, the 251 out of 344 (72.97%) FPs without InterPro entries and the 90 out of 105 (85.71%) FNs with InterPro entries that were misclassified by all top tools suggest possible limitations of all top tools or inconsistent annotations across the public databases.</p>
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Summary of test set sequences annotated with InterPro entries.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Metrics</th>
<th align="center">Amount</th>
<th align="center">Found with InterPro Entries</th>
<th align="center">Found without InterPro Entries</th>
<th align="center">Contain IDRs without InterPro Entries</th>
<th align="center">Contain Transmembrane Helices without InterPro Entries</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">TP</td>
<td align="center">20,895</td>
<td align="center">1,692 (8.10%)</td>
<td align="center">19,203 (91.90%)</td>
<td align="center">9,833 (47.06%)</td>
<td align="center">7,713 (36.91%)</td>
</tr>
<tr>
<td align="left">TN</td>
<td align="center">14,087</td>
<td align="center">13,085 (92.89%)</td>
<td align="center">1,002 (7.11%)</td>
<td align="center">822 (5.84%)</td>
<td align="center">289 (2.05%)</td>
</tr>
<tr>
<td align="left">FP</td>
<td align="center">1,204</td>
<td align="center">704 (58.47%)</td>
<td align="center">500 (41.53%)</td>
<td align="center">359 (29.82%)</td>
<td align="center">161 (13.37%)</td>
</tr>
<tr>
<td align="left">FN</td>
<td align="center">821</td>
<td align="center">463 (56.39%)</td>
<td align="center">358 (43.61%)</td>
<td align="center">264 (32.16%)</td>
<td align="center">104 (12.67%)</td>
</tr>
<tr>
<td align="left">All missed FP</td>
<td align="center">344</td>
<td align="center">93 (27.03%)</td>
<td align="center">251 (72.97%)</td>
<td align="center">164 (47.67%)</td>
<td align="center">94 (27.33%)</td>
</tr>
<tr>
<td align="left">All missed FN</td>
<td align="center">105</td>
<td align="center">90 (85.71%)</td>
<td align="center">15 (14.29%)</td>
<td align="center">5 (4.76%)</td>
<td align="center">4 (3.81%)</td>
</tr>
</tbody>
</table>
</table-wrap>
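<p>The percentages in Table 5 follow directly from the reported counts. As an illustrative check, the short Python sketch below recomputes the shares of sequences with and without InterPro entries for the four prediction outcomes. The helper name share and the two-decimal rounding are assumptions made for this example; the counts are copied from Table 5.</p>
<preformat>
```python
def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# (category, total sequences, sequences with InterPro entries), from Table 5.
rows = [
    ("TP", 20895, 1692),
    ("TN", 14087, 13085),
    ("FP", 1204, 704),
    ("FN", 821, 463),
]

for name, total, with_entries in rows:
    # Sequences without InterPro entries are the complement of those with.
    without_entries = total - with_entries
    print(name, share(with_entries, total), share(without_entries, total))
```
</preformat>
<p>Because the two InterPro columns partition each category, their percentages should sum to 100 up to rounding, which gives a quick consistency check on each row.</p>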
<p>We also analyzed the contribution of each nucleotide by plotting the mean of absolute SHAP values on the test set for single nucleotides, dinucleotides, and trinucleotides (codons). A higher mean of absolute SHAP values indicates a higher impact of that genetic code (<xref ref-type="fig" rid="F8">Figure 8</xref>). For lncRNA, we found that the top three codons with the highest contribution were all stop codons (TAA, TGA, TAG), and for mRNA, the top three were a stop codon, the start codon, and an arginine codon (TGA, ATG, CGA). For dinucleotides, CG has the highest mean of absolute SHAP values for classification as mRNA, which is consistent with the findings of <xref ref-type="bibr" rid="B33">Ulveling et al. (2014)</xref>.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>Mean of absolute SHAP values for <bold>(A)</bold> single nucleotide, <bold>(B)</bold> dinucleotide, and <bold>(C)</bold> trinucleotide, indicating the impact of each genetic code on the model prediction as lncRNA or mRNA.</p>
</caption>
<graphic xlink:href="fgene-13-876721-g008.tif"/>
</fig>
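<p>The aggregation summarized in Figure 8 can be sketched as follows. This Python example averages the absolute summed SHAP value of every overlapping k-mer across a set of sequences. The function name mean_abs_shap_by_kmer, the use of overlapping windows, and taking the absolute value of each window sum are assumptions made for illustration. The exact aggregation used in this study, including the separation by predicted class, may differ.</p>
<preformat>
```python
from collections import defaultdict


def mean_abs_shap_by_kmer(sequences, shap_values, k):
    """Average the absolute summed SHAP value of every overlapping k-mer.

    sequences   : list of nucleotide strings
    shap_values : list of per-nucleotide SHAP value lists, one per sequence
    k           : 1 for single nucleotides, 2 for dinucleotides, 3 for codons
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, values in zip(sequences, shap_values):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            totals[kmer] += abs(sum(values[i:i + k]))
            counts[kmer] += 1
    return {kmer: totals[kmer] / counts[kmer] for kmer in totals}


# Hypothetical toy input (not real model output).
toy_sequences = ["ATGA"]
toy_shap = [[0.5, 0.25, -0.25, 1.0]]
print(mean_abs_shap_by_kmer(toy_sequences, toy_shap, 2))
```
</preformat>
<p>Sorting the returned dictionary by value in descending order would give a ranking of k-mers comparable in spirit to the panels of Figure 8.</p>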
<p>As recent studies found that some putative lncRNAs contain a short open reading frame (sORF) (<xref ref-type="bibr" rid="B13">Hartford and Lal, 2020</xref>), we further analyzed the association between lncRNAs and sORFs using the explanation results of Xlnc1DCNN. Some false negative sequences were randomly selected and checked for sORFs using MetamORF (<xref ref-type="bibr" rid="B7">Choteau et al., 2021</xref>). While MetamORF found sORFs in some of these sequences, the reported regions of these sORFs did not correspond to the important regions highlighted by the explanation results.</p>
</sec>
<sec id="s5">
<title>5 Conclusion</title>
<p>In this study, we proposed Xlnc1DCNN, a simple but effective 1D-CNN model for classifying and explaining lncRNA and protein-coding transcripts. We have shown that using a 1D-CNN as a feature extractor can lead to better prediction performance than other existing tools that use traditional feature extraction methods. The explanation results provided insights into what the model learned to distinguish lncRNAs from protein-coding transcripts. The transmembrane helix regions highlighted by the explanation results of several true positive lncRNA transcripts agreed with the recent findings of transmembrane microproteins within lncRNAs. Disordered proteins without any important regions highlighted in the explanation results were misclassified as lncRNAs. Several explanation results of lncRNAs misclassified as protein-coding transcripts contained important regions that correspond to protein domains or families in Pfam and/or InterPro. These insights revealed the complexity of long non-coding RNAs and the need to periodically evaluate cross-referenced gene annotations among public databases.</p>
</sec>
</body>
<back>
<sec id="s6">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="sec" rid="s11">Supplementary Materials</xref>; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>RL and DW: conceptualization, collecting resources, validation, investigation, writing&#x2014;review and editing. RL: methodology, software, formal analysis, data curation, writing&#x2014;original draft preparation, and visualization. DW: supervision, project administration, and funding acquisition. All authors contributed to manuscript revision, read, and approved the submitted version.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This work was supported by the Ratchadaphiseksomphot Endowment Fund, part of the &#x201c;Research Grant for New Scholar CU Researcher&#x2019;s Project&#x201d;, grant number RGN_2559_025_06_21.</p>
</sec>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ack>
<p>A preprint of this article was previously deposited to bioRxiv and can be found here with this doi: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1101/2022.02.11.479495">https://doi.org/10.1101/2022.02.11.479495</ext-link>
</p>
</ack>
<sec id="s11">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fgene.2022.876721/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fgene.2022.876721/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet1.PDF" id="SM1" mimetype="application/PDF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Acharya</surname>
<given-names>U. R.</given-names>
</name>
<name>
<surname>Fujita</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lih</surname>
<given-names>O. S.</given-names>
</name>
<name>
<surname>Hagiwara</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>J. H.</given-names>
</name>
<name>
<surname>Adam</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Automated Detection of Arrhythmias Using Different Intervals of Tachycardia ECG Segments with Convolutional Neural Network</article-title>. <source>Inf. Sci.</source> <volume>405</volume>, <fpage>81</fpage>&#x2013;<lpage>90</lpage>. <pub-id pub-id-type="doi">10.1016/j.ins.2017.04.012</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Anderson</surname>
<given-names>D. M.</given-names>
</name>
<name>
<surname>Anderson</surname>
<given-names>K. M.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>C.-L.</given-names>
</name>
<name>
<surname>Makarewich</surname>
<given-names>C. A.</given-names>
</name>
<name>
<surname>Nelson</surname>
<given-names>B. R.</given-names>
</name>
<name>
<surname>McAnally</surname>
<given-names>J. R.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>A Micropeptide Encoded by a Putative Long Noncoding RNA Regulates Muscle Performance</article-title>. <source>Cell</source> <volume>160</volume> (<issue>4</issue>), <fpage>595</fpage>&#x2013;<lpage>606</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2015.01.009</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blum</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>H.-Y.</given-names>
</name>
<name>
<surname>Chuguransky</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Grego</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kandasaamy</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mitchell</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>The InterPro Protein Families and Domains Database: 20 Years on</article-title>. <source>Nucleic Acids Res.</source> <volume>49</volume> (<issue>D1</issue>), <fpage>D344</fpage>&#x2013;<lpage>D354</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkaa977</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Camargo</surname>
<given-names>A. P.</given-names>
</name>
<name>
<surname>Sourkov</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Pereira</surname>
<given-names>G. A. G.</given-names>
</name>
<name>
<surname>Carazzolle</surname>
<given-names>M. F.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>RNAsamba: Neural Network-Based Assessment of the Protein-Coding Potential of RNA Sequences</article-title>. <source>NAR Genomics and Bioinformatics</source> <volume>2</volume> (<issue>1</issue>), <fpage>lqz024</fpage>. <pub-id pub-id-type="doi">10.1093/nargab/lqz024</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tay</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Noncoding RNA:RNA Regulatory Networks in Cancer</article-title>. <source>Int. J. Mol. Sci.</source> <volume>19</volume> (<issue>5</issue>), <fpage>1310</fpage>. <pub-id pub-id-type="doi">10.3390/ijms19051310</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Aravindakshan</surname>
<given-names>J. P.</given-names>
</name>
<name>
<surname>Gotlieb</surname>
<given-names>W. H.</given-names>
</name>
<name>
<surname>Sairam</surname>
<given-names>M. R.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Anti-proliferative and Pro-apoptotic Actions of a Novel Human and Mouse Ovarian Tumor-Associated Gene OTAG-12: Downregulation, Alternative Splicing and Drug Sensitization</article-title>. <source>Oncogene</source> <volume>30</volume> (<issue>25</issue>), <fpage>2874</fpage>&#x2013;<lpage>2887</lpage>. <pub-id pub-id-type="doi">10.1038/onc.2011.11</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lundberg</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>S.-I.</given-names>
</name>
</person-group> (<year>2019</year>). <source>Explaining Models by Propagating Shapley Values of Local Components</source>. </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Choteau</surname>
<given-names>S. A.</given-names>
</name>
<name>
<surname>Wagner</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pierre</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Spinelli</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Brun</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>MetamORF: a Repository of Unique Short Open Reading Frames Identified by Both Experimental and Computational Approaches for Gene and Metagene Analyses</article-title>. <source>Database</source> <volume>2021</volume>, <fpage>baab032</fpage>. <pub-id pub-id-type="doi">10.1093/database/baab032</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cunningham</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Achuthan</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Akanni</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Allen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Amode</surname>
<given-names>M. R.</given-names>
</name>
<name>
<surname>Armean</surname>
<given-names>I. M.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Ensembl 2019</article-title>. <source>Nucleic Acids Res.</source> <volume>47</volume> (<issue>D1</issue>), <fpage>D745</fpage>&#x2013;<lpage>D751</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gky1113</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fan</surname>
<given-names>X.-N.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S.-W.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S.-Y.</given-names>
</name>
<name>
<surname>Ni</surname>
<given-names>J.-J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>lncRNA_Mdeep: An Alignment-free Predictor for Distinguishing Long Non-coding RNAs from Protein-Coding Transcripts by Multimodal Deep Learning</article-title>. <source>Int. J. Mol. Sci.</source> <volume>21</volume> (<issue>15</issue>), <fpage>5222</fpage>. <pub-id pub-id-type="doi">10.3390/ijms21155222</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Fullwood</surname>
<given-names>M. J.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Roles, Functions, and Mechanisms of Long Non-coding RNAs in Cancer</article-title>. <source>Genomics, Proteomics &#x26; Bioinformatics</source> <volume>14</volume> (<issue>1</issue>), <fpage>42</fpage>&#x2013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1016/j.gpb.2015.09.006</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Frankish</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Diekhans</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ferreira</surname>
<given-names>A.-M.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Jungreis</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Loveland</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>GENCODE Reference Annotation for the Human and Mouse Genomes</article-title>. <source>Nucleic Acids Res.</source> <volume>47</volume> (<issue>D1</issue>), <fpage>D766</fpage>&#x2013;<lpage>D773</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gky955</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>J.-C.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>S.-S.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.-H.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>CNIT: a Fast and Accurate Web Tool for Identifying Protein-Coding and Long Non-coding Transcripts Based on Intrinsic Sequence Composition</article-title>. <source>Nucleic Acids Res.</source> <volume>47</volume> (<issue>W1</issue>), <fpage>W516</fpage>&#x2013;<lpage>W522</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkz400</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hartford</surname>
<given-names>C. C. R.</given-names>
</name>
<name>
<surname>Lal</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>When Long Noncoding Becomes Protein Coding</article-title>. <source>Mol. Cel Biol</source> <volume>40</volume> (<issue>6</issue>), <fpage>e00528</fpage>&#x2013;<lpage>00519</lpage>. <pub-id pub-id-type="doi">10.1128/MCB.00528-19</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hsieh</surname>
<given-names>C.-H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.-S.</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>B.-J.</given-names>
</name>
<name>
<surname>Hsiao</surname>
<given-names>C.-H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Detection of Atrial Fibrillation Using 1D Convolutional Neural Network</article-title>. <source>Sensors</source> <volume>20</volume> (<issue>7</issue>), <fpage>2136</fpage>. <pub-id pub-id-type="doi">10.3390/s20072136</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jin</surname>
<given-names>K.-T.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>J.-Y.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>X.-L.</given-names>
</name>
<name>
<surname>Di</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>Y.-Y.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Roles of lncRNAs in Cancer: Focusing on Angiogenesis</article-title>. <source>Life Sci.</source> <volume>252</volume>, <fpage>117647</fpage>. <pub-id pub-id-type="doi">10.1016/j.lfs.2020.117647</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kang</surname>
<given-names>Y.-J.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>D.-C.</given-names>
</name>
<name>
<surname>Kong</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hou</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>Y.-Q.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>CPC2: a Fast and Accurate Coding Potential Calculator Based on Sequence Intrinsic Features</article-title>. <source>Nucleic Acids Res.</source> <volume>45</volume> (<issue>W1</issue>), <fpage>W12</fpage>&#x2013;<lpage>W16</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkx428</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kiranyaz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Avci</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Abdeljaber</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Ince</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Gabbouj</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Inman</surname>
<given-names>D. J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>1D Convolutional Neural Networks and Applications: A Survey</article-title>. <source>Mech. Syst. Signal Process.</source> <volume>151</volume>, <fpage>107398</fpage>. <pub-id pub-id-type="doi">10.1016/j.ymssp.2020.107398</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krogh</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Larsson</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>von Heijne</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Sonnhammer</surname>
<given-names>E. L.</given-names>
</name>
</person-group> (<year>2001</year>). <article-title>Predicting Transmembrane Protein Topology With a Hidden Markov Model: Application to Complete Genomes</article-title>. <source>J. Mol. Biol.</source> <volume>305</volume> (<issue>3</issue>), <fpage>567</fpage>&#x2013;<lpage>580</lpage>. <pub-id pub-id-type="doi">10.1006/jmbi.2000.4315</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>PLEK: a Tool for Predicting Long Non-coding RNAs and Messenger RNAs Based on an Improved K-Mer Scheme</article-title>. <source>BMC Bioinformatics</source> <volume>15</volume> (<issue>1</issue>), <fpage>311</fpage>. <pub-id pub-id-type="doi">10.1186/1471-2105-15-311</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Kong</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Feature Extraction and Classification of Heart Sound Using 1D Convolutional Neural Networks</article-title>. <source>EURASIP J. Adv. Signal Process.</source> <volume>2019</volume> (<issue>1</issue>), <fpage>59</fpage>. <pub-id pub-id-type="doi">10.1186/s13634-019-0651-3</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>Cd-hit: a Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences</article-title>. <source>Bioinformatics</source> <volume>22</volume> (<issue>13</issue>), <fpage>1658</fpage>&#x2013;<lpage>1659</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btl158</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lundberg</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>S.-I.</given-names>
</name>
</person-group> (<year>2017</year>). <source>A Unified Approach to Interpreting Model Predictions</source>, <fpage>4765</fpage>&#x2013;<lpage>4774</lpage>. </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Makarewich</surname>
<given-names>C. A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>The Hidden World of Membrane Microproteins</article-title>. <source>Exp. Cell Res.</source> <volume>388</volume> (<issue>2</issue>), <fpage>111853</fpage>. <pub-id pub-id-type="doi">10.1016/j.yexcr.2020.111853</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marchese</surname>
<given-names>F. P.</given-names>
</name>
<name>
<surname>Raimondi</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Huarte</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>The Multidimensional Mechanisms of Long Noncoding RNA Function</article-title>. <source>Genome Biol.</source> <volume>18</volume> (<issue>1</issue>), <fpage>206</fpage>. <pub-id pub-id-type="doi">10.1186/s13059-017-1348-2</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Matsumoto</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Nakayama</surname>
<given-names>K. I.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Hidden Peptides Encoded by Putative Noncoding RNAs</article-title>. <source>Cell Struct. Funct.</source> <volume>43</volume> (<issue>1</issue>), <fpage>75</fpage>&#x2013;<lpage>83</lpage>. <pub-id pub-id-type="doi">10.1247/csf.18005</pub-id> </citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mistry</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chuguransky</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Qureshi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Salazar</surname>
<given-names>G. A.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Pfam: The Protein Families Database in 2021</article-title>. <source>Nucleic Acids Res.</source> <volume>49</volume> (<issue>D1</issue>), <fpage>D412</fpage>&#x2013;<lpage>D419</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkaa913</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mor&#xe1;n</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Akerman</surname>
<given-names>&#x130;.</given-names>
</name>
<name>
<surname>van de Bunt</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Benazra</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Nammo</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2012</year>). <article-title>Human &#x3b2; Cell Transcriptome Analysis Uncovers lncRNAs that Are Tissue-specific, Dynamically Regulated, and Abnormally Expressed in Type 2 Diabetes</article-title>. <source>Cell Metab.</source> <volume>16</volume> (<issue>4</issue>), <fpage>435</fpage>&#x2013;<lpage>448</lpage>. <pub-id pub-id-type="doi">10.1016/j.cmet.2012.08.010</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Piovesan</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Necci</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Escobedo</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Monzon</surname>
<given-names>A. M.</given-names>
</name>
<name>
<surname>Hatos</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mi&#x10d;eti&#x107;</surname>
<given-names>I.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>MobiDB: Intrinsically Disordered Proteins in 2021</article-title>. <source>Nucleic Acids Res.</source> <volume>49</volume> (<issue>D1</issue>), <fpage>D361</fpage>&#x2013;<lpage>D367</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkaa1058</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ribeiro</surname>
<given-names>M. T.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Guestrin</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>"Why Should I Trust You?": Explaining the Predictions of Any Classifier</article-title>,&#x201d; in <conf-name>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</conf-name>, <conf-loc>San Francisco, California, USA</conf-loc> (<publisher-name>Association for Computing Machinery</publisher-name>). <pub-id pub-id-type="doi">10.1145/2939672.2939778</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rinn</surname>
<given-names>J. L.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>H. Y.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Genome Regulation by Long Noncoding RNAs</article-title>. <source>Annu. Rev. Biochem.</source> <volume>81</volume> (<issue>1</issue>), <fpage>145</fpage>&#x2013;<lpage>166</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-biochem-051410-092902</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Shrikumar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Greenside</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Kundaje</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Learning Important Features through Propagating Activation Differences</article-title>,&#x201d; in <conf-name>Proceedings of the 34th International Conference on Machine Learning - Volume 70</conf-name>, <conf-loc>Sydney, NSW, Australia</conf-loc> (<publisher-name>JMLR.org</publisher-name>). <pub-id pub-id-type="doi">10.48550/arXiv.1704.02685</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Stark</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Grzelak</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hadfield</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>RNA Sequencing: the Teenage Years</article-title>. <source>Nat. Rev. Genet.</source> <volume>20</volume> (<issue>11</issue>), <fpage>631</fpage>&#x2013;<lpage>656</lpage>. <pub-id pub-id-type="doi">10.1038/s41576-019-0150-2</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Statello</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>C.-J.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>L.-L.</given-names>
</name>
<name>
<surname>Huarte</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Gene Regulation by Long Non-coding RNAs and its Biological Functions</article-title>. <source>Nat. Rev. Mol. Cell Biol.</source> <volume>22</volume> (<issue>2</issue>), <fpage>96</fpage>&#x2013;<lpage>118</lpage>. <pub-id pub-id-type="doi">10.1038/s41580-020-00315-9</pub-id> </citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tjoa</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Guan</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> <volume>32</volume> (<issue>11</issue>), <fpage>4793</fpage>&#x2013;<lpage>4813</lpage>. <pub-id pub-id-type="doi">10.1109/tnnls.2020.3027314</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ulveling</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Dinger</surname>
<given-names>M. E.</given-names>
</name>
<name>
<surname>Francastel</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Hub&#xe9;</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Identification of a Dinucleotide Signature that Discriminates Coding from Non-coding Long RNAs</article-title>. <source>Front. Genet.</source> <volume>5</volume>, <fpage>316</fpage>. <pub-id pub-id-type="doi">10.3389/fgene.2014.00316</pub-id> </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Volders</surname>
<given-names>P.-J.</given-names>
</name>
<name>
<surname>Anckaert</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Verheggen</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Nuytens</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Martens</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Mestdagh</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>LNCipedia 5: towards a Reference Set of Human Long Non-coding RNAs</article-title>. <source>Nucleic Acids Res.</source> <volume>47</volume> (<issue>D1</issue>), <fpage>D135</fpage>&#x2013;<lpage>D139</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gky1031</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Gerstein</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Snyder</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>RNA-seq: a Revolutionary Tool for Transcriptomics</article-title>. <source>Nat. Rev. Genet.</source> <volume>10</volume> (<issue>1</issue>), <fpage>57</fpage>&#x2013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.1038/nrg2484</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>H. J.</given-names>
</name>
<name>
<surname>Dasari</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kocher</surname>
<given-names>J.-P.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>CPAT: Coding-Potential Assessment Tool Using an Alignment-free Logistic Regression Model</article-title>. <source>Nucleic Acids Res.</source> <volume>41</volume> (<issue>6</issue>), <fpage>e74</fpage>. <pub-id pub-id-type="doi">10.1093/nar/gkt006</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wucher</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Legeai</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>H&#xe9;dan</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Rizk</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Lagoutte</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Leeb</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>FEELnc: a Tool for Long Non-coding RNA Annotation and its Application to the Dog Transcriptome</article-title>. <source>Nucleic Acids Res.</source> <volume>45</volume> (<issue>8</issue>), <fpage>e57</fpage>. <pub-id pub-id-type="doi">10.1093/nar/gkw1306</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yamashita</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Nishio</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Do</surname>
<given-names>R. K. G.</given-names>
</name>
<name>
<surname>Togashi</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Convolutional Neural Networks: an Overview and Application in Radiology</article-title>. <source>Insights Imaging</source> <volume>9</volume> (<issue>4</issue>), <fpage>611</fpage>&#x2013;<lpage>629</lpage>. <pub-id pub-id-type="doi">10.1007/s13244-018-0639-9</pub-id> </citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>M. D.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>LncADeep: an ab initio lncRNA Identification and Functional Annotation Tool Based on Deep Learning</article-title>. <source>Bioinformatics</source> <volume>34</volume> (<issue>22</issue>), <fpage>3825</fpage>&#x2013;<lpage>3834</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty428</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>