<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article article-type="methods-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">896925</article-id>
<article-id pub-id-type="doi">10.3389/fgene.2022.896925</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Methods</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>i5hmCVec: Identifying 5-Hydroxymethylcytosine Sites of <italic>Drosophila</italic> RNA Using Sequence Feature Embeddings</article-title>
<alt-title alt-title-type="left-running-head">Liu and Du</alt-title>
<alt-title alt-title-type="right-running-head">i5hmcVec</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Hang-Yu</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1722456/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Du</surname>
<given-names>Pu-Feng</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/778584/overview"/>
</contrib>
</contrib-group>
<aff>
<institution>College of Intelligence and Computing</institution>, <institution>Tianjin University</institution>, <addr-line>Tianjin</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/142613/overview">Hongmin Cai</ext-link>, South China University of Technology, China</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/808177/overview">Wang-Ren Qiu</ext-link>, Jingdezhen Ceramic Institute, China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/611410/overview">Yan Xu</ext-link>, University of Science and Technology Beijing, China</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Pu-Feng Du, <email>pdu@tju.edu.cn</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>03</day>
<month>05</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>896925</elocation-id>
<history>
<date date-type="received">
<day>15</day>
<month>03</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Liu and Du.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Liu and Du</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>5-Hydroxymethylcytosine (5hmC), one of the most important RNA modifications, plays an important role in many biological processes. Accurately identifying RNA modification sites helps in understanding the function of RNA modifications. In this work, we propose a computational method for identifying 5hmC-modified regions using machine learning algorithms. We applied a sequence feature embedding method based on the dna2vec algorithm to represent the RNA sequence. The results showed that our model performs better than state-of-the-art methods. All datasets and source code used in this study are available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/liu-h-y/5hmC_model">https://github.com/liu-h-y/5hmC_model</ext-link>.</p>
</abstract>
<kwd-group>
<kwd>5-hydroxymethylcytosine</kwd>
<kwd>dna2vec</kwd>
<kwd>machine learning</kwd>
<kwd>cross-validation</kwd>
<kwd>i5hmcVec</kwd>
</kwd-group>
<contract-num rid="cn001">61872268</contract-num>
<contract-num rid="cn002">2018YFC0910405</contract-num>
<contract-sponsor id="cn001">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content>
</contract-sponsor>
<contract-sponsor id="cn002">National Key Research and Development Program of China<named-content content-type="fundref-id">10.13039/501100012166</named-content>
</contract-sponsor>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>Posttranscriptional modifications have been extensively studied over the last few years. More than 160 types of modification have been identified across all kingdoms of life (<xref ref-type="bibr" rid="B5">Boccaletto et al., 2018</xref>). Posttranscriptional modifications play important roles in various biological processes, such as RNA degradation (<xref ref-type="bibr" rid="B38">Sommer et al., 1978</xref>), RNA splicing (<xref ref-type="bibr" rid="B26">Lindstrom et al., 2003</xref>), and transcriptional regulation (<xref ref-type="bibr" rid="B8">Cowling, 2009</xref>). To understand the mechanisms of RNA modifications, it is important to pinpoint the modification sites in the RNA sequences (<xref ref-type="bibr" rid="B13">Dominissini et al., 2012</xref>; <xref ref-type="bibr" rid="B29">Meyer et al., 2012</xref>).</p>
<p>With the rapid development of high-throughput technology, several experimental methods for identifying RNA modification sites have been developed, such as MeRIP (<xref ref-type="bibr" rid="B29">Meyer et al., 2012</xref>) and m6A-seq (<xref ref-type="bibr" rid="B13">Dominissini et al., 2012</xref>). These methods are better at detecting modified transcripts, or modified regions on the transcripts, than at pinpointing the exact modification sites. With the advances in modern life sciences, especially the cross-linking technology, methods for identifying RNA modification sites at single-base resolution were also proposed, including miCLIP (<xref ref-type="bibr" rid="B25">Linder et al., 2015</xref>), PA-m6A-seq (<xref ref-type="bibr" rid="B21">Kai Chen et al., 2015</xref>), and m7G-MeRIP-seq (<xref ref-type="bibr" rid="B46">Zhang et al., 2019</xref>). However, these experimental methods are still costly and time-consuming. Therefore, computational methods have been proposed as alternative approaches. A series of bioinformatics tools using machine learning algorithms for predicting m6A (<xref ref-type="bibr" rid="B41">Wei Chen et al., 2015</xref>; <xref ref-type="bibr" rid="B47">Zhou et al., 2016</xref>; <xref ref-type="bibr" rid="B18">Huang et al., 2018</xref>; <xref ref-type="bibr" rid="B24">Kunqi Chen et al., 2019</xref>; <xref ref-type="bibr" rid="B48">Zou et al., 2019</xref>), m5C (<xref ref-type="bibr" rid="B35">Qiu et al., 2017</xref>; <xref ref-type="bibr" rid="B36">Sabooh et al., 2018</xref>; <xref ref-type="bibr" rid="B2">Akbar et al., 2020</xref>; <xref ref-type="bibr" rid="B14">Dou et al., 2020</xref>), m7G (<xref ref-type="bibr" rid="B42">Wei Chen et al., 2019</xref>; <xref ref-type="bibr" rid="B27">Liu X. et al., 2020</xref>; <xref ref-type="bibr" rid="B43">Yang et al., 2020</xref>; <xref ref-type="bibr" rid="B9">Dai et al., 2021</xref>), and many others have been developed. 
A recent review article has elaborated on the differences between these studies, in the aspect of benchmarking datasets, feature encoding schemes, and the main algorithms (<xref ref-type="bibr" rid="B7">Chen et al., 2020</xref>).</p>
<p>5-Hydroxymethylcytosine (5hmC) plays a key role in various cellular processes. 5hmC modification exists on both RNA and DNA sequences (<xref ref-type="bibr" rid="B45">Zhang et al., 2016</xref>). Most of the existing studies focused on the DNA 5hmC modifications (<xref ref-type="bibr" rid="B39">Szwagierczak et al., 2010</xref>; <xref ref-type="bibr" rid="B34">Pastor et al., 2011</xref>; <xref ref-type="bibr" rid="B44">Yu et al., 2012</xref>; <xref ref-type="bibr" rid="B4">Bachman et al., 2014</xref>). The RNA 5hmC modifications were much less studied (<xref ref-type="bibr" rid="B15">Fu et al., 2014</xref>; <xref ref-type="bibr" rid="B20">Huber et al., 2015</xref>; <xref ref-type="bibr" rid="B11">Delatte et al., 2016</xref>; <xref ref-type="bibr" rid="B30">Miao et al., 2016</xref>). Fu et al. first found that m5C sites can be oxidized by the Tet enzyme to form 5hmC sites with a ratio of about 0.02% <italic>in vitro</italic> in mammalian RNA (<xref ref-type="bibr" rid="B15">Fu et al., 2014</xref>). They also found that Tet-mediated oxidation of m5C in RNA is much less efficient than that in DNA (<xref ref-type="bibr" rid="B15">Fu et al., 2014</xref>). Huber et al. verified that 5hmC is the result of m5C oxidation <italic>in vivo</italic> in a mouse model using an isotope-tracing methodology (<xref ref-type="bibr" rid="B20">Huber et al., 2015</xref>). They also found that in worms and plants, the formation of 5hmC in RNA does not require a Tet-mediated oxidation mechanism. <xref ref-type="bibr" rid="B30">Miao et al. (2016)</xref> found that 5hmC in RNA is rich in the mouse brain, which is potentially related to brain functions. <xref ref-type="bibr" rid="B11">Delatte et al. (2016</xref>) systematically identified 5hmC modifications in the <italic>Drosophila</italic> transcriptome using the hMeRIP-seq method. Using the data from Delatte et al., Liu et al. developed a predictor, iRNA5hmC, for computationally identifying 5hmC modifications with machine learning algorithms (<xref ref-type="bibr" rid="B28">Liu Y. et al., 2020</xref>). Ahmed et al. also constructed a predictor, iRNA5hmC-PS (<xref ref-type="bibr" rid="B1">Ahmed et al., 2020</xref>), by using position-specific binary indicators of RNA sequences. However, Delatte et al. did not provide the exact location of 5hmC modification sites in the transcriptome (<xref ref-type="bibr" rid="B11">Delatte et al., 2016</xref>). Liu et al. assigned exact locations by randomly selecting cytosine sites within the peak regions detected by MeRIP-seq (<xref ref-type="bibr" rid="B28">Liu Y. et al., 2020</xref>). However, such a strategy may lead to many false-positive samples (<xref ref-type="bibr" rid="B24">Kunqi Chen et al., 2019</xref>). To avoid such uncertainty, we proposed a model based on low-resolution data.</p>
<p>The rapid development of deep learning has promoted natural language processing studies. Word2vec is a remarkable achievement in natural language processing technology (<xref ref-type="bibr" rid="B31">Mikolov et al., 2013</xref>). The core idea of word2vec is the distributed representation of words, in which the representation of a word is inferred from its context. With the development of high-throughput sequencing technology, the sequencing quality of biological sequences can be guaranteed. Therefore, some researchers in bioinformatics regard a biological sequence as a sentence and its k-mers as words. The word2vec method can then be applied to represent the biological sequences. Asgari and Mofrad proposed BioVec based on the skip-gram model for biological sequence representation (<xref ref-type="bibr" rid="B3">Asgari and Mofrad, 2015</xref>). Kimothi et al. developed a model named seq2vec based on doc2vec, which is an extension of the original word2vec (<xref ref-type="bibr" rid="B23">Kimothi et al., 2016</xref>). The dna2vec model is dedicated to representing variable-length words (<xref ref-type="bibr" rid="B32">Ng, 2017a</xref>). It has been applied to several topics in bioinformatics. For example, Deng et al. proposed D2VCB for predicting protein&#x2013;DNA-binding sites based on k-mer embeddings (<xref ref-type="bibr" rid="B12">Deng et al., 2019</xref>). Hong et al. applied the pretrained k-mer embeddings to encode enhancers and promoters (<xref ref-type="bibr" rid="B16">Hong et al., 2020</xref>). We employed the dna2vec embeddings to represent k-mers of <italic>Drosophila</italic> genomic sequences.</p>
<p>In this study, we represent the RNA sequences by using feature embeddings. We applied an SVM classifier to create a model for predicting 5hmC modification sites. Our model was trained on the low-resolution modification datasets, which are more reliable than the 1-base resolution set. The results suggest that our model is effective in identifying 5hmC sites.</p>
</sec>
<sec sec-type="materials|methods" id="s2">
<title>Materials and Methods</title>
<sec id="s2-1">
<title>Datasets</title>
<p>In this study, we constructed the benchmarking dataset according to the experimental results from <xref ref-type="bibr" rid="B11">Delatte et al. (2016</xref>). Their results contain 3058 peak regions distributed across chr2L, chr2R, chr3L, chr3R, chr4, chrX, chr2RHet, chr3LHet, chr3RHet, chrYHet, chrU, and chrUextra. According to <xref ref-type="bibr" rid="B17">Hoskins et al. (2015</xref>), the genome sequences are of high quality on chr2L, chr2R, chr3L, chr3R, chr4, and chrX, while the remaining chromosome sequences are of low quality. Therefore, we only used the sequence data from chr2L, chr2R, chr3L, chr3R, chr4, and chrX. This left 2616 peak regions containing 5hmC modification sites. Subsequently, we obtained the transcription direction of every region by querying the UCSC genome browser tracks (<xref ref-type="bibr" rid="B22">Karolchik et al., 2003</xref>). These 2616 regions were curated as positive samples. Non-peak regions within transcripts carrying peak regions were curated as negative samples. The non-peak regions were cropped to the same lengths as the peak regions in a one-vs.-one strategy. In total, 2616 positive samples and 2616 negative samples were curated. The density distribution of sequence lengths is plotted in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Sequence length distribution. The X-axis represents the length of sequences. The Y-axis represents the density of distribution. <bold>(A)</bold> Histogram density for the distribution of the length of positive sequences. <bold>(B)</bold> Histogram density for the distribution of the length of negative sequences.</p>
</caption>
<graphic xlink:href="fgene-13-896925-g001.tif"/>
</fig>
</sec>
<sec id="s2-2">
<title>
<italic>K</italic>-Mer Embeddings</title>
<p>
<italic>K</italic>-mer is a common and efficient way to represent RNA sequences, which divides the biological sequences into short segments of length <italic>k</italic>. We employed <italic>k</italic>-mer embeddings for representing the <italic>k</italic>-mers instead of one-hot encoding. <italic>K</italic>-mer embeddings can capture semantic and linguistic analogies and avoid the curse of dimensionality (<xref ref-type="bibr" rid="B31">Mikolov et al., 2013</xref>). The dna2vec model was used in this study for training <italic>k</italic>-mer embeddings (<xref ref-type="bibr" rid="B32">Ng, 2017a</xref>). The corpus was collected from the dm3 (<xref ref-type="bibr" rid="B22">Karolchik et al., 2003</xref>) genome assembly. We selected six high-quality chromosome sequences from dm3: chr2L, chr2R, chr3L, chr3R, chr4, and chrX. The corpus was used as the input of dna2vec. <italic>K</italic>-mer embeddings were obtained by training dna2vec. Let <italic>p</italic> (<italic>k</italic>, <italic>i</italic>) (<italic>i</italic> &#x3d; 1, 2,&#x2026;, 4<sup>
<italic>k</italic>
</sup>) represent the <italic>i</italic>-th type <italic>k</italic>-mer fragment. The process of the dna2vec model can be expressed as follows:<disp-formula id="e1">
<mml:math id="m1">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mo>&#x2192;</mml:mo>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mover>
</mml:mrow>
<mml:mi mathvariant="bold">v</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>where <italic>h</italic>(.) is the mapping from a <italic>k</italic>-mer fragment to <italic>k</italic>-mer embedding and <bold>v</bold>(<italic>p</italic> (<italic>k</italic>, <italic>i</italic>)) is the embedding vector of the <italic>i</italic>-th type of <italic>k</italic>-mer. In this study, we chose <italic>k</italic> from 3 to 8. The dimension of <bold>v</bold>(<italic>p</italic> (<italic>k</italic>, <italic>i</italic>)) was set to 100.</p>
</sec>
<sec id="s2-3">
<title>Distribution Representation of RNA Sequences</title>
<p>Given an RNA sequence <italic>r</italic> with length <italic>l</italic>, it can be represented as follows:<disp-formula id="e2">
<mml:math id="m2">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x22ef;</mml:mo>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>l</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>where <italic>n<sub>u</sub>
</italic> (<italic>u</italic> &#x3d; 1, 2,&#x2026;, <italic>l</italic>) represents the <italic>u</italic>-th nucleotide in the RNA sequence. The RNA sequences are segmented into <italic>k</italic>-mers in an overlapping way. For example, we convert AUAGC into three 3-mers: &#x201c;AUA,&#x201d; &#x201c;UAG,&#x201d; &#x201c;AGC.&#x201d; Therefore, sequence <italic>r</italic> segmented into <italic>k</italic>-mers can be represented as follows:<disp-formula id="e3">
<mml:math id="m3">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x002B;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>where <italic>w</italic>
<sub>
<italic>j</italic>
</sub> (<italic>j</italic> &#x3d; 1, 2,&#x2026;, <italic>l&#x2212;k&#x002B;1</italic>) &#x2208; {<italic>p</italic>(<italic>k</italic>, <italic>i</italic>) &#x7c;<italic>k</italic> &#x3d; 3, 4,&#x2026;, 8, <italic>i</italic> &#x3d; 1, 2,&#x2026;, 4<sup>
<italic>k</italic>
</sup>}. Each <italic>k</italic>-mer fragment of the RNA sequence can be considered as an RNA word. With the mapping <italic>h</italic>(.) from dna2vec, each <italic>w</italic>
<sub>
<italic>i</italic>
</sub> was converted into the corresponding embedding vector. Sequence <italic>r</italic> can be expressed in a matrix as follows:<disp-formula id="e4">
<mml:math id="m4">
<mml:mrow>
<mml:mi mathvariant="bold">E</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="italic">r</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi mathvariant="normal">&#x3d;</mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">v</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">v</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mo>&#x2026;</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">v</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="italic">w</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x002B;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
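<p>The overlapping segmentation in Eqs 2 and 3 can be illustrated with a minimal Python sketch. The function name <italic>split_into_kmers</italic> is hypothetical and is not taken from the released source code; it only reproduces the AUAGC example given above.</p>
<preformat>
```python
def split_into_kmers(sequence, k):
    """Return the overlapping k-mers w_1, ..., w_(l-k+1) of a sequence."""
    # A sequence shorter than k yields no k-mers at all.
    count = max(len(sequence) - k + 1, 0)
    return [sequence[j:j + k] for j in range(count)]


# The worked example from the text: AUAGC gives three 3-mers.
kmers = split_into_kmers("AUAGC", 3)
print(kmers)
```
</preformat>
<p>A sequence of length <italic>l</italic> therefore always produces <italic>l&#x2212;k&#x002B;1</italic> words, which is the number of columns of the matrix in Eq. 4.</p>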
<p>Since dna2vec was trained on a corpus of DNA sequences, the <italic>k-</italic>mers from dna2vec do not contain uracil. We therefore replaced thymine with uracil in the <italic>k</italic>-mers so that the mapping can be applied to RNA sequences. Considering that the sum of dna2vec embeddings along the sequence is related to concatenating <italic>k</italic>-mers (<xref ref-type="bibr" rid="B33">Ng, 2017b</xref>), we average the embedding vectors in <bold>E</bold>(<italic>r</italic>, <italic>k</italic>) to represent the sequence <italic>r</italic>, as follows:<disp-formula id="e5">
<mml:math id="m5">
<mml:mrow>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x002B;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mi mathvariant="bold">v</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x002B;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
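<p>As a hedged illustration of Eq. 5, the sketch below averages the embeddings of all overlapping <italic>k</italic>-mers of a sequence. The names <italic>to_dna</italic>, <italic>mean_embedding</italic>, and <italic>toy_lookup</italic> are hypothetical, and the two-dimensional vectors are made-up values standing in for trained dna2vec embeddings. The sketch converts U to T on the query side, which we assume is equivalent to relabelling the DNA <italic>k</italic>-mers with uracil.</p>
<preformat>
```python
def to_dna(kmer):
    """Map an RNA k-mer onto the DNA alphabet used by the embedding table."""
    return kmer.upper().replace("U", "T")


def mean_embedding(sequence, k, lookup):
    """Average the embedding vectors of all overlapping k-mers, as in e(r, k)."""
    count = max(len(sequence) - k + 1, 0)
    kmers = [sequence[j:j + k] for j in range(count)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    # A k-mer missing from the table raises KeyError (no fallback is assumed).
    vectors = [lookup[to_dna(w)] for w in kmers]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy two-dimensional embeddings; real dna2vec vectors are assumed to be longer.
toy_lookup = {"ATA": [1.0, 0.0], "TAG": [0.0, 1.0], "AGC": [2.0, 2.0]}
e_r3 = mean_embedding("AUAGC", 3, toy_lookup)
print(e_r3)
```
</preformat>
<p>Because the sum is divided by the number of <italic>k</italic>-mers, sequences of different lengths are mapped to vectors of the same dimension and comparable scale.</p>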
<p>In this study, we chose <italic>k</italic> &#x3d; 3, 4, 5, 6, 7, and 8. The final feature vector is formed by concatenating <bold>e</bold>(<italic>r</italic>, <italic>k</italic>) with different <italic>k</italic>, as follows:<disp-formula id="e6">
<mml:math id="m6">
<mml:mrow>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mo>&#x2026;</mml:mo>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>8</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
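<p>The concatenation in Eq. 6 can be sketched as follows. With six values of <italic>k</italic> and 100-dimensional embeddings, the final feature vector has 600 components. The function name <italic>concatenate_features</italic> and the constant placeholder vectors are assumptions made only for illustration.</p>
<preformat>
```python
def concatenate_features(per_k_vectors, k_values=(3, 4, 5, 6, 7, 8)):
    """Stack the averaged vectors e(r, k) for each k into one feature vector."""
    feature = []
    for k in k_values:
        feature.extend(per_k_vectors[k])
    return feature


# Placeholder vectors: each e(r, k) is filled with the value k for readability.
toy_vectors = {k: [float(k)] * 100 for k in range(3, 9)}
e_r = concatenate_features(toy_vectors)
print(len(e_r))
```
</preformat>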
</sec>
<sec id="s2-4">
<title>Model Construction Algorithm</title>
<p>We evaluated three machine learning algorithms in this task: SVM, CNN, and the C4.5 classification tree. For the SVM classifier, we applied the radial basis function (RBF) kernel, as follows:<disp-formula id="e7">
<mml:math id="m7">
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mi mathvariant="bold">-</mml:mi>
<mml:msub>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>where <italic>&#x3b3;</italic> is a kernel parameter and &#x7c;&#x7c;.&#x7c;&#x7c; is the vector norm operator.</p>
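<p>For concreteness, the RBF kernel in Eq. 7 can be computed as in the following minimal sketch, assuming the Euclidean norm. The function name <italic>rbf_kernel</italic> and the example value of <italic>&#x3b3;</italic> are illustrative assumptions, not the settings used in this study.</p>
<preformat>
```python
import math


def rbf_kernel(e_i, e_j, gamma):
    """Radial basis function kernel exp(-gamma * ||e_i - e_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(e_i, e_j))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum similarity of 1.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
# Squared distance 2 with gamma 0.5 gives exp(-1).
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
print(same, apart)
```
</preformat>
<p>A larger <italic>&#x3b3;</italic> makes the kernel value decay faster with distance, so each training sample influences a smaller neighbourhood.</p>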
<p>For the CNN classifier, a max-pooling layer and a dropout layer are used to reduce over-fitting. A fully connected layer followed by a sigmoid function produces the output. We used stochastic gradient descent to optimize the parameters (<xref ref-type="bibr" rid="B6">Bottou, 2012</xref>). The binary cross-entropy function is used as the loss function (<xref ref-type="bibr" rid="B10">de Boer et al., 2005</xref>), as follows:<disp-formula id="e8">
<mml:math id="m8">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi mathvariant="normal">&#x3b8;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>N</mml:mi>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mi mathvariant="normal">log</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>&#x3b8;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi mathvariant="normal">log</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>&#x3b8;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mstyle>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(8)</label>
</disp-formula>where <italic>y<sub>i</sub>
</italic> is the label of the <italic>i</italic>-th sample, <italic>h</italic><sub>
<italic>&#x3b8;</italic>
</sub>(<bold>e</bold><sub><italic>i</italic></sub>) is the output of the neural network for the <italic>i</italic>-th sample, and <italic>N</italic> is the number of samples.</p>
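<p>The binary cross-entropy loss in its conventional form, with a leading minus sign applied to both logarithmic terms, can be sketched as below. The function name <italic>binary_cross_entropy</italic> and the clipping constant <italic>eps</italic> are assumptions added for numerical safety; they are not described in the original method.</p>
<preformat>
```python
import math


def binary_cross_entropy(labels, predictions, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, predictions):
        # Clip the prediction so that log() never receives exactly 0.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# An uninformative prediction of 0.5 for every sample costs log(2) per sample.
uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
# Confident, correct predictions give a smaller loss.
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
print(uninformed, confident)
```
</preformat>
<p>The minus sign makes the loss non-negative, so minimizing it with stochastic gradient descent pushes the predicted probabilities toward the true labels.</p>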
<p>For the C4.5 algorithm, the information gain ratio for selecting appropriate features is defined as follows:<disp-formula id="e9">
<mml:math id="m9">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>r</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>V</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(9)</label>
</disp-formula>where <italic>D</italic> is the whole dataset, <italic>G</italic>(<italic>D, e</italic>
<sub>
<italic>i</italic>
</sub>) the information gain, <italic>IV</italic>(<italic>e</italic>
<sub>
<italic>i</italic>
</sub>) the intrinsic value of <italic>e</italic>
<sub>
<italic>i</italic>
</sub> (<xref ref-type="bibr" rid="B37">Salzberg, 1994</xref>), and <italic>e</italic>
<sub>
<italic>i</italic>
</sub> the <italic>i</italic>-th component of the feature vector <bold>e</bold>.</p>
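<p>The gain ratio in Eq. 9 can be illustrated for a single continuous feature with the sketch below. It assumes a binary split at a fixed threshold, which is one common way to handle continuous attributes; the function names, the threshold, and the toy data are hypothetical and do not come from the original implementation.</p>
<preformat>
```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete items."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels, threshold):
    """Information gain of a threshold split divided by its intrinsic value."""
    sides = [v > threshold for v in values]
    gain = entropy(labels)
    for side in (False, True):
        subset = [y for s, y in zip(sides, labels) if s == side]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    # The intrinsic value is the entropy of the partition itself.
    intrinsic_value = entropy(sides)
    if intrinsic_value == 0:
        return 0.0  # all samples fall on one side, so the split is useless
    return gain / intrinsic_value


# A toy feature that separates the two classes perfectly at 0.5.
ratio = gain_ratio([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], threshold=0.5)
print(ratio)
```
</preformat>
<p>Dividing by the intrinsic value penalizes splits that fragment the data into many small or very unbalanced parts, which plain information gain tends to favour.</p>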
</sec>
<sec id="s2-5">
<title>Degree of Separation</title>
<p>To measure the degree of separation in the visualization analysis, we introduced the J-score. We first define the intra-class divergence <italic>s</italic><sub><italic>w</italic></sub> and the inter-class divergence <italic>s</italic><sub><italic>b</italic></sub>, as follows:<disp-formula id="e10">
<mml:math id="m10">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>T</mml:mi>
</mml:msup>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(10)</label>
</disp-formula>
<disp-formula id="e11">
<mml:math id="m11">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="italic">s</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mstyle>
<mml:mo>&#x2b;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>-</mml:mo>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>-</mml:mo>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>-</mml:mo>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>-</mml:mo>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mstyle>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(11)</label>
</disp-formula>where<disp-formula id="e12">
<mml:math id="m12">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#x2b;</mml:mo>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(12)</label>
</disp-formula>
<disp-formula id="e13">
<mml:math id="m13">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>-</mml:mo>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">e</mml:mi>
<mml:mo>-</mml:mo>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(13)</label>
</disp-formula>where <italic>e<sub>&#x2b;</sub>(r<sub>j</sub>)</italic> is the feature vector of the <italic>j</italic>-th positive sample, <italic>e<sub>-</sub>(r<sub>j</sub>)</italic> is the feature vector of the <italic>j</italic>-th negative sample, and <italic>m<sub>&#x2b;</sub></italic> and <italic>m<sub>-</sub></italic> are the numbers of positive and negative samples, respectively.</p>
<p>The J-score can now be defined as follows:<disp-formula id="e14">
<mml:math id="m14">
<mml:mrow>
<mml:mi mathvariant="italic">J</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(14)</label>
</disp-formula>
</p>
<p>A higher J-score indicates better separation between positive and negative samples.</p>
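<p>As a minimal sketch of Eqs. 10&#x2013;14, the J-score can be computed as below. The sketch assumes that feature vectors are treated as row vectors, so that each product of a difference vector with its transpose reduces to a squared Euclidean distance. The function names <monospace>mean_vector</monospace>, <monospace>squared_distance</monospace>, and <monospace>j_score</monospace> are illustrative and are not taken from the released code.</p>
<preformat preformat-type="code">
```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors (Eqs. 12 and 13)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_distance(a: Vector, b: Vector) -> float:
    """(a - b)(a - b)^T for row vectors, i.e. the squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def j_score(positives: List[Vector], negatives: List[Vector]) -> float:
    """Ratio of inter-class to intra-class divergence (Eq. 14)."""
    mean_pos = mean_vector(positives)
    mean_neg = mean_vector(negatives)
    # Eq. 10: divergence between the two class means.
    s_b = squared_distance(mean_pos, mean_neg)
    # Eq. 11: scatter of each sample around its own class mean, summed over both classes.
    s_w = sum(squared_distance(v, mean_pos) for v in positives)
    s_w += sum(squared_distance(v, mean_neg) for v in negatives)
    return s_b / s_w


# Toy 2-dimensional example (made-up points, not data from this study).
toy_positives = [[0.0, 0.0], [0.0, 2.0]]
toy_negatives = [[4.0, 0.0], [4.0, 2.0]]
toy_j = j_score(toy_positives, toy_negatives)
```
</preformat>
<p>In this toy example the class means are 4 units apart and every sample lies 1 unit from its class mean, so the inter-class divergence is 16 and the intra-class divergence is 4.</p>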
</sec>
<sec id="s2-6">
<title>Framework of This Study</title>
<p>The framework of i5hmCVec is illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>. We obtained the <italic>k</italic>-mer embeddings using dna2vec (<xref ref-type="bibr" rid="B32">Ng, 2017a</xref>), which was trained on the <italic>Drosophila</italic> genome assembly dm3. RNA sequences were encoded using the embedding vectors of variable-length <italic>k</italic>-mers. An SVM was applied as the classifier to distinguish positive from negative samples.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Flowchart of this study. Step 1: RNA sequences are segmented into overlapping <italic>k</italic>-mers, where <italic>k</italic> &#x3d; 3, 4, 5, 6, 7, 8. Step 2: <italic>k</italic>-mer embeddings are trained by the dna2vec model on a corpus from dm3. Step 3: The <italic>k</italic>-mer embeddings are summed and concatenated to encode RNA sequences. Step 4: An SVM is used as the classifier to distinguish positive from negative samples.</p>
</caption>
<graphic xlink:href="fgene-13-896925-g002.tif"/>
</fig>
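<p>Steps 1 and 3 of the flowchart can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: the embedding table is a hypothetical dictionary with made-up 2-dimensional vectors (real dna2vec embeddings are not reproduced here), <italic>k</italic>-mers missing from the table are simply skipped, and the names <monospace>overlapping_kmers</monospace> and <monospace>encode_sequence</monospace> are invented for this example.</p>
<preformat preformat-type="code">
```python
from typing import Dict, Iterable, List


def overlapping_kmers(sequence: str, k: int) -> List[str]:
    """Step 1: slide a window of width k over the sequence, one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def encode_sequence(
    sequence: str,
    embeddings: Dict[str, List[float]],
    ks: Iterable[int],
    dim: int,
) -> List[float]:
    """Step 3: sum the k-mer embeddings for each k, then concatenate the sums."""
    features: List[float] = []
    for k in ks:
        total = [0.0] * dim
        for kmer in overlapping_kmers(sequence, k):
            vector = embeddings.get(kmer)
            if vector is None:
                continue  # assumption: k-mers absent from the table are ignored
            total = [t + x for t, x in zip(total, vector)]
        features.extend(total)
    return features


# Hypothetical toy embedding table with made-up 2-dimensional vectors.
toy_embeddings = {"AC": [1.0, 0.0], "CG": [0.0, 1.0], "ACG": [2.0, 2.0]}
toy_features = encode_sequence("ACG", toy_embeddings, ks=(2, 3), dim=2)
```
</preformat>
<p>The resulting feature vector has one block of length <monospace>dim</monospace> per value of <italic>k</italic>, so its length does not depend on the length of the input sequence.</p>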
</sec>
<sec id="s2-7">
<title>Parameter Calibration</title>
<p>This section describes how the parameters were optimized. The SVM was implemented with the Python package scikit-learn, using the radial basis function (RBF) as the kernel. A grid search strategy was applied to find the optimal parameters <italic>c</italic> and <italic>&#x3b3;</italic>, where <italic>c</italic> is the cost parameter of the SVM and <italic>&#x3b3;</italic> is the parameter of the RBF kernel. The range of <italic>c</italic> is (2<sup>-5</sup>, 2<sup>15</sup>), and the range of <italic>&#x3b3;</italic> is (2<sup>-15</sup>, 2<sup>-5</sup>). The steps for generating the logarithmic search grid are 2 and 2<sup>-1</sup> for <italic>c</italic> and <italic>&#x3b3;</italic>, respectively. The CNN was implemented with Keras, with the batch size set to 16. A logarithmic grid search strategy was used to find the optimal number of epochs <italic>e</italic> and learning rate <italic>a</italic>. The candidate values of <italic>a</italic> are 10<sup>-4</sup>, 5 &#xd7; 10<sup>-4</sup>, 10<sup>-3</sup>, 5 &#xd7; 10<sup>-3</sup>, 10<sup>-2</sup>, and 5 &#xd7; 10<sup>-2</sup>. The candidate values of <italic>e</italic> are 100, 150, 200, 250, and 300. We used the Weka package to implement C4.5 and evaluated its performance for different values of the parameter <italic>C</italic>, the confidence threshold for pruning. The range of <italic>C</italic> is [0.2, 0.5] with a step of 0.05.</p>
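<p>The SVM search grid described above could be generated as in the following sketch. It assumes that the stated steps are multiplicative factors, that is, successive <italic>c</italic> values are multiplied by 2 and successive <italic>&#x3b3;</italic> values by 2 to the power of -1, and that both end points of each range are included. The helper <monospace>log2_grid</monospace> is hypothetical.</p>
<preformat preformat-type="code">
```python
from typing import List, Tuple


def log2_grid(start_exp: int, stop_exp: int, step: int = 1) -> List[float]:
    """Powers of two from 2**start_exp to 2**stop_exp inclusive (step may be negative)."""
    if step == 0:
        raise ValueError("step must be non-zero")
    # Extend the stop by one step so that range() includes the end point.
    stop = stop_exp + (1 if step > 0 else -1)
    return [2.0 ** e for e in range(start_exp, stop, step)]


# Cost parameter c: 2**-5, 2**-4, ..., 2**15 (each value is twice the previous one).
c_grid = log2_grid(-5, 15)
# Kernel parameter gamma: 2**-5, 2**-6, ..., 2**-15 (each value is half the previous one).
gamma_grid = log2_grid(-5, -15, step=-1)

# Every (c, gamma) pair that a grid search would evaluate under these assumptions.
param_pairs: List[Tuple[float, float]] = [(c, g) for c in c_grid for g in gamma_grid]
```
</preformat>
<p>With scikit-learn, lists such as these can be supplied as the <monospace>C</monospace> and <monospace>gamma</monospace> entries of the parameter grid given to <monospace>GridSearchCV</monospace> wrapped around an RBF-kernel <monospace>SVC</monospace>. Whether the original search was run this way is not stated in the text.</p>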
</sec>
<sec id="s2-8">
<title>Performance Measures</title>
<p>Four statistics, including sensitivity (Sen), specificity (Spe), accuracy (Acc), and Matthews correlation coefficient (MCC), were used to measure the prediction performance of our method. These performance measures can be defined as follows:<disp-formula id="e15">
<mml:math id="m15">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(15)</label>
</disp-formula>
<disp-formula id="e16">
<mml:math id="m16">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(16)</label>
</disp-formula>
<disp-formula id="e17">
<mml:math id="m17">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mtext>and</mml:mtext>
</mml:mrow>
</mml:math>
<label>(17)</label>
</disp-formula>
<disp-formula id="e18">
<mml:math id="m18">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi>C</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(18)</label>
</disp-formula>where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the cross-validation process, respectively.</p>
<p>In addition, we plotted the receiver operating characteristic (ROC) curve and the precision&#x2013;recall (PR) curve to describe the performance of our method. The area under the ROC curve (AUROC) and the area under the PR curve (AUPR) were also recorded as performance indicators.</p>
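<p>For reference, Eqs. 15&#x2013;18 translate directly into code. The sketch below uses a hypothetical function name and made-up confusion-matrix counts. It assumes that both classes are present, so that the denominators of Sen and Spe are non-zero, and it returns an MCC of 0 when the MCC denominator vanishes, which is a common convention rather than part of Eq. 18.</p>
<preformat preformat-type="code">
```python
import math
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sen, Spe, Acc, and MCC from confusion-matrix counts (Eqs. 15-18)."""
    sen = tp / (tp + fn)                    # Eq. 15: fraction of positives recovered
    spe = tn / (tn + fp)                    # Eq. 16: fraction of negatives recovered
    acc = (tp + tn) / (tp + fp + tn + fn)   # Eq. 17: fraction of all samples correct
    denom = math.sqrt((tp + fn) * (tn + fn) * (tp + fp) * (tn + fp))
    # Eq. 18; a zero denominator is mapped to MCC = 0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Acc": acc, "MCC": mcc}


# Made-up counts for illustration only (not results from this study).
toy_metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```
</preformat>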
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>Results</title>
<sec id="s3-1">
<title>Performance of Different Kinds of Features and Classifiers</title>
<p>In this study, nine kinds of <italic>k</italic>-mer embeddings were obtained: six single-<italic>k</italic> embeddings and three multiple-<italic>k</italic> combinations. The single <italic>k</italic> values range from 3 to 8. The multiple-<italic>k</italic> combinations are the 4, 5, 6-mer combination, the 6, 7, 8-mer combination, and the 3, 4, 5, 6, 7, 8-mer combination. We first evaluated the performance of each single-<italic>k</italic> embedding and then evaluated the three multiple-<italic>k</italic> combinations.</p>
<p>Three machine learning-based classifiers were applied in this study: SVM, CNN, and C4.5. The parameters of these classifiers were optimized as described in the method section. The optimization process is recorded as mesh surface plots in <xref ref-type="sec" rid="s11">Supplementary Figures S1&#x2013;S3</xref> in the supplementary materials. The data for quantitative analysis are recorded in <xref ref-type="sec" rid="s11">Supplementary Tables S1&#x2013;S27</xref>. The optimal parameters for the different classifiers are as follows: the <italic>c</italic> and <italic>&#x3b3;</italic> of SVM on the 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer and 3, 4, 5, 6, 7, 8-mer are (2<sup>9</sup>, 2<sup>-5</sup>), (2<sup>7</sup>, 2<sup>-5</sup>), (2<sup>7</sup>, 2<sup>-5</sup>), (2<sup>7</sup>, 2<sup>-5</sup>), (2<sup>7</sup>, 2<sup>-5</sup>), (2<sup>7</sup>, 2<sup>-5</sup>), (2<sup>5</sup>, 2<sup>-5</sup>), (2<sup>5</sup>, 2<sup>-5</sup>), and (2<sup>4</sup>, 2<sup>-5</sup>); the <italic>a</italic> and <italic>e</italic> of CNN on the 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer and 3, 4, 5, 6, 7, 8-mer are (5 &#xd7; 10<sup>-2</sup>, 200), (5 &#xd7; 10<sup>-2</sup>, 150), (5 &#xd7; 10<sup>-2</sup>, 150), (5 &#xd7; 10<sup>-2</sup>, 250), (5 &#xd7; 10<sup>-2</sup>, 150), (5 &#xd7; 10<sup>-2</sup>, 250), (5 &#xd7; 10<sup>-2</sup>, 150), (5 &#xd7; 10<sup>-2</sup>, 100), and (5 &#xd7; 10<sup>-2</sup>, 150); the <italic>C</italic> of C4.5 on the 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer and 3, 4, 5, 6, 7, 8-mer are 0.45, 0.3, 0.2, 0.5, 0.2, 0.45, 0.25, 0.3, and 0.3. The performance of all models was evaluated by 10 repetitions of 5-fold cross-validation. The optimal performance is recorded in <xref ref-type="fig" rid="F3">Figure 3</xref> and <xref ref-type="sec" rid="s11">Supplementary Tables S28&#x2013;S54</xref>.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Performance of different kinds of features on SVM, CNN, and C4.5. Cyan, orange, gray, yellow, blue, and green represent the performance of 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, and 8-mer embedding features, respectively. Purple, pink, and red represent the performance of 4, 5, 6-mer concatenated embeddings, 6, 7, 8-mer concatenated embeddings, and 3, 4, 5, 6, 7, 8-mer concatenated embeddings, respectively. <bold>(A,B)</bold> Performance of different kinds of features on SVM. The standard deviation of SVM on 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer, and 3, 4, 5, 6, 7, 8-mer is in the range (0.001, 0.003), (0.001, 0.003), (0.001, 0.003), (0.001, 0.003), (0.001, 0.003), (0.001, 0.004), (0.001, 0.004), (0.001, 0.003), and (0.001, 0.003); <bold>(C,D)</bold> Performance of different kinds of features on CNN. The standard deviation of CNN on 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer, and 3, 4, 5, 6, 7, 8-mer is in the range (0.003, 0.026), (0.008, 0.071), (0.006, 0.049), (0.015, 0.056), (0.016, 0.055), (0.016, 0.059), (0.008, 0.044), (0.011, 0.058), and (0.008, 0.045); <bold>(E,F)</bold> Performance of different kinds of features on C4.5. The standard deviation of C4.5 on 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 4, 5, 6-mer, 6, 7, 8-mer, and 3, 4, 5, 6, 7, 8-mer is in the range (0.005, 0.545), (0.007, 0.531), (0.005, 0.049), (0.007, 0.685), (0.003, 0.440), (0.007, 0.489), (0.008, 0.630), (0.006, 0.567), and (0.005, 0.518).</p>
</caption>
<graphic xlink:href="fgene-13-896925-g003.tif"/>
</fig>
</sec>
<sec id="s3-2">
<title>Semantic Symmetry of <italic>K</italic>-Mer Embeddings</title>
<p>One of the most important properties of word2vec is that its word embeddings can solve semantic and linguistic analogies (<xref ref-type="bibr" rid="B31">Mikolov et al., 2013</xref>). Therefore, the semantic relations among the <italic>k</italic>-mer embeddings from dna2vec need to be examined. Principal component analysis (PCA) was applied to reveal the relationships among <italic>k</italic>-mer fragments. For 5-mer embeddings, the number of words is 1024. To present the results clearly, we only plot the PCA results of 3-mer and 4-mer embeddings in <xref ref-type="fig" rid="F4">Figure 4</xref>. As shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, many pairs of words tend to be symmetric about the horizontal axis, such as (CGC, GCG), (CTT, AAG), and (TACT, AGTA). Many words with this property are complements or reverse complements of each other. Zou et al. described this phenomenon as semantic symmetry in the human genome (<xref ref-type="bibr" rid="B48">Zou et al., 2019</xref>). We observe and confirm this phenomenon in the <italic>Drosophila</italic> genome.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Visualization of <italic>k</italic>-mer embeddings with PCA. Each dot represents a <italic>k</italic>-mer embedding vector. <bold>(A)</bold> 3-mer embedding. <bold>(B)</bold> 4-mer embedding.</p>
</caption>
<graphic xlink:href="fgene-13-896925-g004.tif"/>
</fig>
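<p>The reverse-complement relation among the example pairs named in the text, (CGC, GCG), (CTT, AAG), and (TACT, AGTA), can be checked with a few lines of code. The sketch below uses the DNA alphabet and an illustrative function name. It checks only the sequence relation, not the geometric symmetry of the embeddings.</p>
<preformat preformat-type="code">
```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(kmer: str) -> str:
    """Complement every base and read the result in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


# Example pairs taken from the text.
example_pairs = [("CGC", "GCG"), ("CTT", "AAG"), ("TACT", "AGTA")]
# True for a pair whose second member is the reverse complement of the first.
pair_checks = [reverse_complement(first) == second for first, second in example_pairs]
```
</preformat>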
</sec>
<sec id="s3-3">
<title>Feature Visualization</title>
<p>We used the t-distributed stochastic neighbor embedding (t-SNE) (<xref ref-type="bibr" rid="B40">van der Maaten and Hinton, 2008</xref>) method to visualize the sequence features. The t-SNE algorithm is an effective way of reducing dimensions for visualization purposes. From the t-SNE visualization, we can judge whether the positive and negative samples are separable in the feature space. We applied t-SNE to reduce the dimension of the features to 2 and 3. We also calculated the J-score, described in the method section, as a quantitative measure of separation in the reduced feature space. As shown in <xref ref-type="fig" rid="F5">Figure 5</xref>, positive and negative samples are highly separable. The J-scores in the 2- and 3-dimensional t-SNE spaces are 0.202 and 0.165, respectively, indicating an acceptable level of separation.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Visualization of sequence features. The red dots represent the positive samples. The blue dots represent the negative samples. <bold>(A)</bold> Visualization of 2-dimensional t-SNE. <bold>(B)</bold> Visualization of 3-dimensional t-SNE.</p>
</caption>
<graphic xlink:href="fgene-13-896925-g005.tif"/>
</fig>
</sec>
<sec id="s3-4">
<title>Performance Comparison With Existing Methods</title>
<p>i5hmCVec was constructed on a low-resolution modification dataset. WeakRM (<xref ref-type="bibr" rid="B19">Huang et al., 2021</xref>) was also proposed for identifying 5hmC modification sites from low-resolution data. The dataset distributions used in i5hmCVec and WeakRM are summarized in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Dataset distributions of i5hmCVec and WeakRM.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Method</th>
<th align="center">Positive<xref ref-type="table-fn" rid="Tfn1">
<sup>a</sup>
</xref>
</th>
<th align="center">Negative<xref ref-type="table-fn" rid="Tfn2">
<sup>b</sup>
</xref>
</th>
<th align="center">Window size</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">i5hmCVec</td>
<td align="char" char=".">2616</td>
<td align="char" char=".">2616</td>
<td align="center">209&#xa0;nt&#x223c;8097&#xa0;nt</td>
</tr>
<tr>
<td align="left">WeakRM (training)</td>
<td align="char" char=".">1875</td>
<td align="char" char=".">1875</td>
<td align="center">210&#xa0;nt&#x223c;8090&#xa0;nt</td>
</tr>
<tr>
<td align="left">WeakRM (validation)</td>
<td align="char" char=".">235</td>
<td align="char" char=".">235</td>
<td align="center">210&#xa0;nt&#x223c;8090&#xa0;nt</td>
</tr>
<tr>
<td align="left">WeakRM (testing)</td>
<td align="char" char=".">234</td>
<td align="char" char=".">234</td>
<td align="center">210&#xa0;nt&#x223c;8090&#xa0;nt</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="Tfn1">
<label>a</label>
<p>Positive samples are sequences that contain 5hmC sites.</p>
</fn>
<fn id="Tfn2">
<label>b</label>
<p>Negative samples are sequences that do not contain 5hmC sites.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>We used the dataset from WeakRM to train the i5hmCVec model. We also reproduced WeakRM to obtain more types of performance metrics. Owing to unavoidable randomness, our reproduced performance values differ slightly from those originally reported. The differences are so small that the comparison results do not change. As shown in <xref ref-type="table" rid="T2">Table 2</xref>, i5hmCVec achieved 0.846, 0.920, 0.908, and 0.692 on Acc, AUROC, AUPR, and MCC, respectively, which are higher than the corresponding values of WeakRM. In addition, we compared the training time of i5hmCVec and WeakRM. Training WeakRM takes about 500&#xa0;s, while training i5hmCVec takes about 25&#xa0;s. To present the results more intuitively, we plotted the ROC and PR curves of the two models in <xref ref-type="fig" rid="F6">Figure 6</xref>. As shown in <xref ref-type="fig" rid="F6">Figure 6</xref>, both the AUROC and the AUPR of i5hmCVec are slightly better than those of WeakRM. Overall, i5hmCVec achieved better performance than WeakRM on a low-resolution modification dataset.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Performance of i5hmCVec and WeakRM on the dataset from WeakRM.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Method</th>
<th align="center">Acc<xref ref-type="table-fn" rid="Tfn3">
<sup>a</sup>
</xref>
</th>
<th align="center">Sen<xref ref-type="table-fn" rid="Tfn4">
<sup>b</sup>
</xref>
</th>
<th align="center">Spe<xref ref-type="table-fn" rid="Tfn5">
<sup>c</sup>
</xref>
</th>
<th align="center">AUROC<xref ref-type="table-fn" rid="Tfn6">
<sup>d</sup>
</xref>
</th>
<th align="center">AUPR<xref ref-type="table-fn" rid="Tfn7">
<sup>e</sup>
</xref>
</th>
<th align="center">MCC<xref ref-type="table-fn" rid="Tfn8">
<sup>f</sup>
</xref>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">WeakRM</td>
<td align="char" char=".">0.790</td>
<td align="char" char=".">0.617</td>
<td align="char" char=".">
<bold>0.967</bold>
</td>
<td align="char" char=".">0.892</td>
<td align="char" char=".">0.905</td>
<td align="char" char=".">0.619</td>
</tr>
<tr>
<td align="left">i5hmCVec</td>
<td align="char" char=".">
<bold>0.846</bold>
<xref ref-type="table-fn" rid="Tfn9">
<sup>g</sup>
</xref>
</td>
<td align="char" char=".">
<bold>0.838</bold>
</td>
<td align="char" char=".">0.855</td>
<td align="char" char=".">
<bold>0.920</bold>
</td>
<td align="char" char=".">
<bold>0.908</bold>
</td>
<td align="char" char=".">
<bold>0.692</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="Tfn3">
<label>a</label>
<p>
<italic>Acc</italic> is short for accuracy.</p>
</fn>
<fn id="Tfn4">
<label>b</label>
<p>
<italic>Sen</italic> is short for sensitivity.</p>
</fn>
<fn id="Tfn5">
<label>c</label>
<p>
<italic>Spe</italic> is short for specificity.</p>
</fn>
<fn id="Tfn6">
<label>d</label>
<p>AUROC means the area under the ROC curve.</p>
</fn>
<fn id="Tfn7">
<label>e</label>
<p>AUPR means the area under the PR curve.</p>
</fn>
<fn id="Tfn8">
<label>f</label>
<p>
<italic>MCC</italic> is short for Matthews correlation coefficient.</p>
</fn>
<fn id="Tfn9">
<label>g</label>
<p>Boldface indicates the best performance on each metric among methods.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>ROC and PR curves of i5hmCVec and WeakRM on the dataset from WeakRM. <bold>(A)</bold> ROC curve. The X-axis is the false positive rate, and the Y-axis is the true positive rate. <bold>(B)</bold> PR curve. The X-axis is the recall, and the Y-axis is the precision.</p>
</caption>
<graphic xlink:href="fgene-13-896925-g006.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>Discussion</title>
<p>Identifying modification sites is an important task in studying 5hmC modification. In this study, we used machine learning methods to construct the model. A machine learning problem involves three key steps.</p>
<p>First, a high-quality dataset is essential for building an effective model. We constructed the low-resolution benchmarking dataset from experimental results (<xref ref-type="bibr" rid="B11">Delatte et al., 2016</xref>). We did not randomly select cytosine sites within peak regions, as done by <xref ref-type="bibr" rid="B28">Liu Y. et al. (2020</xref>), because such a strategy may lead to many false-positive samples (<xref ref-type="bibr" rid="B24">Kunqi Chen et al., 2019</xref>). In addition, to ensure high sequence quality, we only employed the high-quality chromosome sequences in the genome assembly.</p>
<p>Second, the samples in the dataset should be represented by informative numerical vectors. We encoded RNA sequences using the <italic>k</italic>-mer embeddings derived from dna2vec. According to our results, the feature vectors can effectively separate positive and negative samples. These results suggest that this encoding scheme is suitable for our study.</p>
<p>Finally, a suitable classifier should be used for constructing the model. We compared the performance of SVM, C4.5, and CNN. The SVM classifier performed best. In addition, we optimized the parameters using a grid search strategy.</p>
<p>Although our model was trained on low-resolution data, we also evaluated its performance on high-resolution data. We performed 10 repetitions of 5-fold cross-validation on the benchmarking dataset from iRNA5hmC (<xref ref-type="bibr" rid="B28">Liu Y. et al., 2020</xref>). The sequences in iRNA5hmC are 41&#xa0;nt long. The results are recorded in <xref ref-type="table" rid="T3">Table 3</xref>. According to the results, i5hmCVec does not achieve the expected performance on a high-resolution modification dataset. We speculate that there may be two reasons for this. One is the low quality of the high-resolution dataset. The high-resolution dataset of 5hmC modification was developed by Liu et al. with a random site picking strategy (<xref ref-type="bibr" rid="B28">Liu Y. et al., 2020</xref>), which may lead to many false positives.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Performance of i5hmCVec and iRNA5hmC on the benchmark dataset from iRNA5hmC.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Method</th>
<th align="center">Acc<xref ref-type="table-fn" rid="Tfn10">
<sup>a</sup>
</xref>
</th>
<th align="center">Sen<xref ref-type="table-fn" rid="Tfn11">
<sup>b</sup>
</xref>
</th>
<th align="center">Spe<xref ref-type="table-fn" rid="Tfn12">
<sup>c</sup>
</xref>
</th>
<th align="center">AUROC<xref ref-type="table-fn" rid="Tfn13">
<sup>d</sup>
</xref>
</th>
<th align="center">AUPR<xref ref-type="table-fn" rid="Tfn14">
<sup>e</sup>
</xref>
</th>
<th align="center">MCC<xref ref-type="table-fn" rid="Tfn15">
<sup>f</sup>
</xref>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">iRNA5hmC</td>
<td align="char" char=".">
<bold>0.655</bold>
<xref ref-type="table-fn" rid="Tfn16">
<bold>
<sup>g</sup>
</bold>
</xref>
</td>
<td align="char" char=".">
<bold>0.677</bold>
</td>
<td align="char" char=".">0.644</td>
<td align="char" char=".">
<bold>0.697</bold>
</td>
<td align="char" char=".">
<bold>0.685</bold>
</td>
<td align="char" char=".">
<bold>0.310</bold>
</td>
</tr>
<tr>
<td rowspan="2" align="left">i5hmCVec<xref ref-type="table-fn" rid="Tfn17">
<sup>h</sup>
</xref>
</td>
<td align="char" char=".">0.642</td>
<td align="char" char=".">0.636</td>
<td align="char" char=".">
<bold>0.647</bold>
</td>
<td align="char" char=".">0.684</td>
<td align="char" char=".">0.676</td>
<td align="char" char=".">0.284</td>
</tr>
<tr>
<td align="char" char=".">&#xb1;0.008</td>
<td align="char" char=".">&#xb1;0.010</td>
<td align="char" char=".">
<bold>&#xb1;0.009</bold>
</td>
<td align="char" char=".">&#xb1;0.007</td>
<td align="char" char=".">&#xb1;0.007</td>
<td align="char" char=".">&#xb1;0.016</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="Tfn10">
<label>a</label>
<p>
<italic>Acc</italic> is short for accuracy.</p>
</fn>
<fn id="Tfn11">
<label>b</label>
<p>
<italic>Sen</italic> is short for sensitivity.</p>
</fn>
<fn id="Tfn12">
<label>c</label>
<p>
<italic>Spe</italic> is short for specificity.</p>
</fn>
<fn id="Tfn13">
<label>d</label>
<p>AUROC means the area under the ROC curve.</p>
</fn>
<fn id="Tfn14">
<label>e</label>
<p>AUPR means the area under the PR curve.</p>
</fn>
<fn id="Tfn15">
<label>f</label>
<p>
<italic>MCC</italic> is short for Matthews correlation coefficient.</p>
</fn>
<fn id="Tfn16">
<label>g</label>
<p>Boldface indicates the best performance on each metric among different methods.</p>
</fn>
<fn id="Tfn17">
<label>h</label>
<p>Performance of i5hmCVec on the benchmark dataset from iRNA5hmC with 10 repetitions of 5-fold cross-validation. Results are expressed as the mean and standard deviation over the 10 repetitions.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>The other is the limited resolution of our model. The length of the low-resolution sequences is between 209&#xa0;nt and 8097&#xa0;nt, while the length of the high-resolution sequences is 41&#xa0;nt, which is much shorter than the lower bound of the low-resolution dataset. To estimate the resolution of our model, we evaluated the performance of i5hmCVec on negative samples with different length restrictions. We re-selected RNA sequences with lengths ranging from 20&#xa0;nt to 8100&#xa0;nt from the non-peak regions of transcripts carrying peak regions as an independent testing dataset. It is worth noting that, to prevent information leakage, there is no regional overlap between these negative samples and the negative samples in the benchmarking dataset. In addition, since only negative samples are labeled, Spe is used as the performance metric. As shown in <xref ref-type="fig" rid="F7">Figure 7</xref>, when the sequence length is less than 1000&#xa0;nt, Spe gradually drops. When the sequence length is around 100&#xa0;nt, Spe falls sharply. Although Spe increases drastically when the sequence length is less than 100&#xa0;nt, we believe this is caused by over-fitting on negative samples. Therefore, the i5hmCVec model is not suitable for high-resolution datasets.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Performance of i5hmCVec on negative datasets of different sequence lengths in an independent test. The X-axis is the length of the sequences in the negative dataset. The Y-axis is the Spe of i5hmCVec.</p>
</caption>
<graphic xlink:href="fgene-13-896925-g007.tif"/>
</fig>
</sec>
<sec sec-type="conclusion" id="s5">
<title>Conclusion</title>
<p>In this study, we proposed a novel model named i5hmCVec for identifying 5hmC modification sites. We constructed a high-quality low-resolution 5hmC modification dataset and built i5hmCVec on dna2vec technology. i5hmCVec achieved better performance than state-of-the-art methods on a low-resolution dataset. In addition, we analyzed semantic symmetry in the <italic>Drosophila</italic> genome. We hope our findings will be useful for future studies.</p>
</sec>
</body>
<back>
<sec id="s6">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. These data can be found at: <ext-link ext-link-type="uri" xlink:href="https://github.com/liu-h-y/5hmC_model">https://github.com/liu-h-y/5hmC_model</ext-link>.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>H-YL collected the data, implemented the algorithm, performed the experiments, analyzed the results, and wrote the manuscript. P-FD directed the whole study, conceptualized the algorithm, supervised the experiments, analyzed the results, and wrote the manuscript.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This work was supported by the National Natural Science Foundation of China (NSFC 61872268) and the National Key R&#x26;D Program of China (2018YFC0910405).</p>
</sec>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s11">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fgene.2022.896925/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fgene.2022.896925/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="Image3.TIF" id="SM1" mimetype="application/TIF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Image2.TIF" id="SM2" mimetype="application/TIF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Image1.TIF" id="SM3" mimetype="application/TIF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table2.DOCX" id="SM4" mimetype="application/DOCX" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Table1.XLSX" id="SM5" mimetype="application/XLSX" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ahmed</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Hossain</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Uddin</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Taherzadeh</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Sharma</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shatabda</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Accurate Prediction of RNA 5-hydroxymethylcytosine Modification by Utilizing Novel Position-specific Gapped K-Mer Descriptors</article-title>. <source>Comput. Struct. Biotechnol. J.</source> <volume>18</volume>, <fpage>3528</fpage>&#x2013;<lpage>3538</lpage>. <pub-id pub-id-type="doi">10.1016/j.csbj.2020.10.032</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Akbar</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Hayat</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Iqbal</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Tahir</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>iRNA-PseTNC: Identification of RNA 5-methylcytosine Sites Using Hybrid Vector Space of Pseudo Nucleotide Composition</article-title>. <source>Front. Comput. Sci.</source> <volume>14</volume>, <fpage>451</fpage>&#x2013;<lpage>460</lpage>. <pub-id pub-id-type="doi">10.1007/s11704-018-8094-9</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Asgari</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Mofrad</surname>
<given-names>M. R. K.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics</article-title>. <source>PLoS One</source> <volume>10</volume>, <fpage>e0141287</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0141287</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bachman</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Uribe-Lewis</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Murrell</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Balasubramanian</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>5-Hydroxymethylcytosine Is a Predominantly Stable DNA Modification</article-title>. <source>Nat. Chem.</source> <volume>6</volume>, <fpage>1049</fpage>&#x2013;<lpage>1055</lpage>. <pub-id pub-id-type="doi">10.1038/nchem.2064</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Boccaletto</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Machnicka</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Purta</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Pi&#x105;tkowski</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Bagi&#x144;ski</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wirecki</surname>
<given-names>T. K.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>MODOMICS: a Database of RNA Modification Pathways. 2017 Update</article-title>. <source>Nucleic Acids Res.</source> <volume>46</volume>, <fpage>D303</fpage>&#x2013;<lpage>D307</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkx1030</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bottou</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>Stochastic Gradient Descent Tricks</article-title>,&#x201d; in <source>
<italic>Neural Networks: Tricks of the Trade: Second Edition</italic> Lecture Notes in Computer Science.</source> Editors <person-group person-group-type="editor">
<name>
<surname>Montavon</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Orr</surname>
<given-names>G. B.</given-names>
</name>
<name>
<surname>M&#xfc;ller</surname>
<given-names>K.-R.</given-names>
</name>
</person-group> (<publisher-loc>Berlin, Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>421</fpage>&#x2013;<lpage>436</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-35289-8_25</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>A. I.</given-names>
</name>
<name>
<surname>Webb</surname>
<given-names>G. I.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Comprehensive Review and Assessment of Computational Methods for Predicting RNA Post-transcriptional Modification Sites from RNA Sequences</article-title>. <source>Brief. Bioinform.</source> <volume>21</volume>, <fpage>1676</fpage>&#x2013;<lpage>1696</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbz112</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cowling</surname>
<given-names>V. H.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Regulation of mRNA Cap Methylation</article-title>. <source>Biochem. J.</source> <volume>425</volume>, <fpage>295</fpage>&#x2013;<lpage>302</lpage>. <pub-id pub-id-type="doi">10.1042/BJ20091352</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dai</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Iterative Feature Representation Algorithm to Improve the Predictive Performance of N7-Methylguanosine Sites</article-title>. <source>Brief. Bioinform.</source> <volume>22</volume>, <fpage>bbaa278</fpage>. <pub-id pub-id-type="doi">10.1093/bib/bbaa278</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Boer</surname>
<given-names>P.-T.</given-names>
</name>
<name>
<surname>Kroese</surname>
<given-names>D. P.</given-names>
</name>
<name>
<surname>Mannor</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Rubinstein</surname>
<given-names>R. Y.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>A Tutorial on the Cross-Entropy Method</article-title>. <source>Ann. Oper. Res.</source> <volume>134</volume>, <fpage>19</fpage>&#x2013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1007/s10479-005-5724-z</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Delatte</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Ngoc</surname>
<given-names>L. V.</given-names>
</name>
<name>
<surname>Collignon</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Bonvin</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Deplus</surname>
<given-names>R.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>Transcriptome-wide Distribution and Function of RNA Hydroxymethylcytosine</article-title>. <source>Science</source> <volume>351</volume>, <fpage>282</fpage>&#x2013;<lpage>285</lpage>. <pub-id pub-id-type="doi">10.1126/science.aac5253</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Deng</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>D2VCB: A Hybrid Deep Neural Network for the Prediction of <italic>In-Vivo</italic> Protein-DNA Binding from Combined DNA Sequence</article-title>,&#x201d; in <source>2019 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2019</source>. Editors <person-group person-group-type="editor">
<name>
<surname>Yoo</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Bi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<publisher-loc>San Diego, CA, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>74</fpage>&#x2013;<lpage>77</lpage>. <comment>November 18-21, 2019</comment>. <pub-id pub-id-type="doi">10.1109/BIBM47256.2019.8983051</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dominissini</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Moshitch-Moshkovitz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Schwartz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Salmon-Divon</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ungar</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Osenberg</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2012</year>). <article-title>Topology of the Human and Mouse m6A RNA Methylomes Revealed by m6A-Seq</article-title>. <source>Nature</source> <volume>485</volume>, <fpage>201</fpage>&#x2013;<lpage>206</lpage>. <pub-id pub-id-type="doi">10.1038/nature11112</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dou</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Xiang</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Prediction of m5C Modifications in RNA Sequences by Combining Multiple Sequence Features</article-title>. <source>Mol. Ther. - Nucleic Acids</source> <volume>21</volume>, <fpage>332</fpage>&#x2013;<lpage>342</lpage>. <pub-id pub-id-type="doi">10.1016/j.omtn.2020.06.004</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Guerrero</surname>
<given-names>C. R.</given-names>
</name>
<name>
<surname>Zhong</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Amato</surname>
<given-names>N. J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2014</year>). <article-title>Tet-mediated Formation of 5-hydroxymethylcytosine in RNA</article-title>. <source>J. Am. Chem. Soc.</source> <volume>136</volume>, <fpage>11582</fpage>&#x2013;<lpage>11585</lpage>. <pub-id pub-id-type="doi">10.1021/ja505305z</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hong</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Identifying Enhancer-Promoter Interactions with Neural Network Based on Pre-trained DNA Vectors and Attention Mechanism</article-title>. <source>Bioinformatics</source> <volume>36</volume>, <fpage>1037</fpage>&#x2013;<lpage>1043</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btz694</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hoskins</surname>
<given-names>R. A.</given-names>
</name>
<name>
<surname>Carlson</surname>
<given-names>J. W.</given-names>
</name>
<name>
<surname>Wan</surname>
<given-names>K. H.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mendez</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Galle</surname>
<given-names>S. E.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>The Release 6 Reference Sequence of the <italic>Drosophila melanogaster</italic> Genome</article-title>. <source>Genome Res.</source> <volume>25</volume>, <fpage>445</fpage>&#x2013;<lpage>458</lpage>. <pub-id pub-id-type="doi">10.1101/gr.185579.114</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>BERMP: a Cross-Species Classifier for Predicting m6A Sites by Integrating a Deep Learning Algorithm and a Random forest Approach</article-title>. <source>Int. J. Biol. Sci.</source> <volume>14</volume>, <fpage>1669</fpage>&#x2013;<lpage>1677</lpage>. <pub-id pub-id-type="doi">10.7150/ijbs.27819</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Coenen</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Weakly Supervised Learning of RNA Modifications from Low-Resolution Epitranscriptome Data</article-title>. <source>Bioinformatics</source> <volume>37</volume>, <fpage>i222</fpage>&#x2013;<lpage>i230</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btab278</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huber</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>van Delft</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Mendil</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Bachman</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Smollett</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Werner</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Formation and Abundance of 5-hydroxymethylcytosine in RNA</article-title>. <source>Chembiochem</source> <volume>16</volume>, <fpage>752</fpage>&#x2013;<lpage>755</lpage>. <pub-id pub-id-type="doi">10.1002/cbic.201500013</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kai Chen</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>G.-Z.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>High-Resolution Mapping of N6-Methyladenosine in Transcriptome and Genome Using a Photo-Crosslinking-Assisted Strategy</article-title>. <source>Methods Enzymol.</source> <volume>560</volume>, <fpage>161</fpage>&#x2013;<lpage>185</lpage>. <pub-id pub-id-type="doi">10.1016/bs.mie.2015.03.012</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karolchik</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Baertsch</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Diekhans</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Furey</surname>
<given-names>T. S.</given-names>
</name>
<name>
<surname>Hinrichs</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Y. T.</given-names>
</name>
<etal/>
</person-group> (<year>2003</year>). <article-title>The UCSC Genome Browser Database</article-title>. <source>Nucleic Acids Res.</source> <volume>31</volume>, <fpage>51</fpage>&#x2013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkg129</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kimothi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Soni</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Biyani</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Hogan</surname>
<given-names>J. M.</given-names>
</name>
</person-group> (<year>2016</year>). <source>Distributed Representations for Biological Sequence Analysis</source>. <comment>CoRR abs/1608.05949. Available at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1608.05949">http://arxiv.org/abs/1608.05949</ext-link> (Accessed January 29, 2022)</comment>. </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kunqi Chen</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Rong</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Z.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>WHISTLE: a High-Accuracy Map of the Human N6-Methyladenosine (m6A) Epitranscriptome Predicted Using a Machine Learning Approach</article-title>. <source>Nucleic Acids Res.</source> <volume>47</volume>, <fpage>e41</fpage>. <pub-id pub-id-type="doi">10.1093/nar/gkz074</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Linder</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Grozhik</surname>
<given-names>A. V.</given-names>
</name>
<name>
<surname>Olarerin-George</surname>
<given-names>A. O.</given-names>
</name>
<name>
<surname>Meydan</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Mason</surname>
<given-names>C. E.</given-names>
</name>
<name>
<surname>Jaffrey</surname>
<given-names>S. R.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Single-nucleotide-resolution Mapping of m6A and m6Am throughout the Transcriptome</article-title>. <source>Nat. Methods</source> <volume>12</volume>, <fpage>767</fpage>&#x2013;<lpage>772</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.3453</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lindstrom</surname>
<given-names>D. L.</given-names>
</name>
<name>
<surname>Squazzo</surname>
<given-names>S. L.</given-names>
</name>
<name>
<surname>Muster</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Burckin</surname>
<given-names>T. A.</given-names>
</name>
<name>
<surname>Wachter</surname>
<given-names>K. C.</given-names>
</name>
<name>
<surname>Emigh</surname>
<given-names>C. A.</given-names>
</name>
<etal/>
</person-group> (<year>2003</year>). <article-title>Dual Roles for Spt5 in Pre-mRNA Processing and Transcription Elongation Revealed by Identification of Spt5-Associated Proteins</article-title>. <source>Mol. Cell. Biol.</source> <volume>23</volume>, <fpage>1368</fpage>&#x2013;<lpage>1378</lpage>. <pub-id pub-id-type="doi">10.1128/MCB.23.4.1368-1378.2003</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Mao</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2020a</year>). <article-title>m7GPredictor: An Improved Machine Learning-Based Model for Predicting Internal m7G Modifications Using Sequence Properties</article-title>. <source>Anal. Biochem.</source> <volume>609</volume>, <fpage>113905</fpage>. <pub-id pub-id-type="doi">10.1016/j.ab.2020.113905</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2020b</year>). <article-title>iRNA5hmC: The First Predictor to Identify RNA 5-Hydroxymethylcytosine Modifications Using Machine Learning</article-title>. <source>Front. Bioeng. Biotechnol.</source> <volume>8</volume>, <fpage>227</fpage>. <pub-id pub-id-type="doi">10.3389/fbioe.2020.00227</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meyer</surname>
<given-names>K. D.</given-names>
</name>
<name>
<surname>Saletore</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zumbo</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Elemento</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Mason</surname>
<given-names>C. E.</given-names>
</name>
<name>
<surname>Jaffrey</surname>
<given-names>S. R.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Comprehensive Analysis of mRNA Methylation Reveals Enrichment in 3&#x2032; UTRs and Near Stop Codons</article-title>. <source>Cell</source> <volume>149</volume>, <fpage>1635</fpage>&#x2013;<lpage>1646</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2012.05.003</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Miao</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Xin</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Hua</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Leng</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>5-hydroxymethylcytosine Is Detected in RNA from Mouse Brain Tissues</article-title>. <source>Brain Res.</source> <volume>1642</volume>, <fpage>546</fpage>&#x2013;<lpage>552</lpage>. <pub-id pub-id-type="doi">10.1016/j.brainres.2016.04.055</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mikolov</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Corrado</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dean</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2013</year>). &#x201c;<article-title>Efficient Estimation of Word Representations in Vector Space</article-title>,&#x201d; in <source>1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings</source>. Editors <person-group person-group-type="editor">
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>LeCun</surname>
<given-names>Y.</given-names>
</name>
</person-group>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1301.3781">http://arxiv.org/abs/1301.3781</ext-link>
</comment>. </citation>
</ref>
<ref id="B32">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ng</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2017a</year>). <source>dna2vec: Consistent Vector Representations of Variable-Length K-Mers</source>. <comment>CoRR abs/1701.06279. Available at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1701.06279">http://arxiv.org/abs/1701.06279</ext-link>
</comment>. </citation>
</ref>
<ref id="B33">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ng</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2017b</year>). <source>dna2vec: Consistent Vector Representations of Variable-Length K-Mers</source>. <comment>arXiv:1701.06279 [cs, q-bio, stat]. Available at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1701.06279">http://arxiv.org/abs/1701.06279</ext-link> (Accessed January 23, 2022)</comment>. </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pastor</surname>
<given-names>W. A.</given-names>
</name>
<name>
<surname>Pape</surname>
<given-names>U. J.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Henderson</surname>
<given-names>H. R.</given-names>
</name>
<name>
<surname>Lister</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Ko</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2011</year>). <article-title>Genome-wide Mapping of 5-hydroxymethylcytosine in Embryonic Stem Cells</article-title>. <source>Nature</source> <volume>473</volume>, <fpage>394</fpage>&#x2013;<lpage>397</lpage>. <pub-id pub-id-type="doi">10.1038/nature10102</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qiu</surname>
<given-names>W.-R.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>S.-Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Z.-C.</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>K.-C.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>iRNAm5C-PseDNC: Identifying RNA 5-methylcytosine Sites by Incorporating Physical-Chemical Properties into Pseudo Dinucleotide Composition</article-title>. <source>Oncotarget</source> <volume>8</volume>, <fpage>41178</fpage>&#x2013;<lpage>41188</lpage>. <pub-id pub-id-type="doi">10.18632/oncotarget.17104</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sabooh</surname>
<given-names>M. F.</given-names>
</name>
<name>
<surname>Iqbal</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Khan</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Khan</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Maqbool</surname>
<given-names>H. F.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Identifying 5-methylcytosine Sites in RNA Sequence Using Composite Encoding Feature into Chou&#x27;s PseKNC</article-title>. <source>J. Theor. Biol.</source> <volume>452</volume>, <fpage>1</fpage>&#x2013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1016/j.jtbi.2018.04.037</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salzberg</surname>
<given-names>S. L.</given-names>
</name>
</person-group> (<year>1994</year>). <article-title>C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993</article-title>. <source>Mach. Learn.</source> <volume>16</volume>, <fpage>235</fpage>&#x2013;<lpage>240</lpage>. <pub-id pub-id-type="doi">10.1007/BF00993309</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sommer</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lavi</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Darnell</surname>
<given-names>J. E.</given-names>
</name>
</person-group> (<year>1978</year>). <article-title>The Absolute Frequency of Labeled N-6-Methyladenosine in HeLa Cell Messenger RNA Decreases with Label Time</article-title>. <source>J. Mol. Biol.</source> <volume>124</volume>, <fpage>487</fpage>&#x2013;<lpage>499</lpage>. <pub-id pub-id-type="doi">10.1016/0022-2836(78)90183-3</pub-id> </citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Szwagierczak</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bultmann</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>C. S.</given-names>
</name>
<name>
<surname>Spada</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Leonhardt</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Sensitive Enzymatic Quantification of 5-hydroxymethylcytosine in Genomic DNA</article-title>. <source>Nucleic Acids Res.</source> <volume>38</volume>, <fpage>e181</fpage>. <pub-id pub-id-type="doi">10.1093/nar/gkq684</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>van der Maaten</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Visualizing Data Using T-SNE</article-title>. <source>J. Mach. Learn. Res.</source> <volume>9</volume>, <fpage>2579</fpage>&#x2013;<lpage>2605</lpage>. </citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wei Chen</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>K.-C.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>iRNA-Methyl: Identifying N6-Methyladenosine Sites Using Pseudo Nucleotide Composition</article-title>. <source>Anal. Biochem.</source> <volume>490</volume>, <fpage>26</fpage>&#x2013;<lpage>33</lpage>. <pub-id pub-id-type="doi">10.1016/j.ab.2015.08.021</pub-id> </citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wei Chen</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Lv</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>iRNA-m7G: Identifying N7-Methylguanosine Sites by Fusing Multiple Features</article-title>. <source>Mol. Ther. - Nucleic Acids</source> <volume>18</volume>, <fpage>269</fpage>&#x2013;<lpage>274</lpage>. <pub-id pub-id-type="doi">10.1016/j.omtn.2019.08.022</pub-id> </citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>Y.-H.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.-S.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>S.-G.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Prediction of N7-Methylguanosine Sites in Human RNA Based on Optimal Sequence Features</article-title>. <source>Genomics</source> <volume>112</volume>, <fpage>4342</fpage>&#x2013;<lpage>4347</lpage>. <pub-id pub-id-type="doi">10.1016/j.ygeno.2020.07.035</pub-id> </citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hon</surname>
<given-names>G. C.</given-names>
</name>
<name>
<surname>Szulwach</surname>
<given-names>K. E.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>C.-X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2012</year>). <article-title>Base-Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome</article-title>. <source>Cell</source> <volume>149</volume>, <fpage>1368</fpage>&#x2013;<lpage>1380</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2012.04.027</pub-id> </citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>H.-Y.</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>B.-L.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>Y.-Q.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>B.-F.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>The Existence of 5-hydroxymethylcytosine and 5-formylcytosine in Both DNA and RNA in Mammals</article-title>. <source>Chem. Commun.</source> <volume>52</volume>, <fpage>737</fpage>&#x2013;<lpage>740</lpage>. <pub-id pub-id-type="doi">10.1039/c5cc07354e</pub-id> </citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>L.-S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>H.-L.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Transcriptome-wide Mapping of Internal N7-Methylguanosine Methylome in Mammalian mRNA</article-title>. <source>Mol. Cell</source> <volume>74</volume>, <fpage>1304</fpage>&#x2013;<lpage>1316</lpage>. <comment>e8</comment>. <pub-id pub-id-type="doi">10.1016/j.molcel.2019.03.036</pub-id> </citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.-H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>SRAMP: Prediction of Mammalian N6-Methyladenosine (m6A) Sites Based on Sequence-Derived Features</article-title>. <source>Nucleic Acids Res.</source> <volume>44</volume>, <fpage>e91</fpage>. <pub-id pub-id-type="doi">10.1093/nar/gkw104</pub-id> </citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zou</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Xing</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Gene2vec: Gene Subsequence Embedding for Prediction of Mammalian N6-Methyladenosine Sites from mRNA</article-title>. <source>RNA</source> <volume>25</volume>, <fpage>205</fpage>&#x2013;<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1261/rna.069112.118</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>