<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Microbiol.</journal-id>
<journal-title>Frontiers in Microbiology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Microbiol.</abbrev-journal-title>
<issn pub-type="epub">1664-302X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmicb.2022.1061122</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Microbiology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>iProm-phage: A two-layer model to identify phage promoters and their types using a convolutional neural network</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Shujaat</surname><given-names>Muhammad</given-names></name>
<xref rid="aff1" ref-type="aff"><sup>1</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/2031938/overview"/>
</contrib>
<contrib contrib-type="author"><name><surname>Jin</surname><given-names>Joe Sung</given-names></name>
<xref rid="aff2" ref-type="aff"><sup>2</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/2068512/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes"><name><surname>Tayara</surname><given-names>Hilal</given-names></name>
<xref rid="aff3" ref-type="aff"><sup>3</sup></xref>
<xref rid="c001" ref-type="corresp"><sup>&#x002A;</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/667071/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes"><name><surname>Chong</surname><given-names>Kil To</given-names></name>
<xref rid="aff1" ref-type="aff"><sup>1</sup></xref>
<xref rid="aff4" ref-type="aff"><sup>4</sup></xref>
<xref rid="c001" ref-type="corresp"><sup>&#x002A;</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/710191/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Electronics and Information Engineering, Jeonbuk National University</institution>, <addr-line>Jeonju</addr-line>, <country>South Korea</country></aff>
<aff id="aff2"><sup>2</sup><institution>Graduate School of Integrated Energy AI, Jeonbuk National University</institution>, <addr-line>Jeonju</addr-line>, <country>South Korea</country></aff>
<aff id="aff3"><sup>3</sup><institution>School of International Engineering and Science, Jeonbuk National University</institution>, <addr-line>Jeonju</addr-line>, <country>South Korea</country></aff>
<aff id="aff4"><sup>4</sup><institution>Advances Electronics and Information Research Center, Jeonbuk National University</institution>, <addr-line>Jeonju</addr-line>, <country>South Korea</country></aff>
<author-notes>
<fn id="fn0001" fn-type="edited-by">
<p>Edited by: Hao Lin, University of Electronic Science and Technology of China, China</p>
</fn>
<fn id="fn0002" fn-type="edited-by">
<p>Reviewed by: Leyi Wei, Shandong University, China; Yongqiang Xing, Inner Mongolia University of Science and Technology, China</p>
</fn>
<corresp id="c001">&#x002A;Correspondence: Hilal Tayara, <email>hilaltayara@jbnu.ac.kr</email>; Kil To Chong, <email>kitchong@jbnu.ac.kr</email></corresp>
<fn id="fn0003" fn-type="other">
<p>This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>04</day>
<month>11</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>1061122</elocation-id>
<history>
<date date-type="received">
<day>04</day>
<month>10</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>10</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Shujaat, Jin, Tayara and Chong.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Shujaat, Jin, Tayara and Chong</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>The increased interest in phages as antibacterial agents has resulted in a rise in the number of sequenced phage genomes, necessitating the development of user-friendly bioinformatics tools for genome annotation. A promoter is a DNA sequence that is used in the annotation of phage genomes. In this study, we propose a two-layer model called &#x201C;iProm-phage&#x201D; for the prediction and classification of phage promoters. The first layer of the model identifies a query sequence as a promoter or non-promoter; if the query sequence is predicted to be a promoter, the second layer classifies it as a phage or host promoter. Furthermore, rather than using non-coding regions of the genome as a negative set, we created a more challenging negative dataset from promoter sequences. The presented approach improves discrimination while decreasing the frequency of false positive predictions. For feature selection, we investigated 10 distinct feature encoding approaches and utilized them with several machine-learning algorithms and a 1-D convolutional neural network (CNN) model. We found that the one-hot encoding approach combined with the CNN model outperformed the alternatives on the performance metrics. The results of 5-fold cross-validation indicate that the proposed predictor has high potential. Furthermore, to make it easier for experimental scientists to obtain the results they require, we set up a freely accessible and user-friendly web server at <ext-link xlink:href="http://nsclbio.jbnu.ac.kr/tools/iProm-phage/" ext-link-type="uri">http://nsclbio.jbnu.ac.kr/tools/iProm-phage/</ext-link>.</p>
</abstract>
<kwd-group>
<kwd>DNA promoters</kwd>
<kwd>convolutional neural networks</kwd>
<kwd>bioinformatics</kwd>
<kwd>computational biology</kwd>
<kwd>phages</kwd>
</kwd-group>
<counts>
<fig-count count="9"/>
<table-count count="3"/>
<equation-count count="26"/>
<ref-count count="25"/>
<page-count count="13"/>
<word-count count="6463"/>
</counts>
</article-meta>
</front>
<body>
<sec id="sec1" sec-type="intro">
<title>Introduction</title>
<p>Bacteriophages, commonly referred to as phages, are viruses that infect and destroy bacteria (<xref ref-type="bibr" rid="ref16">Salmond and Fineran, 2015</xref>). The number of sequenced phage genomes has increased exponentially in recent decades, primarily owing to their small size and their potential for treating bacterial infections (<xref ref-type="bibr" rid="ref21">Silva and Echeverrigaray, 2012</xref>). This richness of genomic data necessitates the development of user-friendly bioinformatics tools to aid biologists in genome analyses. Recognition of regulatory elements is the most difficult phase in phage genome analysis. Promoters are DNA sequences responsible for transcription initiation. These sequences are difficult to identify because they are composed of short, nonconserved components. However, identifying them is essential to comprehend and describe the genetic regulatory networks of phages, which may permit the engineering of improved phages for medicinal or biotechnological applications (<xref ref-type="bibr" rid="ref5">Guzina and Djordjevic, 2015</xref>).</p>
<p>Several attempts have been made to develop promoter prediction tools for bacterial genomes. The majority of these tools use computational techniques based on the &#x2212;10 and &#x2212;35 motifs (<xref ref-type="bibr" rid="ref20">Sierro et al., 2008</xref>; <xref ref-type="bibr" rid="ref13">Mishra et al., 2020</xref>; <xref ref-type="bibr" rid="ref24">Wang et al., 2020</xref>). In contrast to these promoters with typical motifs, phage genome promoters are composed of host and phage promoters with varying motifs (<xref ref-type="bibr" rid="ref17">Sampaio et al., 2019</xref>).</p>
<p>Therefore, existing tools for bacterial genomes are not suitable for identifying promoters in phages, and dedicated computational tools are required. Prediction of phage promoters has seldom been studied. The PHIRE method (<xref ref-type="bibr" rid="ref11">Lavigne et al., 2004</xref>) systematically scans a bacteriophage genome to determine the frequency of subsequences in a sequence. All sequences are compared, which significantly increases the running time. PromoterHunter (<xref ref-type="bibr" rid="ref10">Klucar et al., 2010</xref>) is an online tool to identify phage promoters; however, it requires additional information as input, such as weight matrices of the two promoter elements, and is limited in the size of the input genome sequences. The PhagePromoter tool (<xref ref-type="bibr" rid="ref17">Sampaio et al., 2019</xref>) can be used to identify promoters across the entire phage genome. It was created using machine learning (ML) methods, such as artificial neural networks or support vector machines, in conjunction with sequence characteristics (size and score of motifs, frequency of adenine and thymine, and free energy value). Additionally, PhagePromoter can distinguish host promoters from phage promoters. However, PhagePromoter requires prior experimental or predictive knowledge, such as the phage family, host bacterium species, and phage type (temperate or virulent), which limits its effectiveness. DPProm (<xref ref-type="bibr" rid="ref25">Wang et al., 2022</xref>) is a convolutional neural network (CNN)-based method for predicting phage promoters and classifying them as phage or host promoters. However, its sequence-processing workflow requires a long time for a query sequence.</p>
<p>Significant progress has been achieved in the essential aspects of phage promoter identification, although improvements are required in different aspects. We identified the following shortcomings of prior research:</p>
<list list-type="order">
<list-item>
<p>Most of the aforementioned studies only classified a query sequence as a promoter or non-promoter. Classification of predicted promoter sequences as phage or host promoters was rare.</p>
</list-item>
<list-item>
<p>Most studies utilized ML models to classify predicted sequences.</p>
</list-item>
<list-item>
<p>Not all studies created a user-friendly and publicly available web server, which has proven inconvenient for practical use by experimental scientists.</p>
</list-item>
<list-item>
<p>Performance analysis of different feature encoding schemes on different ML and CNN models was not performed.</p>
</list-item>
<list-item>
<p>In the previously proposed tools, the number of false positive values for promoter prediction requires further improvement.</p>
</list-item>
<list-item>
<p>Previous studies selected non-coding regions as the negative dataset, which makes the task very easy for the classifier; as a result, the trained model cannot perform well on difficult test datasets.</p>
</list-item>
</list>
<p>In this study, we focused on overcoming these drawbacks to improve the prediction capabilities in identifying phage promoters. First, high-quality benchmark datasets were constructed. Subsequently, we selected the best feature representation and model from a variety of encoding techniques and ML and CNN models. To achieve this, we fed the encoded sequences from each encoding method, in turn, into the various ML and CNN algorithms. Based on the performance evaluation, we chose the one-hot encoding technique and the CNN algorithm. We investigated the sequences and properties of phage promoters and present a two-layer model designated &#x201C;iProm-phage.&#x201D; In the first layer, the query sequence is identified as a promoter or non-promoter. If it is a promoter sequence, then the second layer classifies the identified sequence as a phage promoter or host promoter. To assess model performance, we measured the accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC). All these parameters are frequently used in state-of-the-art methods in computational biology and bioinformatics (<xref ref-type="bibr" rid="ref14">Rahman et al., 2019</xref>; <xref ref-type="bibr" rid="ref1">Ali et al., 2020</xref>; <xref ref-type="bibr" rid="ref19">Shujaat et al., 2020</xref>; <xref ref-type="bibr" rid="ref15">Rehman et al., 2021</xref>). In addition, we evaluated the model using five-fold cross-validation and receiver operating characteristic (ROC) curves. Finally, the iProm-phage web server was built in compliance with the suggested paradigm. The flow diagram of the study is shown in <xref rid="fig1" ref-type="fig">Figure 1</xref>.</p>
<fig position="float" id="fig1">
<label>Figure 1</label>
<caption>
<p>Flow diagram of iProm-phage.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g001.tif"/>
</fig>
</sec>
<sec id="sec2" sec-type="materials|methods">
<title>Materials and methods</title>
<sec id="sec3">
<title>Benchmark dataset</title>
<p>While developing an effective biological predictor, it is critical to select an appropriate benchmark dataset to evaluate the proposed predictive model. We prepared separate datasets for each layer of the model, as described in Sections &#x201C;Dataset for the first layer&#x201D; and &#x201C;Dataset for the second layer.&#x201D;</p>
<sec id="sec4">
<title>Dataset for the first layer</title>
<p>The promoters of phage genomes have been poorly characterized. Only the phiSITE database catalogs the promoters of phage genomes (<xref ref-type="bibr" rid="ref10">Klucar et al., 2010</xref>). The phage promoter sequences utilized in this study are the same as those used in previous studies (<xref ref-type="bibr" rid="ref17">Sampaio et al., 2019</xref>; <xref ref-type="bibr" rid="ref25">Wang et al., 2022</xref>). For the model&#x2019;s first layer, 1,140 promoter sequences from 69 phages were collected and divided into training and test datasets; 901 promoter sequences were utilized as the training dataset and 198 promoter sequences were utilized as the test dataset. <xref ref-type="supplementary-material" rid="SM1">Supplementary Table S1</xref> in the <xref ref-type="supplementary-material" rid="SM1">Supplementary file</xref> summarizes the promoter sequences from each phage genome.</p>
<p>The selection of a negative dataset is an important step in ensuring model performance. In previous studies, non-promoter regions were randomly selected to build a negative dataset. However, this approach is problematic because the positive and negative sets share no sequence content. Consequently, a model can immediately detect the key differences between the two groups, and its precision is not maintained when it is tested on more difficult datasets. To overcome this problem, we propose a negative dataset generation technique. We created a negative dataset from the positive promoter sequences in three steps. First, each positive sequence is divided into eight subsequences. Second, five subsequences are randomly selected and placed. Third, the remaining three subsequences are kept at their original positions. Using this method, each positive promoter sequence yields one negative sequence in which 35&#x2013;40% of the promoter sequence is conserved. This proportion is ideal as a reliable predictor of promoter activity.</p>
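<p>The three-step procedure can be sketched as follows. This is a minimal illustration, not the authors&#x2019; implementation: the function names are hypothetical, the sequence is split into eight contiguous pieces of near-equal length, and the five selected pieces are assumed to be replaced by random nucleotides of the same length. Shuffling the selected pieces among their positions is another plausible reading of the second step.</p>
<preformat><![CDATA[
```python
import random


def split_into_chunks(seq, n_chunks=8):
    """Split seq into n_chunks contiguous pieces of near-equal length."""
    base, extra = divmod(len(seq), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        size = base + (1 if k < extra else 0)  # first `extra` chunks get one more base
        chunks.append(seq[start:start + size])
        start += size
    return chunks


def make_negative(promoter, n_chunks=8, n_random=5, rng=None):
    """Build one negative sequence from one positive promoter sequence.

    Assumption (not stated in the text): the n_random selected chunks are
    overwritten with random nucleotides; the other chunks stay in place.
    """
    if rng is None:
        rng = random.Random()
    chunks = split_into_chunks(promoter, n_chunks)
    for k in rng.sample(range(n_chunks), n_random):
        chunks[k] = "".join(rng.choice("ACGT") for _ in chunks[k])
    return "".join(chunks)
```
]]></preformat>
<p>With eight pieces and three of them untouched, 3/8 (37.5%) of each negative sequence is copied from the promoter, which falls within the 35&#x2013;40% range stated above.</p>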
</sec>
<sec id="sec5">
<title>Dataset for the second layer</title>
<p>To create the positive and negative sets for the second layer of the model, we retrieved the type annotation (host or phage) of each promoter sequence. The collection contains several promoters of unknown type. Finally, we collected 139 phage promoters as negative samples and 478 host promoters as positive samples. We randomly chose 80% of these positive and negative samples as the training dataset and 20% as the test dataset. <xref rid="tab1" ref-type="table">Table 1</xref> lists the dataset sizes for both layers.</p>
<table-wrap position="float" id="tab1">
<label>Table 1</label>
<caption>
<p>Summary of the benchmark dataset.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Model Layer</th>
<th align="left" valign="top">Dataset</th>
<th align="center" valign="top">Promoter</th>
<th align="center" valign="top">Non-promoter</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top" char="." rowspan="2">First layer</td>
<td align="char" valign="top" char="&#x00B1;">Training</td>
<td align="center" valign="top">901</td>
<td align="center" valign="top">901</td>
</tr>
<tr>
<td align="left" valign="top" char="&#x00B1;">Test</td>
<td align="center" valign="top">198</td>
<td align="center" valign="top">198</td>
</tr>
<tr>
<td align="left" valign="top" char="." rowspan="3">Second layer</td>
<td/>
<td align="char" valign="top" char="&#x00B1;">
<bold>Phage</bold>
</td>
<td align="char" valign="top" char="&#x00B1;">
<bold>Host</bold>
</td>
</tr>
<tr>
<td align="left" valign="top" char="&#x00B1;">Training</td>
<td align="center" valign="top">111</td>
<td align="center" valign="top">382</td>
</tr>
<tr>
<td align="left" valign="top" char="&#x00B1;">Test</td>
<td align="center" valign="top">28</td>
<td align="center" valign="top">96</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="sec6">
<title>Methods</title>
<p>In this section, we briefly explain the proposed model, feature encoding techniques, and baseline models.</p>
<sec id="sec7">
<title>Proposed model</title>
<p>The proposed two-layer model is designated &#x201C;iProm-phage.&#x201D; The model&#x2019;s first layer predicts whether the query sequence is a promoter or a non-promoter. If the sequence is predicted to be a promoter, then the model&#x2019;s second layer classifies it as a phage promoter or a host promoter. <xref rid="fig2" ref-type="fig">Figure 2</xref> illustrates the proposed model.</p>
<fig position="float" id="fig2">
<label>Figure 2</label>
<caption>
<p>Flow diagram of the two-layer model.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g002.tif"/>
</fig>
<p>Based on performance measures, we opted for the CNN model and one-hot encoding technique for this two-layer predictor. The selection of the model and encoding technique is briefly explained in the performance measure section.</p>
</sec>
<sec id="sec8">
<title>Convolutional neural network model architecture</title>
<p>The CNN is composed of two one-dimensional convolutional (Conv1D) layers, each followed by a maximum (max) pooling layer and a dropout layer. Both Conv1D layers use 16 filters with a kernel size of 5. Both max pooling layers use a pool size of four with a stride of two. The dropout layer after each max pooling layer has a rate of 0.5. A flatten layer is then applied, followed by a dense layer with 64 nodes and a further dropout layer with a rate of 0.5. The ReLU activation function is used in all Conv1D layers and in the 64-node dense layer. Finally, a dense layer with a single node and a sigmoid activation function is employed as the output layer, which classifies the input sequence as positive or negative based on the probability score. The mathematical expression for the sigmoid activation function is as follows:</p>
<disp-formula id="E1">
<mml:math id="M1">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>We used L2 regularization and bias regularization in the convolution and dense layers to ensure that the model did not overfit. The values for both regularizations were set to 0.0001. The loss function of the model is binary cross-entropy. Adam was used as the optimizer. The batch size was set to 20 with a total of 85 epochs. iProm-phage was created and trained using the Keras framework. The CNN architecture is illustrated in <xref rid="fig3" ref-type="fig">Figure 3</xref>.</p>
<fig position="float" id="fig3">
<label>Figure 3</label>
<caption>
<p>iProm-phage CNN architecture.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g003.tif"/>
</fig>
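<p>As a rough check on the architecture described in this section, the tensor shapes and trainable parameter count can be worked out by hand. The sketch below assumes a one-hot input of shape (99, 4), unpadded (&#x201C;valid&#x201D;) convolution and pooling, and a convolution stride of one, which are the Keras defaults; none of these settings is stated in the text, so the resulting numbers are illustrative rather than reported values. The function names are hypothetical.</p>
<preformat><![CDATA[
```python
def window_out(length, window, stride=1):
    """Output length of an unpadded 1-D convolution or pooling window."""
    return (length - window) // stride + 1


def layer_shapes(seq_len=99, channels=4, filters=16, kernel=5,
                 pool=4, pool_stride=2, dense_units=64):
    """Return ([(layer name, output shape), ...], trainable parameter count)."""
    shapes = [("input", (seq_len, channels))]
    length, in_ch, params = seq_len, channels, 0
    for block in (1, 2):
        # Conv1D: kernel * in_channels * filters weights plus one bias per filter
        length = window_out(length, kernel)
        params += kernel * in_ch * filters + filters
        shapes.append((f"conv{block}", (length, filters)))
        # Max pooling has no parameters; dropout changes neither shape nor count
        length = window_out(length, pool, pool_stride)
        shapes.append((f"pool{block}", (length, filters)))
        in_ch = filters
    flat = length * filters
    shapes.append(("flatten", (flat,)))
    params += flat * dense_units + dense_units   # 64-node dense layer
    shapes.append(("dense", (dense_units,)))
    params += dense_units + 1                    # single sigmoid output node
    shapes.append(("output", (1,)))
    return shapes, params
```
]]></preformat>
<p>Under these assumptions the sequence length shrinks from 99 to 95, 46, 42, and 20 positions across the two convolution and pooling blocks, so the flatten layer produces 320 features.</p>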
</sec>
<sec id="sec9">
<title>Feature encoding techniques</title>
<p>A DNA sequence is composed of the <italic>A</italic>, <italic>C</italic>, <italic>G</italic>, and <italic>T</italic> nucleotides. To perform computational operations, the sequence must be translated into a numerical representation. Feature encoding schemes play a vital role in creating optimal predictors. The input size should be the same for all sequences. We applied zero padding so that every DNA sequence has an equal length of 99&#x2009;bp. This technique was previously applied by DPProm (<xref ref-type="bibr" rid="ref25">Wang et al., 2022</xref>). In this study, we searched for the best feature encoding technique among 10 different techniques. The details of each encoding scheme are presented below.</p>
<sec id="sec10">
<title>One-hot feature encoding</title>
<p>One-hot encoding is used by many state-of-the-art bioinformatics tools (<xref ref-type="bibr" rid="ref23">Umarov and Solovyev, 2017</xref>; <xref ref-type="bibr" rid="ref12">Liu and Li, 2019</xref>; <xref ref-type="bibr" rid="ref18">Shujaat et al., 2021</xref>; <xref ref-type="bibr" rid="ref9">Kim et al., 2022</xref>). Each nucleotide in a DNA sequence is represented by a four-dimensional vector of zeros with a single one. Nucleotide <italic>A</italic> is encoded as (1,0,0,0), <italic>C</italic> as (0,1,0,0), <italic>G</italic> as (0,0,1,0), and <italic>T</italic> as (0,0,0,1). Each DNA sequence can therefore be represented by a two-dimensional matrix of shape (99, 4).</p>
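<p>A minimal sketch of this encoding is given below. The function name is hypothetical, and three details are assumptions rather than statements from the text: padding is appended at the end of the sequence, padded positions are encoded as all-zero rows, and any symbol other than A, C, G, or T is also mapped to an all-zero row.</p>
<preformat><![CDATA[
```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
PAD = (0, 0, 0, 0)  # assumed encoding for padded or unknown positions


def one_hot_encode(seq, max_len=99):
    """Encode a DNA string as max_len rows of four values each."""
    seq = seq.upper()
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than {max_len} bp")
    rows = [ONE_HOT.get(nt, PAD) for nt in seq]
    rows.extend([PAD] * (max_len - len(seq)))  # zero padding at the end
    return rows
```
]]></preformat>
<p>The returned list of 99 four-element rows can be converted directly to an array of shape (99, 4) for use as CNN input.</p>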
</sec>
<sec id="sec11">
<title>Nucleotide chemical property feature encoding</title>
<p>The chemical characteristics of the four DNA nucleotides differ (<xref ref-type="bibr" rid="ref7">Jeong et al., 2014</xref>). Nucleotides can be classified according to three chemical properties: hydrogen-bond strength, base type, and functional group. Purines, which have two rings, are represented by the letters <italic>A</italic> and <italic>G</italic>, whereas pyrimidines, which have one ring, are represented by the letters <italic>C</italic> and <italic>T</italic>. The hydrogen bonds between <italic>A</italic> and <italic>T</italic> are weak, whereas the hydrogen bonds between <italic>C</italic> and <italic>G</italic> are strong. In terms of functional groups, the amino group includes <italic>A</italic> and <italic>C</italic>, whereas the keto group includes <italic>G</italic> and <italic>T</italic>. Each nucleotide in a DNA sequence is represented by a three-dimensional vector (<italic>b</italic>, <italic>c</italic>, <italic>p</italic>) based on these chemical properties, where <inline-formula>
<mml:math id="M2">
<mml:mrow>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> denotes the nucleotide <italic>n</italic> at position <italic>i</italic>; hence, <italic>b</italic>, <italic>c</italic>, and <italic>p</italic> are computed as follows:</p>
<disp-formula id="E2">
<mml:math id="M3">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mspace width="0.25em"/>
<mml:mspace width="0.25em"/>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mspace width="thickmathspace"/>
<mml:mi mathvariant="normal">if</mml:mi>
<mml:mspace width="thickmathspace"/>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mspace width="thickmathspace"/>
<mml:mi mathvariant="normal">if</mml:mi>
<mml:mspace width="thickmathspace"/>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi mathvariant="normal">,</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mspace width="thickmathspace"/>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mspace width="thickmathspace"/>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mo>=</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mspace width="thickmathspace"/>
<mml:mi mathvariant="normal">if</mml:mi>
<mml:mspace width="thickmathspace"/>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>G</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mspace width="thickmathspace"/>
<mml:mi mathvariant="normal">if</mml:mi>
<mml:mspace width="thickmathspace"/>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi mathvariant="normal">,</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mspace width="thickmathspace"/>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mspace width="thickmathspace"/>
<mml:mi mathvariant="normal">if</mml:mi>
<mml:mspace width="thickmathspace"/>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mspace width="thickmathspace"/>
<mml:mi mathvariant="normal">if</mml:mi>
<mml:mspace width="thickmathspace"/>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>G</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mspace width="thickmathspace"/>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
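<p>The mapping defined by the equation above can be written as a short lookup. The sketch below is illustrative only; the function and set names are hypothetical, and rejecting symbols other than A, C, G, and T is an assumption, since the text does not say how such symbols are handled.</p>
<preformat><![CDATA[
```python
AMINO = {"A", "C"}    # b = 1: amino group (G and T are keto)
PURINE = {"A", "G"}   # c = 1: purine (C and T are pyrimidines)
WEAK = {"A", "T"}     # p = 1: weak hydrogen bond (C and G are strong)


def ncp_encode(seq):
    """Return one (b, c, p) triple per nucleotide of seq."""
    encoded = []
    for nt in seq.upper():
        if nt not in {"A", "C", "G", "T"}:
            raise ValueError(f"unexpected symbol: {nt!r}")
        encoded.append((int(nt in AMINO), int(nt in PURINE), int(nt in WEAK)))
    return encoded
```
]]></preformat>
<p>Under this mapping, <italic>A</italic>, <italic>C</italic>, <italic>G</italic>, and <italic>T</italic> are encoded as (1,1,1), (1,0,0), (0,1,0), and (0,0,1), respectively.</p>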
</sec>
<sec id="sec12">
<title>Dinucleotide-based auto-cross covariance feature encoding</title>
<p>DACC is a combination of dinucleotide-based auto-covariance (DAC) and dinucleotide-based cross covariance (DCC) encoding. DAC computes the correlation of the same physicochemical index between two dinucleotides separated by a lag distance along the sequence. DAC is calculated as:</p>
<disp-formula id="E3">
<mml:math id="M4">
<mml:mrow>
<mml:mi mathvariant="normal">DAC</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mi mathvariant="normal">,lag</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="normal">lag</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>P</mml:mi>
<mml:mo>&#x21BC;</mml:mo>
</mml:mover>
<mml:mi>u</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">lag</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">lag</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>P</mml:mi>
<mml:mo>&#x21BC;</mml:mo>
</mml:mover>
<mml:mi>u</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="normal">lag</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M5">
<mml:mi>u</mml:mi>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math id="M6">
<mml:mi>L</mml:mi>
</mml:math>
</inline-formula> represent the physicochemical index and length of the sequence, respectively, and the physicochemical index <inline-formula>
<mml:math id="M7">
<mml:mi>u</mml:mi>
</mml:math>
</inline-formula> for the dinucleotide <inline-formula>
<mml:math id="M8">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> at position <inline-formula>
<mml:math id="M9">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula> is expressed numerically as <inline-formula>
<mml:math id="M10">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. <inline-formula>
<mml:math id="M11">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>P</mml:mi>
<mml:mo>&#x21BC;</mml:mo>
</mml:mover>
<mml:mi>u</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the average value of the physicochemical index <inline-formula>
<mml:math id="M12">
<mml:mi>u</mml:mi>
</mml:math>
</inline-formula> along the whole sequence, and is calculated as:</p>
<disp-formula id="E4">
<mml:math id="M13">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>P</mml:mi>
<mml:mo>&#x21BC;</mml:mo>
</mml:mover>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The DAC feature vector has a dimension of <inline-formula>
<mml:math id="M14">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi mathvariant="normal">LAG</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, where LAG is the maximum lag (lag&#x2009;=&#x2009;1, 2,&#x2026;, LAG) and <italic>N</italic> is the total number of physicochemical indices. DCC computes the correlation of two different physicochemical indices between two dinucleotides along the sequence separated by <italic>lag</italic> nucleic acids. Mathematically, DCC can be represented as</p>
<disp-formula id="E5">
<mml:math id="M15">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">DCC</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal">,lag</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="normal">lag</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>P</mml:mi>
<mml:mo>&#x21BC;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">                                        </mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">lag</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">lag</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi>P</mml:mi>
<mml:mo>&#x21BC;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="normal">lag</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
<p>where<inline-formula>
<mml:math id="M16">
<mml:mrow>
<mml:mspace width="thickmathspace"/>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mspace width="thickmathspace"/>
<mml:mi mathvariant="normal">and</mml:mi>
<mml:mspace width="thickmathspace"/>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> represent the physicochemical indices and length of the nucleotide sequence, respectively, <inline-formula>
<mml:math id="M17">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the numerical value of the physicochemical index <inline-formula>
<mml:math id="M18">
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> for the dinucleotide <inline-formula>
<mml:math id="M19">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> at position <inline-formula>
<mml:math id="M20">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula>, and <inline-formula>
<mml:math id="M21">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>P</mml:mi>
<mml:mo>&#x21BC;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the average value for the physicochemical index <inline-formula>
<mml:math id="M22">
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<italic>a</italic>&#x2009;=&#x2009;1, 2) along the whole sequence, calculated as:</p>
<disp-formula id="E6">
<mml:math id="M23">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>P</mml:mi>
<mml:mo>&#x21BC;</mml:mo>
</mml:mover>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>u</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The DCC feature vector has dimensions of <inline-formula>
<mml:math id="M24">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi mathvariant="normal">LAG</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, where LAG is the maximum lag (lag&#x2009;=&#x2009;1, 2, &#x2026;, LAG) and <italic>N</italic> is the total number of physicochemical indices. Concatenating the DAC and DCC features gives the DACC encoding, whose dimension is therefore <italic>N</italic>&#x2009;&#x00D7;&#x2009;LAG&#x2009;+&#x2009;<italic>N</italic>&#x2009;&#x00D7;&#x2009;(<italic>N</italic>&#x2009;&#x2212;&#x2009;1)&#x2009;&#x00D7;&#x2009;LAG&#x2009;=&#x2009;<italic>N</italic>&#x2009;&#x00D7;&#x2009;<italic>N</italic>&#x2009;&#x00D7;&#x2009;LAG.</p>
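<p>As a reading aid, the DAC and DCC definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: the two index tables hold made-up placeholder values rather than published physicochemical indices, and all function and variable names are hypothetical.</p>
<preformat>
```python
from itertools import product

# Hypothetical index tables: two made-up "physicochemical indices" per dinucleotide.
# Real analyses would use published property tables; these numbers are placeholders.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
INDEX_TABLE = {
    "idx_a": {d: float(i) for i, d in enumerate(DINUCLEOTIDES)},
    "idx_b": {d: float((7 * i) % 16) for i, d in enumerate(DINUCLEOTIDES)},
}


def index_profile(seq, table):
    """Return P_u(R_i R_{i+1}) for the L - 1 overlapping dinucleotides of seq."""
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def covariance(profile_1, profile_2, lag):
    """Lagged covariance of two index profiles, following the DAC/DCC formulas."""
    mean_1 = sum(profile_1) / len(profile_1)
    mean_2 = sum(profile_2) / len(profile_2)
    n_terms = len(profile_1) - lag  # equals L - lag - 1
    total = sum(
        (profile_1[i] - mean_1) * (profile_2[i + lag] - mean_2)
        for i in range(n_terms)
    )
    return total / n_terms


def dacc(seq, tables, max_lag):
    """Concatenate DAC (same index) and DCC (ordered pairs of different indices)."""
    profiles = {name: index_profile(seq, t) for name, t in tables.items()}
    names = list(profiles)
    dac = [covariance(profiles[u], profiles[u], lag)
           for u in names for lag in range(1, max_lag + 1)]
    dcc = [covariance(profiles[u1], profiles[u2], lag)
           for u1 in names for u2 in names if u1 != u2
           for lag in range(1, max_lag + 1)]
    return dac + dcc


features = dacc("ACGTTGCAAGCT", INDEX_TABLE, max_lag=3)
print(len(features))  # N * N * LAG with N = 2 indices and LAG = 3
```
</preformat>
<p>With two indices and a maximum lag of 3, the sketch yields 6 DAC terms and 6 DCC terms, in line with the dimension formulas given above.</p>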
</sec>
<sec id="sec13">
<title>Pseudo dinucleotide composition</title>
<p>PseDNC encoding incorporates both contiguous local and global sequence order information into a feature vector of the nucleotide sequence. PseDNC is mathematically defined as follows:</p>
<disp-formula id="E7">
<mml:math id="M25">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mn>16</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mn>16</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mn>16</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>with each component defined as:</p>
<disp-formula id="E8">
<mml:math id="M26">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>16</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BB;</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mn>16</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>16</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>16</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BB;</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>17</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mn>16</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mspace width="thickmathspace"/>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M27">
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<italic>k</italic>&#x2009;=&#x2009;1, 2,&#x2026;, 16) is the normalized frequency of dinucleotide occurrence in the nucleotide sequence, <inline-formula>
<mml:math id="M28">
<mml:mi>&#x03BB;</mml:mi>
</mml:math>
</inline-formula> represents the highest counted rank (or tier) of the correlation along the nucleotide sequence, <italic>w</italic> is the weight factor ranging from 0 to 1, and <inline-formula>
<mml:math id="M29">
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<italic>j</italic>&#x2009;=&#x2009;1,2,&#x2026;, <inline-formula>
<mml:math id="M30">
<mml:mi>&#x03BB;</mml:mi>
</mml:math>
</inline-formula>) is the <italic>j</italic>th correlation factor and is defined as</p>
<disp-formula id="E9">
<mml:math id="M31">
<mml:mrow>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mspace width="thickmathspace"/>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mspace width="thickmathspace"/>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mspace width="thickmathspace"/>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mspace width="thickmathspace"/>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mspace width="thickmathspace"/>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="thickmathspace"/>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x03BB;</mml:mi>
<mml:mo>&#x003C;</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>&#x2026;</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mspace width="thickmathspace"/>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>&#x03BB;</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mspace width="thickmathspace"/>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mspace width="thickmathspace"/>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The correlation function is given as follows:</p>
<disp-formula id="E10">
<mml:math id="M32">
<mml:mrow>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>&#x03BC;</mml:mi>
</mml:mfrac>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BC;</mml:mi>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>&#x03BC;</italic> is the number of physicochemical indices, <inline-formula>
<mml:math id="M33">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the numerical value of the <italic>u</italic>-th (<italic>u</italic>&#x2009;=&#x2009;1, 2, &#x2026;, <italic>&#x03BC;</italic>) physicochemical index of the dinucleotide <inline-formula>
<mml:math id="M34">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mspace width="thickmathspace"/>
</mml:mrow>
</mml:math>
</inline-formula>at position <inline-formula>
<mml:math id="M35">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math id="M36">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> represents the corresponding value of the dinucleotide <inline-formula>
<mml:math id="M37">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> at position <inline-formula>
<mml:math id="M38">
<mml:mi>j</mml:mi>
</mml:math>
</inline-formula>.</p>
</sec>
<sec>
<title>Pseudo <italic>k</italic>-tuple nucleotide composition (PseKNC)</title>
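<p>To make the PseDNC construction above concrete, the following Python sketch builds the feature vector from normalized frequencies and weighted correlation factors. It is only an illustration under stated assumptions: the two property tables contain made-up values, the function names are hypothetical, and the parameter <italic>k</italic> is exposed so that <italic>k</italic>&#x2009;=&#x2009;2 reproduces the 16&#x2009;+&#x2009;&#x03BB; PseDNC layout while other values follow the 4<sup><italic>k</italic></sup>&#x2009;+&#x2009;&#x03BB; layout of the <italic>k</italic>-tuple variant.</p>
<preformat>
```python
from itertools import product

# Hypothetical, made-up property values for illustration only; a real run would
# use published, standardized dinucleotide property tables.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
PROPERTIES = [
    {d: (i % 4) / 3.0 for i, d in enumerate(DINUCLEOTIDES)},
    {d: (i // 4) / 3.0 for i, d in enumerate(DINUCLEOTIDES)},
]


def correlation(d1, d2, properties):
    """Mean squared difference of the property values of two dinucleotides."""
    return sum((p[d1] - p[d2]) ** 2 for p in properties) / len(properties)


def theta(seq, j, properties):
    """j-th tier correlation factor: average over the L - j - 1 dinucleotide pairs."""
    n_pairs = len(seq) - j - 1
    return sum(
        correlation(seq[i:i + 2], seq[i + j:i + j + 2], properties)
        for i in range(n_pairs)
    ) / n_pairs


def pseknc(seq, k, lam, weight, properties):
    """Return 4**k + lam components: k-mer frequencies, then weighted thetas."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    n_windows = len(seq) - k + 1
    counts = {kmer: 0 for kmer in kmers}
    for i in range(n_windows):
        counts[seq[i:i + k]] += 1
    freqs = [counts[kmer] / n_windows for kmer in kmers]  # frequencies sum to 1
    thetas = [theta(seq, j, properties) for j in range(1, lam + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = pseknc("ACGTTGCAAGCTAGGC", k=2, lam=3, weight=0.5, properties=PROPERTIES)
print(len(vector))            # 16 frequency terms plus 3 correlation terms
print(round(sum(vector), 6))  # components share one denominator, so they sum to 1
```
</preformat>
<p>Because every component is divided by the same denominator, the weight factor only shifts the balance between the composition terms and the sequence-order terms.</p>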
<p>PseKNC encoding uses a k-tuple nucleotide composition defined as</p>
<disp-formula id="E11">
<mml:math id="M39">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>with each component defined as:</p>
<disp-formula id="E12">
<mml:math id="M40">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BB;</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BB;</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:mo>&#x003C;</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M41">
<mml:mi>&#x03BB;</mml:mi>
</mml:math>
</inline-formula> is the total number of ranks of correlations along a nucleotide sequence, <inline-formula>
<mml:math id="M42">
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the frequency of oligonucleotides, normalized so that <inline-formula>
<mml:math id="M43">
<mml:mrow>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <italic>w</italic> is the weight factor, and <inline-formula>
<mml:math id="M44">
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is defined as follows:</p>
<disp-formula id="E13">
<mml:math id="M45">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x0398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="normal">                               </mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>&#x03BB;</mml:mi>
<mml:mi mathvariant="normal">;</mml:mi>
<mml:mi>&#x03BB;</mml:mi>
<mml:mo>&#x003C;</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
<p>The correlation function is defined as:</p>
<disp-formula id="E14">
<mml:math id="M46">
<mml:mrow>
<mml:mi mathvariant="normal">&#x0398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>&#x03BC;</mml:mi>
</mml:mfrac>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>v</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BC;</mml:mi>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M47">
<mml:mi>&#x03BC;</mml:mi>
</mml:math>
</inline-formula> represents the number of physicochemical indices. <inline-formula>
<mml:math id="M48">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the numerical value of the <italic>v</italic>-th (<italic>v</italic>&#x2009;=&#x2009;1, 2, &#x2026;, <italic>&#x03BC;</italic>) physicochemical index of the dinucleotide <inline-formula>
<mml:math id="M49">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> at position <italic>i</italic> and <inline-formula>
<mml:math id="M50">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>v</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> represents the corresponding value of dinucleotide <inline-formula>
<mml:math id="M51">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> at position <italic>i</italic>&#x2009;+&#x2009;<italic>j</italic>.</p>
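<p>The correlation factor <italic>&#x03B8;<sub>j</sub></italic> and the correlation function &#x0398; defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this study: the property table <monospace>TOY_PROPS</monospace> and the function names are hypothetical, and the two property values per dinucleotide are made-up numbers that stand in for real, normalized physicochemical indices.</p>
<preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List

BASES = "ACGT"

# Hypothetical table: every dinucleotide gets mu = 2 made-up property values.
# A real run would load normalized physicochemical indices instead.
TOY_PROPS: Dict[str, List[float]] = {
    a + b: [0.1 * BASES.index(a) - 0.05 * BASES.index(b),
            0.2 * BASES.index(b) - 0.1 * BASES.index(a)]
    for a in BASES
    for b in BASES
}


def big_theta(d1: str, d2: str, props: Dict[str, List[float]]) -> float:
    """Correlation function: mean squared difference over the mu indices."""
    p1, p2 = props[d1], props[d2]
    return sum((x - y) ** 2 for x, y in zip(p1, p2)) / len(p1)


def theta(seq: str, j: int, props: Dict[str, List[float]]) -> float:
    """j-th correlation factor: average of big_theta over L - j - 1 pairs."""
    n_pairs = len(seq) - j - 1
    if 1 > j or 1 > n_pairs:
        raise ValueError("j must satisfy 1 <= j <= L - 2")
    total = 0.0
    # 0-based i here corresponds to i = 1, ..., L - j - 1 in the equation.
    for i in range(n_pairs):
        total += big_theta(seq[i:i + 2], seq[i + j:i + j + 2], props)
    return total / n_pairs
```
]]></preformat>
<p>For a homopolymer such as AAAA, every pair of dinucleotides is identical, so the sketch returns a correlation factor of zero for any valid <italic>j</italic>.</p>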
</sec>
<sec id="sec14">
<title>Electron-ion interaction pseudopotentials of trinucleotide</title>
<p>The electron-ion interaction pseudopotential (EIIP) values of the nucleotides <italic>A</italic>, <italic>C</italic>, <italic>G</italic>, and <italic>T</italic> were taken as previously described by Nair (<xref ref-type="bibr" rid="ref11">Lavigne et al., 2004</xref>; <italic>A</italic>: 0.1260, <italic>C</italic>: 0.1340, <italic>G</italic>: 0.0806, <italic>T</italic>: 0.1335). In EIIP encoding, each nucleotide in the DNA sequence is represented directly by its EIIP value. In PseEIIP encoding, EIIP<sub>A</sub>, EIIP<sub>T</sub>, EIIP<sub>G</sub>, and EIIP<sub>C</sub> denote the EIIP values of nucleotides <italic>A</italic>, <italic>T</italic>, <italic>G</italic>, and <italic>C</italic>, respectively, and a feature vector is created from the mean EIIP value of the trinucleotides in each sample, as follows:</p>
<disp-formula id="E15">
<mml:math id="M52">
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x00B7;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>E</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x00B7;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>E</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>I</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x00B7;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
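<p>A minimal Python sketch of the feature vector <italic>V</italic> is given here for illustration. It rests on two assumptions that are not stated in the text: the EIIP value of a trinucleotide is taken as the sum of the EIIP values of its three nucleotides, and <italic>f</italic> is the trinucleotide count divided by the number of valid overlapping trinucleotide windows. The function name <monospace>pse_eiip</monospace> is hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import product
from typing import List

# EIIP values of the four nucleotides as quoted in the text.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

# The 64 trinucleotides in the order AAA, AAC, ..., TTT.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def pse_eiip(seq: str) -> List[float]:
    """Return the 64-dimensional vector [EIIP_xyz * f_xyz] for one sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    n_windows = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # windows with ambiguous bases such as N are skipped
            counts[tri] += 1
            n_windows += 1
    if n_windows == 0:
        raise ValueError("sequence contains no valid trinucleotide")
    return [
        # Assumption: EIIP_xyz is the sum of the three single-base EIIP values.
        (EIIP[t[0]] + EIIP[t[1]] + EIIP[t[2]]) * counts[t] / n_windows
        for t in TRINUCLEOTIDES
    ]
```
]]></preformat>
<p>Because the vector has a fixed length of 64 regardless of sequence length, it can be fed directly to the classical ML baselines.</p>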
</sec>
<sec id="sec15">
<title>Parallel correlation pseudo dinucleotide composition</title>
<p>PCPseDNC encoding is similar to PseDNC encoding but uses 38 default physicochemical indices for DNA instead of the six indices used in PseDNC encoding. <xref ref-type="supplementary-material" rid="SM1">Supplementary Table S2</xref> in <xref ref-type="supplementary-material" rid="SM1">Supplementary file</xref> lists the 38 physicochemical indices.</p>
</sec>
<sec id="sec16">
<title>Parallel correlation pseudo trinucleotide composition</title>
<p>PCPseTNC encoding is described as:</p>
<disp-formula id="E16">
<mml:math id="M53">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mn>64</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mn>64</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">,s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>64</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">&#x03BB;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where each element is given by:</p>
<disp-formula id="E17">
<mml:math id="M54">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>64</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BB;</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mn>64</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>64</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>64</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BB;</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>65</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mn>64</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mspace width="thickmathspace"/>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M55">
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<italic>k</italic>&#x2009;=&#x2009;1, 2,&#x2026;, 64) is the normalized occurrence frequency of the <italic>k</italic>th trinucleotide in the nucleotide sequence, <inline-formula>
<mml:math id="M56">
<mml:mi>&#x03BB;</mml:mi>
</mml:math>
</inline-formula> represents the highest counted rank (or tier) of the correlation along the nucleotide sequence, <italic>w</italic> is the weight factor ranging from 0 to 1, and <inline-formula>
<mml:math id="M57">
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (<italic>j</italic>&#x2009;=&#x2009;1,2,&#x2026;, <inline-formula>
<mml:math id="M58">
<mml:mi>&#x03BB;</mml:mi>
</mml:math>
</inline-formula>) is the <italic>j</italic>th correlation factor and is defined as:</p>
<disp-formula id="E18">
<mml:math id="M59">
<mml:mrow>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x0398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x0398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x0398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mspace width="0.25em"/>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x03BB;</mml:mi>
<mml:mo>&#x003C;</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>&#x03BB;</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>&#x0398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>&#x03BB;</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The correlation function is defined as:</p>
<disp-formula id="E19">
<mml:math id="M60">
<mml:mrow>
<mml:mi>&#x0398;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>&#x03BC;</mml:mi>
</mml:mfrac>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>&#x03BC;</mml:mi>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>&#x03BC;</italic> is the number of physicochemical indices, <inline-formula>
<mml:math id="M61">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the numerical value of the <italic>u</italic>-th (<italic>u</italic>&#x2009;=&#x2009;1, 2, &#x2026;, <italic>&#x03BC;</italic>) physicochemical index of the trinucleotide <inline-formula>
<mml:math id="M62">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> at position <inline-formula>
<mml:math id="M63">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math id="M64">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> represents the corresponding value of the trinucleotide <inline-formula>
<mml:math id="M65">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> at position <inline-formula>
<mml:math id="M66">
<mml:mi>j</mml:mi>
</mml:math>
</inline-formula>.</p>
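<p>The PCPseTNC equations above can be combined into one short routine, sketched here in Python. It is an illustration under stated assumptions, not the code used in this study: <monospace>toy_props</monospace> is a hypothetical stand-in for the real trinucleotide index table (it returns GC content and purine content, so <italic>&#x03BC;</italic>&#x2009;=&#x2009;2), the function names are invented, and the input is assumed to contain only the uppercase letters A, C, G, and T.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import product
from typing import Callable, List

TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def toy_props(tri: str) -> List[float]:
    """Hypothetical placeholder for real indices: GC and purine content."""
    return [sum(b in "GC" for b in tri) / 3, sum(b in "AG" for b in tri) / 3]


def big_theta(t1: str, t2: str, props: Callable[[str], List[float]]) -> float:
    """Correlation function: mean squared difference over the mu indices."""
    p1, p2 = props(t1), props(t2)
    return sum((x - y) ** 2 for x, y in zip(p1, p2)) / len(p1)


def pc_pse_tnc(seq: str, lam: int, w: float,
               props: Callable[[str], List[float]] = toy_props) -> List[float]:
    """Return the (64 + lam)-dimensional vector S for one sequence."""
    n = len(seq)
    if 1 > lam or lam > n - 3:
        raise ValueError("lam must satisfy 1 <= lam <= L - 3")
    tris = [seq[i:i + 3] for i in range(n - 2)]  # overlapping trinucleotides
    freqs = [tris.count(t) / len(tris) for t in TRINUCLEOTIDES]
    thetas = []
    for j in range(1, lam + 1):
        n_pairs = n - 2 - j  # number of terms in the sum for theta_j
        total = sum(big_theta(tris[i], tris[i + j], props) for i in range(n_pairs))
        thetas.append(total / n_pairs)
    denom = sum(freqs) + w * sum(thetas)  # shared denominator of both cases
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```
]]></preformat>
<p>Both cases of the piecewise definition share one denominator, so the elements of the returned vector always sum to one; this is a convenient sanity check when swapping in a real index table.</p>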
</sec>
<sec id="sec17">
<title>Moran correlation</title>
<p>The distribution of amino acid characteristics along the sequence is used to create autocorrelation descriptors (<xref ref-type="bibr" rid="ref6">Horne, 1988</xref>; <xref ref-type="bibr" rid="ref4">Feng and Zhang, 2000</xref>; <xref ref-type="bibr" rid="ref22">Sokal and Thomson, 2006</xref>). The amino acid properties used here are different types of amino acid indices retrieved from the AAindex Database (<xref ref-type="bibr" rid="ref8">Kawashima et al., 2008</xref>) available at <ext-link xlink:href="http://www.genome.jp/dbget/aaindex.html" ext-link-type="uri">http://www.genome.jp/dbget/aaindex.html</ext-link>.</p>
</sec>
<sec id="sec18">
<title>kmer</title>
<p>In the kmer descriptor, DNA sequences are represented as the occurrence frequencies of <italic>k</italic> neighboring nucleic acids, an approach that has been used effectively for human gene regulatory sequence prediction. The kmer descriptor (<italic>k</italic>&#x2009;=&#x2009;3) is calculated as follows:</p>
<disp-formula id="E20">
<mml:math id="M67">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mi>t</mml:mi>
<mml:mspace width="thickmathspace"/>
<mml:mo>&#x2208;</mml:mo>
<mml:mspace width="thickmathspace"/>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>C</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>G</mml:mi>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mo>&#x2026;</mml:mo>
<mml:mi mathvariant="normal">,</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mspace width="thickmathspace"/>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M68">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> represents the number of occurrences of kmer type <italic>t</italic> and <italic>N</italic> is the length of the nucleotide sequence.</p>
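<p>As an illustration, the kmer descriptor can be computed with the Python standard library as sketched below. The function name <monospace>kmer_frequencies</monospace> is hypothetical. Following the definition above, counts are divided by the sequence length <italic>N</italic>; because a sequence of length <italic>N</italic> contains only <italic>N</italic>&#x2009;&#x2212;&#x2009;<italic>k</italic>&#x2009;+&#x2009;1 overlapping windows, the resulting frequencies sum to slightly less than one.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(seq: str, k: int = 3) -> Dict[str, float]:
    """f(t) = N(t) / N, counting overlapping windows; N is the sequence length."""
    seq = seq.upper()
    n = len(seq)
    if k > n:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(n - k + 1))
    # Enumerate all 4**k kmers so the descriptor has a fixed length and order.
    return {"".join(t): counts["".join(t)] / n for t in product("ACGT", repeat=k)}
```
]]></preformat>
<p>For the sequence AAAC with <italic>k</italic>&#x2009;=&#x2009;3, the two windows AAA and AAC each occur once, so both receive a frequency of 1/4 and all other entries are zero.</p>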
</sec>
</sec>
<sec id="sec19">
<title>Baseline models</title>
<p>Selection of the optimal model is a vital step in developing a novel predictor. We evaluated several ML and CNN models and selected the best one based on performance measures. The ML models include the AdaBoost (AdB) classifier, multinomial naive Bayes, extreme gradient boosting (XGBoost), gradient boosting (Gboost), logistic regression (LR), K-nearest neighbor, decision tree classifier, support vector machine (SVM), multilayer perceptron classifier, and SVM bagging. The CNN is composed of two convolution layers. We used hyperparameter tuning to determine the best convolution, pooling, dropout, and dense layer parameters.</p>
</sec>
</sec>
</sec>
<sec id="sec20">
<title>Performance measures</title>
<p>In this section, we explain the evaluation metrics, selection of the best model and feature encoding scheme, model performance, and model comparison.</p>
<sec id="sec21">
<title>Evaluation metrics</title>
<p>To assess performance, we used accuracy (Acc), sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC). These metrics have been used in several cutting-edge studies and are defined by the following equations:</p>
<disp-formula id="E21">
<mml:math id="M69">
<mml:mrow>
<mml:mi mathvariant="normal">Acc</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">TN</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E22">
<mml:math id="M70">
<mml:mrow>
<mml:mi mathvariant="normal">Sn</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E23">
<mml:math id="M71">
<mml:mrow>
<mml:mi mathvariant="normal">Sp</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">TN</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FP</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E24">
<mml:math id="M72">
<mml:mrow>
<mml:mi mathvariant="normal">MCC</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
<mml:mo>&#x2217;</mml:mo>
<mml:mi mathvariant="normal">TN</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="normal">FP</mml:mi>
<mml:mo>&#x2217;</mml:mo>
<mml:mi mathvariant="normal">FN</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FP</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FN</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FP</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FN</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The terms TP, TN, FP, and FN in the equations above represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.</p>
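<p>The four metrics can be computed from a confusion matrix as in the following Python sketch. The function name <monospace>evaluation_metrics</monospace> is hypothetical. The sketch uses the standard Matthews correlation coefficient, whose denominator is the square root of the product of the four marginal sums, and it assumes that each class contains at least one sample so that Sn and Sp are defined.</p>
<preformat preformat-type="code"><![CDATA[
```python
from math import sqrt
from typing import Dict


def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute Acc, Sn, Sp, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)  # sensitivity: fraction of true promoters recovered
    sp = tn / (tn + fp)  # specificity: fraction of non-promoters rejected
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention chosen for this sketch: report MCC = 0 if a marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}
```
]]></preformat>
<p>A perfect classifier gives an MCC of 1, whereas a classifier whose predictions are independent of the labels gives an MCC of 0; Acc, Sn, and Sp lie between 0 and 1 and are reported as percentages in the text.</p>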
</sec>
<sec id="sec22">
<title>Selection of best model and feature encoding</title>
<p>To obtain an optimal model, we evaluated all the encoding schemes described above with each of the baseline models. <xref ref-type="supplementary-material" rid="SM1">Supplementary Tables S3, S4</xref> in <xref ref-type="supplementary-material" rid="SM1">Supplementary file</xref>, and <xref rid="fig4" ref-type="fig">Figures 4</xref>, <xref rid="fig5" ref-type="fig">5</xref> illustrate the performance of each method on the various encoding schemes for the first and second layers. For the first layer, the CNN with one-hot encoding performed best, followed by AdB with PseKNC feature encoding. For the second layer, almost every feature encoding scheme performed well with the ML and CNN algorithms, but the CNN with one-hot encoding again performed best. Therefore, based on the performance evaluation, we chose the CNN with one-hot encoding for both layers of the proposed tool, &#x201C;iProm-phage.&#x201D;</p>
<fig position="float" id="fig4">
<label>Figure 4</label>
<caption>
<p>Accuracy of the first-layer baseline models.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g004.tif"/>
</fig>
<fig position="float" id="fig5">
<label>Figure 5</label>
<caption>
<p>Accuracy of the second-layer baseline models.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g005.tif"/>
</fig>
</sec>
<sec id="sec23">
<title>Model performance</title>
<p>The prediction performance of iProm-phage was evaluated using 5-fold cross-validation. We employed the same evaluation parameters used for selecting the best model and also considered the ROC curves. The first layer of iProm-phage achieved an Acc of 95.68 93.47%, Sn of 96.12%, Sp of 92.63%, MCC of 0.872, and AUROC of 0.99 during cross-validation. These findings suggest that our predictor can reliably recognize whether a query sequence is a promoter. The second layer of iProm-phage achieved values of 97.25%, 94.32%, 98.5%, 0.8619, and 0.97, respectively. On the test dataset, the first layer achieved an Acc of 94.2%, Sn of 90%, Sp of 90%, and MCC of 0.88. The second layer obtained values of 95.2%, 94.37%, 97.14%, and 0.88 for the same four metrics on the test dataset. <xref rid="fig6" ref-type="fig">Figures 6</xref>, <xref rid="fig7" ref-type="fig">7</xref> depict the ROC curves for both layers of the iProm-phage model.</p>
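<p>The 5-fold cross-validation mentioned above can be sketched as follows. The text does not describe how the folds were built, so the shuffling, the fixed seed, and the helper name <italic>k_fold_indices</italic> are assumptions for illustration only. Each sample is held out for validation exactly once, and the reported metric would be the average over the five folds.</p>
<preformat><![CDATA[
```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded shuffle for reproducibility
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


# Hypothetical dataset of 23 samples, for illustration only.
splits = list(k_fold_indices(n_samples=23, k=5))
for fold_id, (train, val) in enumerate(splits, start=1):
    print(f"fold {fold_id}: {len(train)} training, {len(val)} validation samples")
```
]]></preformat>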
<fig position="float" id="fig6">
<label>Figure 6</label>
<caption>
<p>First layer ROC curve.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g006.tif"/>
</fig>
<fig position="float" id="fig7">
<label>Figure 7</label>
<caption>
<p>Second layer ROC curve.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g007.tif"/>
</fig>
</sec>
<sec id="sec24">
<title>Comparison with existing models</title>
<p>We compared iProm-phage with the state-of-the-art promoter identification tools PhagePromoter and DPProm for classifying query sequences as promoters or non-promoters. We measured precision and recall for both layers to compare them with these methods. Precision and recall are defined as follows:</p>
<disp-formula id="E25">
<mml:math id="M73">
<mml:mrow>
<mml:mi mathvariant="normal">Recall</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="E26">
<mml:math id="M74">
<mml:mrow>
<mml:mi mathvariant="normal">Precision</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">FP</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
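<p>As a worked illustration of the two definitions above, the following sketch counts TP, FP, and FN from paired true and predicted labels and then applies the formulas. The label vectors and the helper name <italic>precision_recall</italic> are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the text.</p>
<preformat><![CDATA[
```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 1 = promoter, 0 = non-promoter.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```
]]></preformat>
<p>In this toy example there are three true positives, one false positive, and one false negative, so both precision and recall equal 3/4.</p>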
<p>A performance comparison of the methods used for promoter identification is presented in <xref rid="tab2" ref-type="table">Table 2</xref>. iProm-phage outperformed both tools on all reported performance metrics for this task.</p>
<table-wrap position="float" id="tab2">
<label>Table 2</label>
<caption>
<p>First layer performance comparison.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">
<bold>Methods</bold>
</th>
<th align="center" valign="top">
<bold>Acc%</bold>
</th>
<th align="center" valign="top">
<bold>Precision%</bold>
</th>
<th align="center" valign="top">
<bold>Recall%</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top" char=".">PhagePromoter</td>
<td align="center" valign="top">92</td>
<td align="center" valign="top">89</td>
<td align="center" valign="top">87</td>
</tr>
<tr>
<td align="left" valign="top" char=".">DPProm</td>
<td align="center" valign="top">85.5</td>
<td align="center" valign="top">88.9</td>
<td align="center" valign="top">83</td>
</tr>
<tr>
<td align="left" valign="top" char=".">iProm-phage</td>
<td align="center" valign="top">95.68</td>
<td align="center" valign="top">94.2</td>
<td align="center" valign="top">93.5</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref rid="tab3" ref-type="table">Table 3</xref> compares iProm-phage with DPProm for classifying promoters as phage or host promoters. iProm-phage outperformed DPProm on this classification task. The precision and recall of iProm-phage for promoter identification and classification were higher than those of DPProm, and the values were more consistent. Overall, iProm-phage scored higher than the state-of-the-art methods in all cases.</p>
<table-wrap position="float" id="tab3">
<label>Table 3</label>
<caption>
<p>Second layer performance comparison.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">
<bold>Methods</bold>
</th>
<th align="center" valign="top">
<bold>Acc%</bold>
</th>
<th align="center" valign="top">
<bold>Precision%</bold>
</th>
<th align="center" valign="top">
<bold>Recall%</bold>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top" char=".">DPProm</td>
<td align="center" valign="top">93.0</td>
<td align="center" valign="top">95.2</td>
<td align="center" valign="top">96.4</td>
</tr>
<tr>
<td align="left" valign="top" char=".">iProm-phage</td>
<td align="center" valign="top">95.2</td>
<td align="center" valign="top">96.5</td>
<td align="center" valign="top">97.2</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="sec25">
<title>Webserver</title>
<p>A web server hosting the high-performance iProm-phage tool is freely available at the following link<xref rid="fn0004" ref-type="fn"><sup>1</sup></xref> to give the scientific community easy access to the proposed tool. This approach has been adopted by several scholars (<xref ref-type="bibr" rid="ref3">Chantsalnyam et al., 2020</xref>; <xref ref-type="bibr" rid="ref2">Ali et al., 2022</xref>). iProm-phage is an easy-to-use tool intended for researchers and specialists in bioinformatics. It works in two stages: input and output. Sequences can be supplied in two ways: by entering them directly or by uploading a file containing the sequences for prediction. Each sequence should be 99&#x2009;bp long and consist of the letters <italic>A</italic>, <italic>C</italic>, <italic>G</italic>, and <italic>T</italic>. <xref rid="fig8" ref-type="fig">Figures 8</xref>, <xref rid="fig9" ref-type="fig">9</xref> show screenshots of the web server: <xref rid="fig8" ref-type="fig">Figure 8</xref> is an example of adding sequences for prediction, and <xref rid="fig9" ref-type="fig">Figure 9</xref> shows the predictor&#x2019;s output. We also provide an example to help users understand how to use the web server.</p>
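<p>The input constraints described above (99&#x2009;bp sequences over the letters A, C, G, and T) can be checked locally before submission. The sketch below assumes that uploaded files are in FASTA format, which the text does not specify, and uses hypothetical helper names; it does not reproduce the web server&#x2019;s own validation.</p>
<preformat><![CDATA[
```python
EXPECTED_LENGTH = 99   # sequence length stated in the text
ALLOWED = set("ACGT")  # permitted nucleotide letters


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_valid(seq):
    """True if the sequence has the expected length and only allowed letters."""
    return len(seq) == EXPECTED_LENGTH and set(seq) <= ALLOWED


# Hypothetical input: one valid record, one too short, one with invalid letters.
example = (
    ">ok\n" + "ACGT" * 24 + "ACG\n"
    ">short\nACGT\n"
    ">bad\n" + "N" * 99 + "\n"
)
for name, seq in parse_fasta(example):
    print(name, "valid" if is_valid(seq) else "rejected")
```
]]></preformat>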
<fig position="float" id="fig8">
<label>Figure 8</label>
<caption>
<p>Adding a query sequence on the web server.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g008.tif"/>
</fig>
<fig position="float" id="fig9">
<label>Figure 9</label>
<caption>
<p>Predictor output.</p>
</caption>
<graphic xlink:href="fmicb-13-1061122-g009.tif"/>
</fig>
</sec>
<sec id="sec26" sec-type="conclusions">
<title>Conclusion</title>
<p>This work presents iProm-phage, a two-layer technique for identifying phage promoters and classifying them as phage or host promoters. We developed a new method for generating negative datasets to create a robust model that performs well on challenging datasets. Based on performance evaluation, we also identified the best model among several ML and CNN algorithms, as well as the best feature encoding method among the 10 methods considered. The architecture of the proposed model was evaluated using publicly available datasets. Compared with earlier tools, iProm-phage achieved better overall results. Finally, we created a freely available web server, which we expect to be useful to experimental scientists.</p>
</sec>
<sec id="sec27" sec-type="data-availability">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>; further inquiries can be directed to the corresponding authors.</p>
</sec>
<sec id="sec28">
<title>Author contributions</title>
<p>MS: conceptualization, methodology, software, writing&#x2013;original draft, and writing&#x2013;review and editing. JJ: methodology and writing&#x2013;review and editing. HT: supervision and writing&#x2013;review and editing. KC: conceptualization, validation, supervision, writing&#x2013;review and editing, and funding acquisition. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="sec29" sec-type="funding-information">
<title>Funding</title>
<p>This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT; nos. 2020R1A2C2005612 and 2022R1G1A1004613). This work was supported by &#x201C;Human Resources Program in Energy Technology&#x201D; of the Korea Institute of Energy Technology Evaluation and Planning (KETEP), granted financial resource from the Ministry of Trade, Industry &#x0026; Energy, Republic of Korea (no. 20204010600470).</p>
</sec>
<sec id="conf1" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="sec100" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec id="sec31" sec-type="supplementary-material">
<title>Supplementary material</title>
<p>The Supplementary material for this article can be found online at: <ext-link xlink:href="https://www.frontiersin.org/articles/10.3389/fmicb.2022.1061122/full#supplementary-material" ext-link-type="uri">https://www.frontiersin.org/articles/10.3389/fmicb.2022.1061122/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.docx" id="SM1" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="ref1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ali</surname> <given-names>S. D.</given-names></name> <name><surname>Alam</surname> <given-names>W.</given-names></name> <name><surname>Tayara</surname> <given-names>H.</given-names></name> <name><surname>Chong</surname> <given-names>K.</given-names></name></person-group> (<year>2020</year>). <article-title>Identification of functional piRNAs using a convolutional neural network</article-title>. <source>IEEE/ACM Trans. Comput. Biol. Bioinform.</source> <volume>14</volume>:<fpage>1</fpage>. doi: <pub-id pub-id-type="doi">10.1109/tcbb.2020.3034313</pub-id></citation></ref>
<ref id="ref2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ali</surname> <given-names>S. D.</given-names></name> <name><surname>Alam</surname> <given-names>W.</given-names></name> <name><surname>Tayara</surname> <given-names>H.</given-names></name> <name><surname>Chong</surname> <given-names>K. T.</given-names></name></person-group> (<year>2022</year>). <article-title>Identification of functional piRNAs using a convolutional neural network</article-title>. <source>IEEE/ACM Trans. Comput. Biol. Bioinform.</source> <volume>19</volume>, <fpage>1661</fpage>&#x2013;<lpage>1669</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TCBB.2020.3034313</pub-id>, PMID: <pub-id pub-id-type="pmid">33119510</pub-id></citation></ref>
<ref id="ref3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chantsalnyam</surname> <given-names>T.</given-names></name> <name><surname>Lim</surname> <given-names>D. Y.</given-names></name> <name><surname>Tayara</surname> <given-names>H.</given-names></name> <name><surname>Chong</surname> <given-names>K. T.</given-names></name></person-group> (<year>2020</year>). <article-title>ncRDeep: non-coding RNA classification with convolutional neural network</article-title>. <source>Comput. Biol. Chem.</source> <volume>88</volume>:<fpage>107364</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.compbiolchem.2020.107364</pub-id>, PMID: <pub-id pub-id-type="pmid">32890916</pub-id></citation></ref>
<ref id="ref4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Feng</surname> <given-names>Z. P.</given-names></name> <name><surname>Zhang</surname> <given-names>C. T.</given-names></name></person-group> (<year>2000</year>). <article-title>Prediction of membrane protein types based on the hydrophobic index of amino acids</article-title>. <source>J. Protein Chem.</source> <volume>19</volume>, <fpage>269</fpage>&#x2013;<lpage>275</lpage>. doi: <pub-id pub-id-type="doi">10.1023/A:1007091128394</pub-id></citation></ref>
<ref id="ref5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guzina</surname> <given-names>J.</given-names></name> <name><surname>Djordjevic</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>Bioinformatics as a first-line approach for understanding bacteriophage transcription</article-title>. <source>Bacteriophage</source> <volume>5</volume>:<fpage>e1062588</fpage>. doi: <pub-id pub-id-type="doi">10.1080/21597081.2015.1062588</pub-id>, PMID: <pub-id pub-id-type="pmid">26442194</pub-id></citation></ref>
<ref id="ref6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Horne</surname> <given-names>D. S.</given-names></name></person-group> (<year>1988</year>). <article-title>Prediction of protein helix content from an autocorrelation analysis of sequence hydrophobicities</article-title>. <source>Biopolymers</source> <volume>27</volume>, <fpage>451</fpage>&#x2013;<lpage>477</lpage>. doi: <pub-id pub-id-type="doi">10.1002/bip.360270308</pub-id>, PMID: <pub-id pub-id-type="pmid">3359010</pub-id></citation></ref>
<ref id="ref7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jeong</surname> <given-names>B.-S.</given-names></name> <name><surname>Golam Bari</surname> <given-names>A. T. M.</given-names></name> <name><surname>Rokeya Reaz</surname> <given-names>M.</given-names></name> <name><surname>Jeon</surname> <given-names>S.</given-names></name> <name><surname>Lim</surname> <given-names>C.-G.</given-names></name> <name><surname>Choi</surname> <given-names>H.-J.</given-names></name></person-group> (<year>2014</year>). <article-title>Codon-based encoding for DNA sequence analysis</article-title>. <source>Methods</source> <volume>67</volume>, <fpage>373</fpage>&#x2013;<lpage>379</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ymeth.2014.01.016</pub-id>, PMID: <pub-id pub-id-type="pmid">24530970</pub-id></citation></ref>
<ref id="ref8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kawashima</surname> <given-names>S.</given-names></name> <name><surname>Pokarowski</surname> <given-names>P.</given-names></name> <name><surname>Pokarowska</surname> <given-names>M.</given-names></name> <name><surname>Kolinski</surname> <given-names>A.</given-names></name> <name><surname>Katayama</surname> <given-names>T.</given-names></name> <name><surname>Kanehisa</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>AAindex: amino acid index database, progress report 2008</article-title>. <source>Nucleic Acids Res.</source> <volume>36</volume>, <fpage>D202</fpage>&#x2013;<lpage>D205</lpage>. doi: <pub-id pub-id-type="doi">10.1093/nar/gkm998</pub-id>, PMID: <pub-id pub-id-type="pmid">17998252</pub-id></citation></ref>
<ref id="ref9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>J.</given-names></name> <name><surname>Shujaat</surname> <given-names>M.</given-names></name> <name><surname>Tayara</surname> <given-names>H.</given-names></name></person-group> (<year>2022</year>). <article-title>iProm-Zea: a two-layer model to identify plant promoters and their types using convolutional neural network</article-title>. <source>Genomics</source> <volume>114</volume>:<fpage>110384</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.ygeno.2022.110384</pub-id>, PMID: <pub-id pub-id-type="pmid">35533969</pub-id></citation></ref>
<ref id="ref10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klucar</surname> <given-names>L.</given-names></name> <name><surname>Stano</surname> <given-names>M.</given-names></name> <name><surname>Hajduk</surname> <given-names>M.</given-names></name></person-group> (<year>2010</year>). <article-title>phiSITE: database of gene regulation in bacteriophages</article-title>. <source>Nucleic Acids Res.</source> <volume>38</volume>, <fpage>D366</fpage>&#x2013;<lpage>D370</lpage>. doi: <pub-id pub-id-type="doi">10.1093/nar/gkp911</pub-id>, PMID: <pub-id pub-id-type="pmid">19900969</pub-id></citation></ref>
<ref id="ref11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lavigne</surname> <given-names>R.</given-names></name> <name><surname>Sun</surname> <given-names>W. D.</given-names></name> <name><surname>Volckaert</surname> <given-names>G.</given-names></name></person-group> (<year>2004</year>). <article-title>PHIRE, a deterministic approach to reveal regulatory elements in bacteriophage genomes</article-title>. <source>Bioinformatics</source> <volume>20</volume>, <fpage>629</fpage>&#x2013;<lpage>635</lpage>. doi: <pub-id pub-id-type="doi">10.1093/bioinformatics/btg456</pub-id>, PMID: <pub-id pub-id-type="pmid">15033869</pub-id></citation></ref>
<ref id="ref12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>B.</given-names></name> <name><surname>Li</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>iPromoter-2L2.0: identifying promoters and their types by combining smoothing cutting window algorithm and sequence-based features</article-title>. <source>Mol. Ther. Nucleic Acids</source> <volume>18</volume>, <fpage>80</fpage>&#x2013;<lpage>87</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.omtn.2019.08.008</pub-id>, PMID: <pub-id pub-id-type="pmid">31536883</pub-id></citation></ref>
<ref id="ref13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mishra</surname> <given-names>A.</given-names></name> <name><surname>Dhanda</surname> <given-names>S.</given-names></name> <name><surname>Siwach</surname> <given-names>P.</given-names></name> <name><surname>Aggarwal</surname> <given-names>S.</given-names></name> <name><surname>Jayaram</surname> <given-names>B.</given-names></name></person-group> (<year>2020</year>). <article-title>A novel method SEProm for prokaryotic promoter prediction based on DNA structure and energetics</article-title>. <source>Bioinformatics</source> <volume>36</volume>, <fpage>2375</fpage>&#x2013;<lpage>2384</lpage>. doi: <pub-id pub-id-type="doi">10.1093/bioinformatics/btz941</pub-id>, PMID: <pub-id pub-id-type="pmid">31909789</pub-id></citation></ref>
<ref id="ref14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rahman</surname> <given-names>M. S.</given-names></name> <name><surname>Aktar</surname> <given-names>U.</given-names></name> <name><surname>Jani</surname> <given-names>M. R.</given-names></name> <name><surname>Shatabda</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>iPro70-FMWin: identifying sigma 70 promoters using multiple windowing and minimal features</article-title>. <source>Mol. Gen. Genomics.</source> <volume>294</volume>, <fpage>69</fpage>&#x2013;<lpage>84</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s00438-018-1487-5</pub-id>, PMID: <pub-id pub-id-type="pmid">30187132</pub-id></citation></ref>
<ref id="ref15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rehman</surname> <given-names>M. U.</given-names></name> <name><surname>Hong</surname> <given-names>K. J.</given-names></name> <name><surname>Tayara</surname> <given-names>H.</given-names></name> <name><surname>Chong</surname> <given-names>K. T.</given-names></name></person-group> (<year>2021</year>). <article-title>m6A-NeuralTool: convolution neural tool for RNA N6-methyladenosine site identification in different species</article-title>. <source>IEEE Access</source> <volume>9</volume>, <fpage>17779</fpage>&#x2013;<lpage>17786</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3054361</pub-id></citation></ref>
<ref id="ref16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Salmond</surname> <given-names>G. P.</given-names></name> <name><surname>Fineran</surname> <given-names>P. C.</given-names></name></person-group> (<year>2015</year>). <article-title>A century of the phage: past, present and future</article-title>. <source>Nat. Rev. Microbiol.</source> <volume>13</volume>, <fpage>777</fpage>&#x2013;<lpage>786</lpage>. doi: <pub-id pub-id-type="doi">10.1038/nrmicro3564</pub-id>, PMID: <pub-id pub-id-type="pmid">26548913</pub-id></citation></ref>
<ref id="ref17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sampaio</surname> <given-names>M.</given-names></name> <name><surname>Rocha</surname> <given-names>M.</given-names></name> <name><surname>Oliveira</surname> <given-names>H.</given-names></name> <name><surname>Dias</surname> <given-names>O.</given-names></name></person-group> (<year>2019</year>). <article-title>Predicting promoters in phage genomes using PhagePromoter</article-title>. <source>Bioinformatics</source> <volume>35</volume>, <fpage>5301</fpage>&#x2013;<lpage>5302</lpage>. doi: <pub-id pub-id-type="doi">10.1093/bioinformatics/btz580</pub-id>, PMID: <pub-id pub-id-type="pmid">31359029</pub-id></citation></ref>
<ref id="ref18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shujaat</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>S. B.</given-names></name> <name><surname>Tayara</surname> <given-names>H.</given-names></name> <name><surname>Chong</surname> <given-names>K. T.</given-names></name></person-group> (<year>2021</year>). <article-title>Crprom: a convolutional neural network-based model for the prediction of rice promoters</article-title>. <source>IEEE Access</source> <volume>9</volume>, <fpage>81485</fpage>&#x2013;<lpage>81491</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3086102</pub-id></citation></ref>
<ref id="ref19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shujaat</surname> <given-names>M.</given-names></name> <name><surname>Wahab</surname> <given-names>A.</given-names></name> <name><surname>Tayara</surname> <given-names>H.</given-names></name> <name><surname>Chong</surname> <given-names>K. T.</given-names></name></person-group> (<year>2020</year>). <article-title>pcPromoter-CNN: a CNN-based prediction and classification of promoters</article-title>. <source>Genes (Basel)</source> <volume>11</volume>:<fpage>1529</fpage>. doi: <pub-id pub-id-type="doi">10.3390/genes11121529</pub-id>, PMID: <pub-id pub-id-type="pmid">33371507</pub-id></citation></ref>
<ref id="ref20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sierro</surname> <given-names>N.</given-names></name> <name><surname>Makita</surname> <given-names>Y.</given-names></name> <name><surname>de Hoon</surname> <given-names>M.</given-names></name> <name><surname>Nakai</surname> <given-names>K.</given-names></name></person-group> (<year>2008</year>). <article-title>DBTBS: a database of transcriptional regulation in <italic>Bacillus subtilis</italic> containing upstream intergenic conservation information</article-title>. <source>Nucleic Acids Res.</source> <volume>36</volume>, <fpage>D93</fpage>&#x2013;<lpage>D96</lpage>. doi: <pub-id pub-id-type="doi">10.1093/nar/gkm910</pub-id>, PMID: <pub-id pub-id-type="pmid">17962296</pub-id></citation></ref>
<ref id="ref21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Silva</surname> <given-names>S.</given-names></name> <name><surname>Echeverrigaray</surname> <given-names>S.</given-names></name></person-group> (<year>2012</year>). <article-title>Bacterial promoter features description and their application on <italic>E. coli</italic> in silico prediction and recognition approaches</article-title>. <source>Bioinformatics. InTech</source> <volume>1</volume>, <fpage>241</fpage>&#x2013;<lpage>260</lpage>. doi: <pub-id pub-id-type="doi">10.5772/48149</pub-id></citation></ref>
<ref id="ref22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sokal</surname> <given-names>R. R.</given-names></name> <name><surname>Thomson</surname> <given-names>B. A.</given-names></name></person-group> (<year>2006</year>). <article-title>Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population</article-title>. <source>Am. J. Phys. Anthropol.</source> <volume>129</volume>, <fpage>121</fpage>&#x2013;<lpage>131</lpage>. doi: <pub-id pub-id-type="doi">10.1002/ajpa.20250</pub-id>, PMID: <pub-id pub-id-type="pmid">16261547</pub-id></citation></ref>
<ref id="ref23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Umarov</surname> <given-names>R. K.</given-names></name> <name><surname>Solovyev</surname> <given-names>V. V.</given-names></name></person-group> (<year>2017</year>). <article-title>Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks</article-title>. <source>PLoS One</source> <volume>12</volume>:<fpage>e0171410</fpage>. doi: <pub-id pub-id-type="doi">10.1371/journal.pone.0171410</pub-id>, PMID: <pub-id pub-id-type="pmid">28158264</pub-id></citation></ref>
<ref id="ref24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Wei</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name></person-group> (<year>2020</year>). <article-title>Synthetic promoter design in <italic>Escherichia coli</italic> based on a deep generative network</article-title>. <source>Nucleic Acids Res.</source> <volume>48</volume>, <fpage>6403</fpage>&#x2013;<lpage>6412</lpage>. doi: <pub-id pub-id-type="doi">10.1093/nar/gkaa325</pub-id>, PMID: <pub-id pub-id-type="pmid">32424410</pub-id></citation></ref>
<ref id="ref25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Cheng</surname> <given-names>L.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Xiao</surname> <given-names>M.</given-names></name> <name><surname>Xia</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>DPProm: a two-layer predictor for identifying promoters and their types on phage genome using deep learning</article-title>. <source>IEEE J. Biomed. Health Inform.</source> <volume>26</volume>, <fpage>5258</fpage>&#x2013;<lpage>5266</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JBHI.2022.3193224</pub-id>, PMID: <pub-id pub-id-type="pmid">35867364</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0004">
<p><sup>1</sup><ext-link xlink:href="http://nsclbio.jbnu.ac.kr/tools/iProm-phage/" ext-link-type="uri">http://nsclbio.jbnu.ac.kr/tools/iProm-phage/</ext-link></p>
</fn>
</fn-group>
</back>
</article>