<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Genet.</journal-id>
<journal-title>Frontiers in Genetics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Genet.</abbrev-journal-title>
<issn pub-type="epub">1664-8021</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fgene.2021.744334</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Genetics</subject>
<subj-group>
<subject>Methods</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Comparative Analysis of Unsupervised Protein Similarity Prediction Based on Graph Embedding</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Zhang</surname> <given-names>Yuanyuan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/816662/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Ziqi</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1149186/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Shudong</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1075655/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Shang</surname> <given-names>Junliang</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1261282/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Information and Control Engineering, Qingdao University of Technology</institution>, <addr-line>Qingdao</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>College of Computer Science and Technology, China University of Petroleum (East China)</institution>, <addr-line>Qingdao</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>School of Information Science and Engineering, Qufu Normal University</institution>, <addr-line>Rizhao</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Jianing Xi, Northwestern Polytechnical University, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Shouheng Tuo, Xi&#x2019;an University of Posts and Telecommunications, China; Cheng Liang, Shandong Normal University, China; Yajun Liu, Xi&#x2019;an University of Technology, China</p></fn>
<corresp id="c001">&#x002A;Correspondence: Yuanyuan Zhang, <email>yyzhang1217@163.com</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Computational Genomics, a section of the journal Frontiers in Genetics</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>22</day>
<month>09</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>12</volume>
<elocation-id>744334</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>07</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>08</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2021 Zhang, Wang, Wang and Shang.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Zhang, Wang, Wang and Shang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>The study of protein&#x2013;protein interaction and the determination of protein functions are important parts of proteomics. Computational methods are used to study the similarity between proteins based on Gene Ontology (GO) to explore their functions and possible interactions. GO is a series of standardized terms that describe gene products in terms of molecular functions, biological processes, and cell components. Previous studies measured the similarity of proteins primarily through the Information Content (IC) of their GO terms. However, these methods tend to ignore the structural information between GO terms. Therefore, considering the structural information of GO terms, we systematically analyze the performance of the GO graph and GO Annotation (GOA) graph in calculating the similarity of proteins using different graph embedding methods. When applied to the actual Human and Yeast datasets, the feature vectors of GO terms and proteins are learned based on different graph embedding methods. To measure the similarity of proteins annotated by different numbers of GO terms, we used Dynamic Time Warping (DTW) and cosine similarity to calculate protein similarity in the GO graph and the GOA graph, respectively. Link prediction experiments were then performed to evaluate the reliability of protein similarity networks constructed by different methods. It is shown that graph embedding methods have obvious advantages over the traditional IC-based methods. We found that random walk graph embedding methods, in particular, showed excellent performance in calculating the similarity of proteins. By comparing link prediction experiment results from GO(DTW) and GOA(cosine) methods, it is shown that GO(DTW) features provide highly effective information for analyzing the similarity among proteins.</p>
</abstract>
<kwd-group>
<kwd>protein similarity</kwd>
<kwd>graph embedding</kwd>
<kwd>gene ontology</kwd>
<kwd>link prediction</kwd>
<kwd>DTW algorithm</kwd>
</kwd-group>
<contract-num rid="cn001">61902430</contract-num>
<contract-num rid="cn001">61873281</contract-num>
<contract-num rid="cn001">61972226</contract-num>
<contract-sponsor id="cn001">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor>
<counts>
<fig-count count="5"/>
<table-count count="6"/>
<equation-count count="14"/>
<ref-count count="25"/>
<page-count count="11"/>
<word-count count="7402"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="S1">
<title>Introduction</title>
<p>Proteomics essentially refers to the study of the characteristics of proteins on a large scale, including the expression level of proteins, the functions of proteins, protein&#x2013;protein interactions, and so forth. The study of the proteome not only provides the material basis for the laws of life activities but can also provide a theoretical basis for elucidating the mechanisms of many diseases and for developing solutions to them (<xref ref-type="bibr" rid="B100">Xi et al., 2020a</xref>). However, research on the function of proteins is still lacking. The functions of most proteins encoded by genes newly discovered through genome sequencing are unknown. For those whose functions are known, the functions have mostly been inferred by methods such as analogy with homologous genes. Therefore, using computational methods to explore the similarity between proteins can effectively improve the efficiency of proteomic studies.</p>
<p>Gene Ontology (GO) (<xref ref-type="bibr" rid="B5">Harris, 2004</xref>) describes the function of genes. It is a standardized description of the characteristics of genes and gene products, enabling bioinformatics researchers to uniformly summarize, process, interpret, and share the data of genes and gene products. It provides the representation of biological knowledge through structured and controlled terms. GO includes three kinds of ontologies: Biological Processes (BPs), Cell Components (CCs), and Molecular Functions (MFs). The terms in the three kinds of ontologies are related to each other and form a Directed Acyclic Graph (DAG), wherein a node denotes a GO term, while an edge denotes a kind of relationship between two GO terms. Therefore, it is of great significance to study the similarity of proteins based on the graph characteristics of GO to explore the function of proteins.</p>
<p>GO has been widely studied in the field of biology (<xref ref-type="bibr" rid="B23">Xi et al., 2020b</xref>). GO terms have been used to annotate many biomedical databases [e.g., UniProt database (<xref ref-type="bibr" rid="B19">UniProt Consortium, 2015</xref>) and SwissProt database (<xref ref-type="bibr" rid="B1">Amos and Brigitte, 1999</xref>)]. The characteristics and structure of GO have made GO terms the basis of functional comparison between gene products (<xref ref-type="bibr" rid="B14">Pesaranghader et al., 2014</xref>). GO annotation defines the semantic similarity of genes (proteins) and provides a basis for measuring the functional similarity of proteins. The more information two GO terms share, the more similar they are, and the more similar the proteins annotated by the two GO terms are (<xref ref-type="bibr" rid="B6">Hu et al., 2021</xref>). In earlier studies, many researchers analyzed protein&#x2013;protein interaction (PPI) based on GO (<xref ref-type="bibr" rid="B17">Sevilla et al., 2005</xref>). Studies on computing protein similarity using GO mainly focus on the Information Content (IC) of GO terms, which is widely used to identify relations between proteins. The uniqueness of GO terms is often evaluated by taking the average of the IC of two terms. The IC of a term depends on the annotating corpus (<xref ref-type="bibr" rid="B17">Sevilla et al., 2005</xref>). Three IC-based methods, namely Resnik&#x2019;s (<xref ref-type="bibr" rid="B16">Resnik, 1999</xref>), Rel&#x2019;s (<xref ref-type="bibr" rid="B12">Paul and Meeta, 2008</xref>), and Jiang and Conrath&#x2019;s (<xref ref-type="bibr" rid="B7">Jiang and Conrath, 1997</xref>), have been introduced from natural language taxonomies by <xref ref-type="bibr" rid="B10">Lord et al. (2003)</xref> to compare genes (proteins). 
Although the abovementioned methods achieve good results when calculating the semantic similarity between two GO terms, they consider only the amount of information in the common nodes. They do not consider the information differences between the nodes themselves and ignore the structural information of the terms, so the result of a term comparison is only a rough estimate. For example, in Resnik&#x2019;s method, any two pairs of terms that share the same common ancestor receive the same similarity, regardless of the layers in which the terms lie, so such pairs cannot be distinguished. Obviously, this is unreasonable.</p>
<p>This study merged the three categories of ontologies and GO annotations into a large graph called the GO Annotation (GOA) graph. We also transformed the three categories of ontologies into a GO graph. Effective graph analysis on GOA and GO graphs can improve our understanding of the structure and node information of GO and proteins. Using the GOA information of the proteins, the similarity among proteins can be calculated, and the relationship between proteins can be predicted. In recent years, graph learning-based analytical methods have made remarkable progress in bioinformatics and other fields (<xref ref-type="bibr" rid="B22">Xi et al., 2021</xref>). At present, graph learning-based analytical methods focus on dynamic graphs. Methods such as SDNE (<xref ref-type="bibr" rid="B20">Wang et al., 2016</xref>), DeepWalk (<xref ref-type="bibr" rid="B13">Perozzi et al., 2014</xref>), LINE (<xref ref-type="bibr" rid="B18">Tang et al., 2015</xref>), Node2vec (<xref ref-type="bibr" rid="B4">Grover and Leskovec, 2016</xref>), and SINE (<xref ref-type="bibr" rid="B21">Wang et al., 2020</xref>) have been widely used for unsupervised feature learning in the field of data mining and natural language processing. Their edge prediction task has been applied to PPI prediction to find new protein interaction relationships. These methods also provide a basis for calculating protein similarity based on GO, as in GO2vec (<xref ref-type="bibr" rid="B25">Zhong et al., 2019</xref>), which used the Node2vec algorithm to compute the functional similarity between proteins.</p>
<p>To explore the performance of graph embedding methods in measuring protein similarity based on GO and GOA, we used four typical graph embedding methods to learn the features of GO terms and proteins. These methods can be divided into two categories. The first category is the random walk method, such as the DeepWalk and Node2vec methods. The DeepWalk method uses a truncated random walk strategy to obtain sequences of nodes and learns node embeddings from these sequences with Word2Vec (<xref ref-type="bibr" rid="B3">Goldberg and Levy, 2014</xref>). Node2vec uses a biased random walk to generate node sequences by balancing the Breadth First Search (BFS) and Depth First Search (DFS) of the graph. The second category is based on deep learning, such as the SDNE and LINE methods. SDNE uses an auto-encoder to optimize the first-order and second-order similarity simultaneously, while LINE optimizes the two orders of similarity separately. As a result, their learned node embeddings can retain the local and global graph structure and are robust to sparse networks. We introduce the overall flowchart of this paper in <xref ref-type="fig" rid="F1">Figure 1</xref>, which is divided into two parts. Firstly, in Part A, the features of GO terms are learned based on the GO graph using graph embedding methods. The similarity of proteins is then calculated based on the features of their annotated GO terms by Dynamic Time Warping (DTW) distance (<xref ref-type="bibr" rid="B11">Lou et al., 2016</xref>). Secondly, in Part B, the features of proteins are learned based on the GOA graph directly. Then, the cosine similarity of the corresponding features is calculated to measure the similarity of proteins. 
Finally, a link prediction (<xref ref-type="bibr" rid="B8">Li et al., 2018</xref>) experiment is performed in the screened-out protein similarity networks, using the area under the curve (AUC) (<xref ref-type="bibr" rid="B9">Lobo, 2010</xref>) and area under the precision-recall curve (AUCPR) (<xref ref-type="bibr" rid="B24">Yu and Park, 2014</xref>) to evaluate the reliability of the protein network constructed by learned vectors.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Framework for analyzing protein similarity.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fgene-12-744334-g001.tif"/>
</fig>
</sec>
<sec id="S2" sec-type="materials|methods">
<title>Materials and Methods</title>
<sec id="S2.SS1">
<title>Data Source and Preprocessing</title>
<p>We downloaded GO data in Open Biomedical Ontologies (OBO) format from the GO Consortium Website<sup><xref ref-type="fn" rid="footnote1">1</xref></sup>. The GO protein annotations were obtained from the UniProt GOA website<sup><xref ref-type="fn" rid="footnote2">2</xref></sup>. The Yeast dataset contained 2,887 proteins, and the Human dataset contained 9,677 proteins. The GO data were then preprocessed in two steps. First, since several GO terms annotate a protein, the term&#x2013;term relations of GO terms and the term&#x2013;protein annotations between GO terms and proteins were combined into a GOA graph. Second, the GO terms and their relations were transformed into an undirected, unweighted GO graph, regardless of the type and direction of each relationship. We summarize the numbers of GO terms and edges in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Characteristics of GO graphs.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Gene ontology</td>
<td valign="top" align="center">Terms</td>
<td valign="top" align="center">Edges</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">BP<xref ref-type="table-fn" rid="t1fn1">&#x002A;</xref></td>
<td valign="top" align="center">30,705</td>
<td valign="top" align="center">71,530</td>
</tr>
<tr>
<td valign="top" align="left">CC<xref ref-type="table-fn" rid="t1fn1">&#x002A;&#x002A;</xref></td>
<td valign="top" align="center">4,380</td>
<td valign="top" align="center">7,523</td>
</tr>
<tr>
<td valign="top" align="left">MF<xref ref-type="table-fn" rid="t1fn1">&#x002A;&#x002A;&#x002A;</xref></td>
<td valign="top" align="center">12,127</td>
<td valign="top" align="center">13,658</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="t1fn1"><p><italic>&#x002A;Biological Processes, &#x002A;&#x002A;Cell Components, and &#x002A;&#x002A;&#x002A;Molecular Functions.</italic></p></fn>
</table-wrap-foot>
</table-wrap>
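<p>The two preprocessing steps described in this section can be illustrated with a minimal sketch. It assumes that the term&#x2013;term relations and term&#x2013;protein annotations have already been parsed into pairs; the function name <monospace>build_undirected_graph</monospace> and all identifiers in the toy data are hypothetical and are not taken from the GO or UniProt GOA files.</p>
<preformat><![CDATA[
```python
from collections import defaultdict


def build_undirected_graph(edges):
    """Return an adjacency map for an undirected, unweighted graph.

    Relation type and direction are deliberately discarded, as in the
    preprocessing described in the text.
    """
    adjacency = defaultdict(set)
    for source, target in edges:
        if source == target:
            continue  # ignore self-loops
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


# Hypothetical toy data: (child term, parent term) relations and
# (protein, GO term) annotations. These identifiers are placeholders.
term_relations = [("GO:B", "GO:A"), ("GO:C", "GO:A"), ("GO:D", "GO:B")]
annotations = [("P1", "GO:C"), ("P1", "GO:D"), ("P2", "GO:D")]

# GO graph: GO terms only.
go_graph = build_undirected_graph(term_relations)

# GOA graph: GO terms plus proteins, linked through annotations.
goa_graph = build_undirected_graph(term_relations + annotations)
```
]]></preformat>
<p>In this sketch the GOA graph is simply the GO graph extended with protein nodes, so an embedding method applied to it yields vectors for proteins as well as for GO terms.</p>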
</sec>
<sec id="S2.SS2">
<title>Method</title>
<p>Based on different graph embedding methods, the features of GO terms and proteins were learned as vector representations by fusing the GO and GOA graph topologies, respectively. In this way, the graph embedding methods capture the global information of the graphs, and the learned vectors are used to calculate the similarity between proteins by the DTW distance and the cosine similarity.</p>
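<p>As a rough illustration of the two measures named above, the following sketch implements the cosine similarity between two vectors and the classic DTW recurrence between two sequences of vectors of possibly different lengths. It is a sketch under stated assumptions rather than the implementation used in this study: the Euclidean local cost, the ordering of the GO-term vectors within each sequence, and the function names are all assumptions made for illustration.</p>
<preformat><![CDATA[
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance, used as the local cost in DTW (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Classic DTW distance between two sequences of vectors.

    The sequences may have different lengths, which is what allows
    proteins annotated by different numbers of GO terms to be compared.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]
```
]]></preformat>
<p>A smaller DTW distance indicates more similar sequences, whereas a larger cosine value indicates more similar vectors, so the two scores run in opposite directions and need to be handled accordingly when they are compared.</p>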
<sec id="S2.SS2.SSS1">
<title>Introduction of Different Graph Embedding Methods</title>
<p>In this paper, we used graph embedding methods based on random walk and deep learning to learn the features of GO terms and proteins through fusing the topology of GO and GOA graphs, respectively. Random walk-based methods include DeepWalk (<xref ref-type="bibr" rid="B13">Perozzi et al., 2014</xref>) and Node2vec (<xref ref-type="bibr" rid="B4">Grover and Leskovec, 2016</xref>). The DeepWalk method is divided into two parts: a random walk that produces node sequences, and a step that generates node embeddings from these sequences. The random walk is used to obtain the local information of a node in the graph, and the embedding reflects the local structure of the node in the graph. The path length is controlled by setting the parameter walk-length (<italic>L</italic>). The more neighborhood nodes (including higher-order neighborhood nodes) two nodes share, the more similar they are. <xref ref-type="fig" rid="F2">Figure 2A</xref> illustrates the DeepWalk algorithm flow. The Node2vec method sets two hyper-parameters <italic>p</italic> and <italic>q</italic> to control the random walk and adopts a flexible biased random walk procedure that smoothly combines BFS and DFS to generate node sequences. <xref ref-type="fig" rid="F2">Figure 2B</xref> illustrates the Node2vec algorithm flow. Nodes <italic>c</italic><sub><italic>i</italic></sub> are generated based on the following distribution:</p>
<disp-formula id="S2.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">x</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">c</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mrow>
<mml:mfrac>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C0;</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>Z</mml:mi>
</mml:mfrac>
<mml:mo mathvariant="italic" separator="true">&#x2003;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mtext>if&#x00A0;</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo mathvariant="italic" separator="true">&#x2003;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mtext>otherwise</mml:mtext>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where &#x03C0;<sub><italic>tx</italic></sub> is the unnormalized transition probability between nodes <italic>t</italic> and <italic>x</italic>, <italic>Z</italic> is the normalization constant, and <italic>E</italic> is the edge set of the graph. According to the node context information, node sequences are generated by setting the sizes of the hyper-parameters <italic>p</italic> and <italic>q</italic> to control the random walk strategy. The Skip-gram model is then used to obtain the vector representation of the nodes. The random walk graph embedding of nodes reflects the local and global topology information of nodes in the graph.</p>
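<p>The biased walk governed by <italic>p</italic> and <italic>q</italic> can be sketched as follows for an unweighted graph. This is a minimal illustration that follows the search bias described by Grover and Leskovec (2016), not the code used in this study: the function name, the toy graph, and the parameter values are hypothetical. The unnormalized weight of a candidate node is 1/<italic>p</italic> if it is the previously visited node, 1 if it is also a neighbor of the previous node, and 1/<italic>q</italic> otherwise; the weighted sampling call performs the division by <italic>Z</italic>.</p>
<preformat><![CDATA[
```python
import random


def node2vec_walk(adjacency, start, walk_length, p, q, rng):
    """Generate one biased random walk on an unweighted graph."""
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = sorted(adjacency[current])
        if not neighbors:
            break  # dead end: no edge to follow
        if len(walk) == 1:
            # First step: no previous node yet, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # return to the previous node
            elif candidate in adjacency[previous]:
                weights.append(1.0)      # stay close to the previous node (BFS-like)
            else:
                weights.append(1.0 / q)  # move away from the previous node (DFS-like)
        # random.choices normalizes the weights, which plays the role of Z.
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph given as an adjacency map.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
walk = node2vec_walk(toy_graph, "a", walk_length=6, p=1.0, q=0.5,
                     rng=random.Random(0))
```
]]></preformat>
<p>With <italic>q</italic> below 1 the walk is encouraged to explore outward, and with <italic>p</italic> below 1 it tends to backtrack; the resulting node sequences would then be passed to a Skip-gram model to obtain the embeddings.</p>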
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Framework for graph embedding method. <bold>(A)</bold> DeepWalk, <bold>(B)</bold> Node2vec, <bold>(C)</bold> SDNE, and <bold>(D)</bold> LINE.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fgene-12-744334-g002.tif"/>
</fig>
<p>The second kind of embedding method is based on deep learning. SDNE is a semi-supervised deep model of this kind. By combining first-order and second-order similarity, SDNE can capture both the local and the global structural properties of the graph. The unsupervised part uses a deep auto-encoder to learn the second-order similarity, and the supervised part uses Laplacian eigenmaps to capture the first-order similarity. <xref ref-type="fig" rid="F2">Figure 2C</xref> illustrates the SDNE algorithm flow. The input vector <italic>S</italic><sub><italic>i</italic></sub> of node <italic>i</italic> is compressed by the auto-encoder and then reconstructed as <italic>S</italic>&#x2032;<sub><italic>i</italic></sub>. The reconstruction loss function is defined as follows:</p>
<disp-formula id="S2.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:msub>
<mml:mi>O</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:msubsup>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:math>
</disp-formula>
<p>LINE is another method based on deep learning, which optimizes the first-order and second-order similarities (<xref ref-type="fig" rid="F2">Figure 2D</xref>). The first-order similarity describes the local similarity between pairs of nodes that are directly connected in the graph. The second-order similarity applies to two nodes that need not be directly connected by an edge but share common neighbor nodes, which indicates that the two nodes are similar.</p>
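<p>The difference between the two orders of similarity can be made concrete with a toy example. The sketch below is only an illustrative proxy and is not the objective optimized by LINE or SDNE: first-order similarity is reduced to the presence of an edge, and second-order similarity is approximated by the Jaccard overlap of the neighbor sets, which is an assumption made here for simplicity. The function names and the toy graph are hypothetical.</p>
<preformat><![CDATA[
```python
def first_order_proximity(adjacency, u, v):
    """1.0 if u and v are directly connected, else 0.0 (unweighted graph)."""
    return 1.0 if v in adjacency.get(u, set()) else 0.0


def second_order_proximity(adjacency, u, v):
    """Jaccard overlap of the neighbor sets of u and v (illustrative proxy)."""
    neighbors_u = adjacency.get(u, set())
    neighbors_v = adjacency.get(v, set())
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0
    return len(neighbors_u & neighbors_v) / len(union)


# Hypothetical toy graph: "a" and "b" are not connected to each other
# but have exactly the same neighbors.
toy_graph = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b"},
}

first_ab = first_order_proximity(toy_graph, "a", "b")    # no direct edge
second_ab = second_order_proximity(toy_graph, "a", "b")  # identical neighborhoods
first_ac = first_order_proximity(toy_graph, "a", "c")    # direct edge
second_ac = second_order_proximity(toy_graph, "a", "c")  # disjoint neighborhoods
```
]]></preformat>
<p>In this toy graph, nodes a and b have no first-order similarity but maximal second-order similarity, whereas a and c show the opposite pattern, which is why the two orders carry complementary information.</p>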
</sec>
<sec id="S2.SS2.SSS2">
<title>Introduction to IC-Based Method</title>
<p>In this paper, we chose two typical IC-based methods to measure the semantic similarity of GO terms, namely those of <xref ref-type="bibr" rid="B7">Jiang and Conrath (1997)</xref> and Rel (<xref ref-type="bibr" rid="B12">Paul and Meeta, 2008</xref>). The IC of a term decreases as the frequency with which the term is used to annotate genes in a given corpus, such as the UniProt database, increases. The IC of a GO term <italic>g</italic> is defined by the negative log-likelihood and is given by</p>
<disp-formula id="S2.E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S2.E4">
<label>(4)</label>
<mml:math id="M4">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>q</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>freq</italic>(<italic>g</italic>) is the number of proteins annotated by term <italic>g</italic> or its offspring in a specific GO annotated corpus, <italic>p</italic>(<italic>g</italic>) is the corresponding relative frequency, and <italic>N</italic> represents the total number of annotated proteins in the corpus. For example, if there are 50 annotated proteins in a corpus and 10 of them are annotated by term <italic>g</italic>, the annotation frequency of term <italic>g</italic> is <italic>p</italic>(<italic>g</italic>) = 10/50 = 0.2.</p>
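<p>Equations (3) and (4) can be checked on the numerical example above with a short sketch. The natural logarithm is assumed here because the base of the logarithm is not stated, and the function names, the descendant map, and the toy annotations are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def term_frequency(term, descendants, annotations):
    """Count proteins annotated by `term` or by any of its descendants."""
    covered = {term} | descendants.get(term, set())
    return sum(1 for terms in annotations.values() if covered & terms)


def information_content(freq, total_proteins):
    """IC(g) = -log p(g) with p(g) = freq(g) / N (natural log assumed)."""
    if freq <= 0 or freq > total_proteins:
        raise ValueError("freq must satisfy 0 < freq <= total_proteins")
    return -math.log(freq / total_proteins)


# Hypothetical corpus of 50 proteins: 6 annotated by g, 4 by a child
# of g, and the remaining 40 by an unrelated term.
descendants = {"g": {"g_child"}}
annotations = {}
for k in range(50):
    if k < 6:
        annotations[f"P{k}"] = {"g"}
    elif k < 10:
        annotations[f"P{k}"] = {"g_child"}
    else:
        annotations[f"P{k}"] = {"other"}

freq_g = term_frequency("g", descendants, annotations)  # 10 proteins
p_g = freq_g / len(annotations)                         # 10 / 50 = 0.2
ic_g = information_content(freq_g, len(annotations))    # -ln(0.2), about 1.609
```
]]></preformat>
<p>Because the count includes the offspring of a term, terms near the root of the ontology have a high frequency and a low IC, whereas specific terms near the leaves have a high IC.</p>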
<p>Jiang and Conrath&#x2019;s and Rel&#x2019;s methods rely on comparing the attributes of terms in GO. Jiang and Conrath&#x2019;s method considers the fact that the semantic similarity between two terms is closely related to the nearest common ancestor of the two terms. The semantic similarity between two terms is estimated from the IC of this common ancestor. Jiang and Conrath&#x2019;s and Rel&#x2019;s similarities are expressed as follows:</p>
<disp-formula id="S2.E5">
<label>(5)</label>
<mml:math id="M5">
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>J</mml:mi>
<mml:mo>&#x0026;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S2.E6">
<label>(6)</label>
<mml:math id="M6">
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:msub>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>C</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x00D7;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>g</italic><sub><italic>c</italic></sub> is the most informative common ancestor of <italic>g</italic><sub>1</sub> and <italic>g</italic><sub>2</sub> in the ontology. Given two proteins <italic>P</italic><sub><italic>m</italic></sub> and <italic>P</italic><sub><italic>n</italic></sub> annotated with GO terms <italic>G</italic><sub><italic>m</italic></sub> = {<italic>g</italic><sub>1</sub>,&#x22EF;,<italic>g</italic><sub><italic>i</italic></sub>} and <inline-formula><mml:math id="INEQ2"><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mn>1</mml:mn><mml:msup><mml:mi/><mml:mo>&#x2032;</mml:mo></mml:msup></mml:msubsup><mml:mo rspace="4.2pt">,</mml:mo><mml:mi mathvariant="normal">&#x22EF;</mml:mi><mml:mo rspace="4.2pt">,</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mi>j</mml:mi><mml:msup><mml:mi/><mml:mo>&#x2032;</mml:mo></mml:msup></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, we used the Best Match Average (BMA) method to compute the similarity between two sets of GO terms, which can be expressed as follows:</p>
<disp-formula id="S2.Ex1">
<label>(7)</label>
<mml:math id="M7">
<mml:mrow>
<mml:mi>B</mml:mi><mml:mi>M</mml:mi><mml:mi>A</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo><mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>2</mml:mn>
</mml:mfrac>
<mml:mo stretchy='false'>(</mml:mo><mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>n</mml:mi>
</mml:mfrac>
<mml:mstyle displaystyle='true'>
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo><mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mi>max</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow></mml:mrow>
<mml:mrow>
<mml:msub>
<mml:msup>
<mml:mi>g</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo><mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munder>
</mml:mrow>
</mml:mstyle><mml:mtext>&#x2009;</mml:mtext>
<mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo><mml:msub>
<mml:msup>
<mml:mi>g</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>m</mml:mi>
</mml:mfrac>
<mml:mstyle displaystyle='true'>
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:msup>
<mml:mi>g</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo><mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mi>max</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo><mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:mtext>&#x2009;</mml:mtext>
</mml:mrow>
</mml:mstyle><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub>
<mml:mi>g</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo><mml:msub>
<mml:msup>
<mml:mi>g</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula><p>where <inline-formula><mml:math id="INEQ3"><mml:mrow><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> is the similarity between term <italic>g</italic><sub><italic>m</italic></sub> and term <italic>g</italic>&#x2032;<sub><italic>n</italic></sub>, which can be calculated using IC-based similarity methods.</p>
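<p>The BMA computation can be illustrated with a minimal Python sketch. It assumes that each sum is divided by the size of the set it ranges over, which is the usual reading of BMA; the function name <italic>bma</italic>, the toy term labels, and the toy similarity values are hypothetical and serve only as an illustration.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def bma(terms_m, terms_n, sim):
    """Best Match Average of two GO term sets under a term-level similarity `sim`."""
    if not terms_m or not terms_n:
        raise ValueError("both proteins need at least one GO term")
    # For every term of one protein, keep its best match in the other protein.
    best_m = [max(sim(g, h) for h in terms_n) for g in terms_m]
    best_n = [max(sim(g, h) for g in terms_m) for h in terms_n]
    # Average the best matches in each direction, then average both directions.
    return 0.5 * (sum(best_m) / len(best_m) + sum(best_n) / len(best_n))


# Hypothetical term-level similarities (illustrative values, not from the paper).
TOY_SIM = {("a", "x"): 0.5, ("a", "y"): 1.0, ("b", "x"): 0.25, ("b", "y"): 0.0}


def toy_sim(g, h):
    return TOY_SIM[(g, h)]


score = bma(["a", "b"], ["x", "y"], toy_sim)
```
]]></preformat>
<p>In practice, the <italic>sim</italic> argument would be an IC-based term similarity; any function of two terms that returns a number can be passed in.</p>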
</sec>
<sec id="S2.SS2.SSS3">
<title>Protein Similarity Calculation</title>
<p>Each node in the GO graph is represented as a low-dimensional feature vector that a graph embedding method learns from the graph topology. A protein is usually annotated with several GO terms. For example, the protein &#x201C;P03882&#x201D; is annotated with the GO terms &#x201C;GO:0004519,&#x201D; &#x201C;GO:0005739,&#x201D; &#x201C;GO:0006314,&#x201D; and &#x201C;GO:0006397.&#x201D; Since a set of GO terms can be represented by its corresponding set of vectors, the similarity between two proteins can be calculated from the similarity of their two sets of GO vectors. Therefore, for any GO term <italic>g</italic><sub><italic>i</italic></sub>, we use the SDNE (<xref ref-type="bibr" rid="B20">Wang et al., 2016</xref>), DeepWalk (<xref ref-type="bibr" rid="B13">Perozzi et al., 2014</xref>), LINE (<xref ref-type="bibr" rid="B18">Tang et al., 2015</xref>), and Node2vec (<xref ref-type="bibr" rid="B4">Grover and Leskovec, 2016</xref>) graph embedding methods to learn the low-dimensional feature vector <italic>v</italic><sub><italic>i</italic></sub>.</p>
<p>Let <italic>G</italic><sub><italic>m</italic></sub> = {<italic>g</italic><sub>1</sub>,<italic>g</italic><sub>2</sub>,&#x22EF;,<italic>g</italic><sub><italic>m</italic></sub>} and <inline-formula><mml:math id="INEQ7"><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mn>2</mml:mn><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo rspace="4.2pt">,</mml:mo><mml:mi mathvariant="normal">&#x22EF;</mml:mi><mml:mo rspace="4.2pt">,</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denote the sets of GO terms that annotate proteins <italic>P</italic><sub><italic>m</italic></sub> and <italic>P</italic><sub><italic>n</italic></sub>, and let <italic>V</italic><sub><italic>m</italic></sub> = {<italic>v</italic><sub>1</sub>,<italic>v</italic><sub>2</sub>,&#x22EF;,<italic>v</italic><sub><italic>m</italic></sub>} and <inline-formula><mml:math id="INEQ9"><mml:mrow><mml:msub><mml:mi>V</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mn>2</mml:mn><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo rspace="4.2pt">,</mml:mo><mml:mi mathvariant="normal">&#x22EF;</mml:mi><mml:mo rspace="4.2pt">,</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denote the sets of vectors that correspond to <italic>G</italic><sub><italic>m</italic></sub> = {<italic>g</italic><sub>1</sub>,<italic>g</italic><sub>2</sub>,&#x22EF;,<italic>g</italic><sub><italic>m</italic></sub>} and <inline-formula><mml:math id="INEQ11"><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mn>2</mml:mn><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo rspace="4.2pt">,</mml:mo><mml:mi mathvariant="normal">&#x22EF;</mml:mi><mml:mo rspace="4.2pt">,</mml:mo><mml:msubsup><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, respectively. In this paper, we use dynamic time warping (DTW) to calculate the distance between the two sets of vectors, which is denoted as the DTW distance. The smaller the value, the more similar the two proteins. The GO embeddings of the two proteins&#x2019; annotations are concatenated into <italic>V</italic><sub><italic>m</italic></sub> and <italic>V</italic><sub><italic>n</italic></sub>, whose lengths are <italic>m</italic> and <italic>n</italic>, respectively; the two lengths may differ (<italic>m</italic> &#x2260; <italic>n</italic>). In the matrix <italic>D</italic><sub><italic>m&#x00D7;n</italic></sub>, the element <italic>D</italic>(<italic>v</italic><sub><italic>m</italic></sub>, <italic>v</italic>&#x2032;<sub><italic>n</italic></sub>) represents the cumulative distance of the best alignment that ends at the pair <italic>v</italic><sub><italic>m</italic></sub> and <italic>v</italic>&#x2032;<sub><italic>n</italic></sub> and can be expressed as follows:</p>
<disp-formula id="S2.Ex2">
<label>(8)</label>
<mml:math id="M8">
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow><mml:mrow>
<mml:mi/>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>min</mml:mi>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mpadded lspace="2.8pt" width="+2.8pt">
<mml:mi>D</mml:mi>
</mml:mpadded>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd/>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msubsup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi/>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula><p>Here, <italic>d</italic>(&#x00B7;,&#x00B7;) can be read as the local distance between two vectors and <italic>Dist</italic>(&#x00B7;,&#x00B7;) as the cumulative distance accumulated up to the indicated lattice point. We used the DTW distance method to find a path <italic>W</italic> through the lattice points of the matrix. The length of the shortest path is taken as the distance between the sets of vectors <italic>V</italic><sub><italic>m</italic></sub> = {<italic>v</italic><sub>1</sub>,<italic>v</italic><sub>2</sub>,&#x2026;,<italic>v</italic><sub><italic>m</italic></sub>} and <inline-formula><mml:math id="INEQ16"><mml:mrow><mml:msub><mml:mi>V</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mn>1</mml:mn><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mn>2</mml:mn><mml:mo>&#x2032;</mml:mo></mml:msubsup><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x2026;</mml:mi><mml:msubsup><mml:mi>v</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, and this distance is used to measure the similarity between the two proteins. The process for calculating the DTW distance is presented in <xref ref-type="supplementary-material" rid="DS3">Supplementary Figure 1</xref>.</p>
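<p>The DTW recurrence with step weights 1, 1, and 2 can be sketched in Python as follows. This is a minimal illustration, not the authors&#x2019; implementation: the Euclidean local distance, the function names, and the toy vectors are assumptions, since the text does not state which local distance is used.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def euclidean(u, v):
    """Local distance between two embedding vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(vecs_m, vecs_n, local=euclidean):
    """DTW distance between two vector sequences of possibly different length."""
    m, n = len(vecs_m), len(vecs_n)
    inf = float("inf")
    # acc[i][j]: cheapest cumulative cost of aligning the first i and first j vectors.
    acc = [[inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = local(vecs_m[i - 1], vecs_n[j - 1])
            acc[i][j] = min(
                acc[i - 1][j] + d,          # step from (i-1, j)
                acc[i][j - 1] + d,          # step from (i, j-1)
                acc[i - 1][j - 1] + 2 * d,  # diagonal step, weighted by 2
            )
    return acc[m][n]


# Hypothetical 2-dimensional "GO term vectors" for two proteins.
toy_vm = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
toy_vn = [[0.0, 0.0], [2.0, 0.0]]
toy_distance = dtw_distance(toy_vm, toy_vn)
```
]]></preformat>
<p>Because the table is filled cell by cell, the cost is proportional to <italic>m</italic> &#x00D7; <italic>n</italic> local distance evaluations per protein pair.</p>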
<p>For any protein <italic>P</italic><sub><italic>i</italic></sub>, the low-dimensional feature &#x03C9;<sub><italic>i</italic></sub> is learned directly from the GOA graph, which contains the term&#x2013;term and term&#x2013;protein relations. We use the cosine of the angle between the proteins&#x2019; vectors &#x03C9; to measure the similarity of the proteins. Unlike the DTW distance, a larger value indicates more similar proteins. This cosine measure, referred to here as the cosine distance, can be expressed as follows:</p>
<disp-formula id="S2.E9">
<label>(9)</label>
<mml:math id="M9">
<mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>cosine</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C9;</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C9;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C9;</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x22C5;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C9;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo fence="true">||</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C9;</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo fence="true">||</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo fence="true">||</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C9;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo fence="true">||</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
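<p>Equation 9 can be written as a short Python sketch; the function name and the example vectors are hypothetical and only illustrate the arithmetic.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def cosine(omega_m, omega_n):
    """Dot product of two protein vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(omega_m, omega_n))
    norm_m = math.sqrt(sum(a * a for a in omega_m))
    norm_n = math.sqrt(sum(b * b for b in omega_n))
    if norm_m == 0.0 or norm_n == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_m * norm_n)


# Hypothetical 2-dimensional protein vectors.
same_direction = cosine([3.0, 4.0], [3.0, 4.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```
]]></preformat>
<p>The value ranges from &#x2212;1 to 1, and vectors that point in the same direction score 1 regardless of their lengths.</p>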
</sec>
<sec id="S2.SS2.SSS4">
<title>Link Prediction and Evaluation Metrics</title>
<p>When it is difficult to find a unified standard for measuring the strengths and weaknesses of a network model, link prediction can serve as a unified method for comparing the similarity of nodes in the network. It provides a standard for measuring the reliability of the network structure. For the overall evaluation, we use two common indicators, AUC (<xref ref-type="bibr" rid="B9">Lobo, 2010</xref>) and AUCPR (<xref ref-type="bibr" rid="B24">Yu and Park, 2014</xref>), which are widely used in binary classification. To evaluate the networks constructed with different graph embedding methods in the GO graph and GOA graph, we therefore perform link prediction experiments on the protein similarity network and evaluate the accuracy of the prediction results. For any undirected network <italic>G</italic>(<italic>V</italic>,<italic>E</italic>), there are <inline-formula><mml:math id="INEQ19"><mml:msubsup><mml:mi>C</mml:mi><mml:mrow><mml:mo>|</mml:mo><mml:mi>V</mml:mi><mml:mo>|</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> node pairs in total, and <italic>E</italic> denotes the set of existing edges. We first remove 20% of the existing edges, denoted <italic>E</italic><sub><italic>r</italic></sub>, from the network. The remaining 80% of the edges, <italic>E</italic><sub><italic>s</italic></sub>, are then divided into <italic>E</italic><sub><italic>p</italic></sub> and <italic>E</italic><sub><italic>t</italic></sub>, where <italic>E</italic><sub><italic>s</italic></sub> = <italic>E</italic><sub><italic>p</italic></sub> &#x222A; <italic>E</italic><sub><italic>t</italic></sub>, <italic>E</italic><sub><italic>p</italic></sub> &#x2229; <italic>E</italic><sub><italic>t</italic></sub> = &#x2205;, and <italic>E</italic> = <italic>E</italic><sub><italic>r</italic></sub> &#x222A; <italic>E</italic><sub><italic>s</italic></sub>. Given a link prediction method, each unconnected node pair <italic>v</italic><sub><italic>x</italic></sub> and <italic>v</italic><sub><italic>y</italic></sub> is assigned a link probability score. After all node pairs are sorted by score in descending order, the pair at the top has the highest link probability. The calculation process of the AUC value is presented in <xref ref-type="supplementary-material" rid="DS3">Supplementary Figure 2</xref>. The AUCPR value is determined by the precision and recall values. In a link prediction experiment, precision is defined as the proportion of correct predictions among the top <italic>L</italic> predicted edges, ranked by link probability score in descending order. If <italic>m</italic> of the top <italic>L</italic> edges are in <italic>E</italic><sub><italic>t</italic></sub>, the precision is defined as follows:</p>
<disp-formula id="S2.E10">
<label>(10)</label>
<mml:math id="M10">
<mml:mrow>
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>m</mml:mi>
<mml:mi>L</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Let <italic>M</italic> = |<italic>E</italic>&#x2212;<italic>E</italic><sub><italic>r</italic></sub>| be the number of existing edges in the network, and let <italic>m</italic> be, as above, the number of edges correctly predicted by the prediction algorithm. The recall index is defined as follows:</p>
<disp-formula id="S2.E11">
<label>(11)</label>
<mml:math id="M11">
<mml:mrow>
<mml:mpadded width="+2.8pt">
<mml:mi>Recall</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.3pt">=</mml:mo>
<mml:mfrac>
<mml:mi>m</mml:mi>
<mml:mi>M</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
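<p>Equations 10 and 11 can be illustrated with a minimal Python sketch. The function name, the argument names, and the toy scores are hypothetical; <italic>M</italic> is passed in explicitly because the text defines it from the edge sets rather than from the ranked list.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def precision_recall_at_l(scores, positives, top_l, total_m):
    """Precision (m / L) and recall (m / M) for the top-L scored node pairs.

    scores    : dict mapping a candidate node pair to its link score
    positives : set of node pairs counted as correct predictions
    top_l     : L, the number of top-ranked pairs that are inspected
    total_m   : M, the number of existing edges used as the recall denominator
    """
    # Rank candidate pairs by score, highest first, and keep the top L.
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_l]
    hits = sum(1 for pair in ranked if pair in positives)  # this is m
    return hits / top_l, hits / total_m


# Hypothetical scores for four candidate pairs (illustrative values only).
toy_scores = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "d"): 0.4, ("c", "d"): 0.1}
toy_positives = {("a", "b"), ("c", "d")}
precision, recall = precision_recall_at_l(toy_scores, toy_positives, top_l=2, total_m=4)
```
]]></preformat>
<p>Sweeping <italic>L</italic> from 1 to the number of candidate pairs yields the precision and recall pairs from which a precision&#x2013;recall curve can be drawn.</p>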
<p>The similarity between nodes is an essential precondition for link prediction: the more similar two nodes are, the more likely it is that a link exists between them. Similarity defined from the structural information of the network is called structural similarity. The accuracy of link prediction based on structural similarity depends on whether the similarity measure captures the structural characteristics of the target network. In the link prediction task, there are many methods for calculating the structural similarity between nodes, such as the following:</p>
<sec id="S2.SS2.SSS4.Px1">
<title>Common neighbors index</title>
<p>Common Neighbors (CN) (<xref ref-type="bibr" rid="B8">Li et al., 2018</xref>) similarity can be called structural equivalence; that is, if two nodes have multiple common neighbors, they are similar. In the link prediction experiment, the basic assumption of the CN index is that the more common neighbors two unconnected nodes have, the more likely they are to be connected. For nodes <italic>v</italic><sub><italic>x</italic></sub> and <italic>v</italic><sub><italic>y</italic></sub> in the protein similarity network, their neighbor sets are denoted &#x0393;(<italic>x</italic>) and &#x0393;(<italic>y</italic>), and the similarity of the two nodes is defined as the number of their common neighbors. The CN index is defined as follows:</p>
<disp-formula id="S2.E12">
<label>(12)</label>
<mml:math id="M12">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0393;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x2229;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0393;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi>A</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>S</italic> represents the similarity matrix and <italic>A</italic> represents the adjacency matrix of the graph. The CN index is a similarity index based on local information.</p>
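<p>The two forms of the CN index, the size of the neighbor-set intersection and the corresponding entry of the squared adjacency matrix, can be compared in a short Python sketch. The function names and the toy graph are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def common_neighbors(adj, x, y):
    """CN index: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def a_squared_entry(adj, x, y):
    """Entry (x, y) of the squared adjacency matrix: two-step walks x -> z -> y."""
    return sum(1 for z in adj if z in adj[x] and y in adj[z])


# Hypothetical undirected toy graph stored as neighbor sets.
TOY_ADJ = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
cn_ab = common_neighbors(TOY_ADJ, "a", "b")
```
]]></preformat>
<p>For an unweighted, undirected graph without self-loops, both functions return the same value for every node pair.</p>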
</sec>
<sec id="S2.SS2.SSS4.Px2">
<title>Jaccard index</title>
<p>Building on the common neighbors and taking into account the degrees of the two end nodes, the Jaccard (JC) similarity index (<xref ref-type="bibr" rid="B15">Ran et al., 2015</xref>) was proposed. JC considers not only the number of common neighbors of two nodes but also the total number of their neighbors. JC is defined as follows:</p>
<disp-formula id="S2.E13">
<label>(13)</label>
<mml:math id="M13">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0393;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x2229;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0393;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0393;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x222A;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0393;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mi>A</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0393;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x222A;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0393;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
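<p>The JC index can be sketched in Python with neighbor sets. The function name and the toy graph are hypothetical, and returning 0 for two isolated nodes is an assumption made here to avoid division by zero.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def jaccard(adj, x, y):
    """JC index: shared neighbors divided by all neighbors of x or y."""
    union = adj[x] | adj[y]
    if not union:
        return 0.0  # assumed convention when neither node has neighbors
    return len(adj[x] & adj[y]) / len(union)


# Hypothetical undirected toy graph stored as neighbor sets.
TOY_ADJ = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
jc_ab = jaccard(TOY_ADJ, "a", "b")
jc_cd = jaccard(TOY_ADJ, "c", "d")
```
]]></preformat>
<p>In this toy graph, nodes a and b share two of their three distinct neighbors, while c and d have identical neighbor sets.</p>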
</sec>
<sec id="S2.SS2.SSS4.Px3">
<title>Resource allocation index</title>
<p>The Resource Allocation (RA) index (<xref ref-type="bibr" rid="B2">Dianati et al., 2005</xref>) considers the attribute information of the common neighbors of two nodes. In the link prediction process, common neighbors with higher degrees play a lesser role than those with lower degrees, and the weight of a common neighbor decreases in the form of <inline-formula><mml:math id="INEQ20"><mml:mrow><mml:mfrac bevelled='true'><mml:mn>1</mml:mn><mml:mi>k</mml:mi></mml:mfrac></mml:mrow></mml:math></inline-formula>, where <italic>k</italic> is its degree. An example is presented in <xref ref-type="supplementary-material" rid="DS3">Supplementary Figure 3</xref>. The RA index (<xref ref-type="bibr" rid="B2">Dianati et al., 2005</xref>) is defined as follows:</p>
<disp-formula id="S2.E14">
<label>(14)</label>
<mml:math id="M14">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>x</mml:mi><mml:mi>y</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo><mml:mstyle displaystyle='true'>
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>z</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo>&#x0393;</mml:mo><mml:mrow><mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2229;</mml:mo><mml:mo>&#x0393;</mml:mo><mml:mrow><mml:mo>(</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo>)</mml:mo></mml:mrow>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mi>K</mml:mi>
<mml:mtext>z</mml:mtext>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>K</italic><sub><italic>z</italic></sub> is the degree of the common neighbor <italic>z</italic> of nodes <italic>v</italic><sub><italic>x</italic></sub> and <italic>v</italic><sub><italic>y</italic></sub>. The calculation process of the RA similarity index is shown in <xref ref-type="supplementary-material" rid="DS3">Supplementary Figure 3</xref>. Assuming that each node distributes its resources equally among its neighbors, the RA index calculates the resources that one node receives from the other through their common neighbors, and this amount is taken as the similarity between nodes <italic>v</italic><sub><italic>x</italic></sub> and <italic>v</italic><sub><italic>y</italic></sub>.</p>
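<p>Equation 14 can be sketched in Python as a sum over common neighbors. The function name and the toy graph are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def resource_allocation(adj, x, y):
    """RA index: each common neighbor z contributes 1 / degree(z)."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


# Hypothetical undirected toy graph stored as neighbor sets.
TOY_ADJ = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}
ra_ab = resource_allocation(TOY_ADJ, "a", "b")  # neighbors c and d, degree 2 each
ra_cd = resource_allocation(TOY_ADJ, "c", "d")  # neighbors a (degree 2) and b (degree 3)
```
]]></preformat>
<p>Both pairs have two common neighbors, so the CN index cannot separate them, whereas the RA index scores the pair c and d lower because its common neighbor b has a higher degree.</p>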
</sec>
</sec>
</sec>
</sec>
<sec sec-type="results" id="S3">
<title>Results</title>
<sec id="S3.SS1">
<title>Comparison of Protein Similarity and the Actual PPI Network Coincidence Degree</title>
<p>We downloaded the human and yeast protein interaction networks from the STRING database. We then mapped the proteins to the UniProt database, filtered out the proteins that could not be found in UniProt, and removed duplicate edges. After filtering, the Yeast dataset consisted of 2,877 proteins with 228,468 interactions, and the Human dataset consisted of 6,882 proteins with 892,054 interactions. Finally, to verify the validity of the calculated protein similarity networks, we compared their degree of coincidence with the actual PPI networks.</p>
<p>Only the results for the Human dataset are shown here (<xref ref-type="fig" rid="F3">Figure 3</xref>); the results for the Yeast dataset are shown in <xref ref-type="supplementary-material" rid="DS3">Supplementary Figures 4</xref>, <xref ref-type="supplementary-material" rid="DS3">5</xref>.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Human protein similarity network (&#x03C4; &#x003E; 0.4) and PPI coincidence degree. <bold>(A)</bold> Cosine, <bold>(B)</bold> DTW.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fgene-12-744334-g003.tif"/>
</fig>
<p>We selected the protein similarity networks (&#x03C4; &#x003E; 0.4) and compared them with the PPI dataset downloaded from the STRING database to analyze the coincidence degree of the Human and Yeast protein networks. Furthermore, we compared the edge coincidence of the protein similarity networks based on different graph embedding methods (as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>). The coincidence degree was calculated as <inline-formula><mml:math id="INEQ29"><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mi>a</mml:mi></mml:msub><mml:mo>&#x2229;</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mi>b</mml:mi></mml:msub></mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>E</mml:mi><mml:mi>a</mml:mi></mml:msub></mml:mpadded><mml:mo>&gt;</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mi>b</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>.</p>
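<p>The coincidence degree can be illustrated with a minimal Python sketch. It assumes that the quantities in the formula are set sizes, that edges are undirected, and that <italic>E</italic><sub><italic>a</italic></sub> is the edge set used as the denominator; the function names and protein labels are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def normalize(edges):
    """Store each undirected edge as a frozenset so (u, v) equals (v, u)."""
    return {frozenset(edge) for edge in edges}


def coincidence_degree(edges_a, edges_b):
    """Share of the edges in edges_a that also occur in edges_b."""
    set_a, set_b = normalize(edges_a), normalize(edges_b)
    if not set_a:
        raise ValueError("edges_a must not be empty")
    return len(set_a & set_b) / len(set_a)


# Hypothetical edge lists: a similarity network and a PPI network.
toy_similarity_edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p4", "p5")]
toy_ppi_edges = [("p2", "p1"), ("p3", "p4"), ("p5", "p6")]
overlap = coincidence_degree(toy_similarity_edges, toy_ppi_edges)
```
]]></preformat>
<p>Two of the four similarity edges also occur in the toy PPI list, and the reversed pair is matched because edge direction is ignored.</p>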
<p>Comparing the GO(DTW) and GOA(cosine) methods shows that the Node2vec graph embedding method performed best in the GO graph. The SDNE and LINE methods performed better in the GOA graph, although their results differed little between the GOA graph and the GO graph. In contrast, Node2vec and DeepWalk performed better in the GO graph. In general, protein similarity calculated with the different graph embedding methods performed better in the GO graph than in the GOA graph. These results indicate that graph embedding methods can be effective for calculating protein similarity in GO and GOA graphs, and that using the DTW method to compare protein vector sets of different sizes is feasible.</p>
</sec>
<sec id="S3.SS2">
<title>Comparison of Link Prediction Results Based on Different Graph Embedding Methods in GO Graph</title>
<p>The features of GO terms are learned from the GO graph with different graph embedding methods, and the similarity among proteins is then calculated. We select the top 5%, middle 5%, and last 5% of the protein similarity network data, perform link prediction on each filtered protein similarity network, and calculate the AUC and AUCPR values (as shown in <xref ref-type="fig" rid="F4">Figure 4</xref> and <xref ref-type="table" rid="T2">Table 2</xref>). Only the results for the Human dataset are shown here; the results for the Yeast dataset are shown in <xref ref-type="supplementary-material" rid="DS3">Supplementary Figure 6</xref> and <xref ref-type="supplementary-material" rid="DS3">Supplementary Table 1</xref>.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Comparison of prediction results of Human protein similarity networks.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fgene-12-744334-g004.tif"/>
</fig>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>AUCPR value of protein similarity prediction in the Human dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Method</td>
<td valign="top" align="center">The top 5% of the network</td>
<td valign="top" align="center">The middle 5% of the network</td>
<td valign="top" align="center">The last 5% of the network</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SDNE</td>
<td valign="top" align="center">0.9105</td>
<td valign="top" align="center">0.0076</td>
<td valign="top" align="center">0.0052</td>
</tr>
<tr>
<td valign="top" align="left">Node2vec</td>
<td valign="top" align="center"><bold>0.9115</bold></td>
<td valign="top" align="center"><bold>0.0143</bold></td>
<td valign="top" align="center"><bold>0.0055</bold></td>
</tr>
<tr>
<td valign="top" align="left">DeepWalk</td>
<td valign="top" align="center">0.8220</td>
<td valign="top" align="center">0.0127</td>
<td valign="top" align="center">0.0052</td>
</tr>
<tr>
<td valign="top" align="left">LINE</td>
<td valign="top" align="center">0.7117</td>
<td valign="top" align="center">0.0097</td>
<td valign="top" align="center">0.0052</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p><italic>Bold means the best result in the comparative experiment.</italic></p></fn>
</table-wrap-foot>
</table-wrap>
<p>The AUC value decreases as the similarity of the selected part of the network decreases. The proteins in the top 5% of the protein similarity network are the most similar, and the AUCPR values show that the Node2vec method performs best in the top, middle, and last 5% of the protein similarity networks. The Node2vec method introduces BFS and DFS into the generation of random walk sequences through two parameters, <italic>p</italic> and <italic>q</italic>. BFS focuses on adjacent nodes and therefore characterizes the local structural properties of the graph, whereas DFS moves farther from the starting node and explores the global similarity in context. The gradual decrease in the AUC value as the screened protein similarity decreases further shows that the edges of the protein similarity network calculated by the graph embedding method are reliable.</p>
<p>We also found that the Node2vec graph embedding method performed well in calculating the Yeast protein similarity network (as shown in <xref ref-type="supplementary-material" rid="DS3">Supplementary Figure 6</xref> and <xref ref-type="supplementary-material" rid="DS3">Supplementary Table 1</xref>). Because the GO term vectors fuse the local and global information of nodes in the GO graph, they contain more information, so the GO(DTW) method performs better in computing protein similarity.</p>
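<p>The role of <italic>p</italic> and <italic>q</italic> can be made concrete with a short sketch of the second-order transition rule described by Grover and Leskovec (2016) for an unweighted graph. The function name, the adjacency dictionary, and the toy graph are illustrative assumptions and are not taken from the code used in this study.</p>
<preformat><![CDATA[
```python
def node2vec_weights(graph, prev, curr, p=1.0, q=1.0):
    """Transition probabilities out of `curr`, given the walk came from `prev`.

    `graph` maps each node to the set of its neighbors (unweighted, undirected).
    """
    weights = {}
    for nxt in graph[curr]:
        if nxt == prev:              # distance 0 from prev: step back
            weights[nxt] = 1.0 / p
        elif nxt in graph[prev]:     # distance 1 from prev: stay local (BFS-like)
            weights[nxt] = 1.0
        else:                        # distance 2 from prev: move away (DFS-like)
            weights[nxt] = 1.0 / q
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Toy graph: a triangle a-b-c with a pendant node d attached to b.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
probs = node2vec_weights(graph, prev="a", curr="b", p=1.0, q=4.0)
```
]]></preformat>
<p>With <italic>q</italic> greater than 1, the walk is less likely to move to the distant node d than to the nearby node c, which biases sampling toward the local neighborhood; a value of <italic>q</italic> below 1 has the opposite effect.</p>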
</sec>
<sec id="S3.SS3">
<title>Comparison of Link Prediction Results Based on Different Graph Embedding Methods in the GOA Graph</title>
<p>To reflect the influence of the structural information of GO annotations on proteins, the features of proteins are learned from the GOA graph with each graph embedding method, and the similarity between proteins is calculated (as shown in <xref ref-type="fig" rid="F5">Figure 5</xref> and <xref ref-type="table" rid="T3">Table 3</xref>). Only the results for the Human dataset are shown here; the results for the Yeast dataset are presented in <xref ref-type="supplementary-material" rid="DS3">Supplementary Figure 7</xref> and <xref ref-type="supplementary-material" rid="DS3">Supplementary Table 2</xref>.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Comparison of prediction results of Human protein similarity networks.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fgene-12-744334-g005.tif"/>
</fig>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>AUCPR value of Human protein similarity prediction.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Method</td>
<td valign="top" align="center">The top 5% of the network</td>
<td valign="top" align="center">The middle 5% of the network</td>
<td valign="top" align="center">The last 5% of the network</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SDNE</td>
<td valign="top" align="center">0.6578</td>
<td valign="top" align="center">0.0100</td>
<td valign="top" align="center">0.0052</td>
</tr>
<tr>
<td valign="top" align="left">Node2vec</td>
<td valign="top" align="center"><bold>0.8758</bold></td>
<td valign="top" align="center"><bold>0.0105</bold></td>
<td valign="top" align="center"><bold>0.0069</bold></td>
</tr>
<tr>
<td valign="top" align="left">DeepWalk</td>
<td valign="top" align="center">0.8719</td>
<td valign="top" align="center">0.0094</td>
<td valign="top" align="center">0.0053</td>
</tr>
<tr>
<td valign="top" align="left">LINE</td>
<td valign="top" align="center">0.8189</td>
<td valign="top" align="center">0.0095</td>
<td valign="top" align="center">0.0053</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p><italic>Bold means the best result in the comparative experiment.</italic></p></fn>
</table-wrap-foot>
</table-wrap>
<p>We screened the top, middle, and last 5% of the protein similarity networks and performed link prediction experiments to observe the AUC and AUCPR values obtained with each method. Both values decreased gradually from the top 5% to the last 5% of the network. The Node2vec method again performed better than the other graph embedding methods when used in the GOA(cosine) method. We performed the same experiment on the Yeast protein similarity network and reached the same conclusions. The SDNE graph embedding method also showed excellent performance on the Yeast dataset (as shown in <xref ref-type="supplementary-material" rid="DS3">Supplementary Table 2</xref>). This is because the SDNE method also defines first-order and second-order similarities, so the protein similarity network calculated from its vectors achieved excellent results in the prediction task.</p>
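<p>For readers unfamiliar with the two metrics, the following sketch computes AUC and an average-precision estimate of AUCPR from a list of scores and binary labels. The estimator used for AUCPR in this study is not restated here, so average precision is an assumption, and the scores and labels are toy values.</p>
<preformat><![CDATA[
```python
def auc(scores, labels):
    """Probability that a true link is scored above a non-link (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each true link."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy example: five candidate protein pairs, two of which are true links.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
auc_value = auc(scores, labels)
ap_value = average_precision(scores, labels)
```
]]></preformat>
<p>The chance level of AUC is 0.5, whereas the chance level of AUCPR is approximately the fraction of true links among the candidate pairs, so very small AUCPR values are expected when true links are rare.</p>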
</sec>
<sec id="S3.SS4">
<title>Comparison of Link Prediction Results of Protein Similarity Calculated by IC-Based Method and Based on Graph Embedding Methods</title>
<p>We compared protein similarity calculated by different graph embedding methods in the GO and GOA graphs with that calculated by IC-based methods. We screened the top 5% of the protein similarity networks for link prediction analysis (as shown in <xref ref-type="table" rid="T4">Table 4</xref>). Furthermore, we calculated the density of the protein similarity networks obtained with the graph embedding and IC-based methods (as shown in <xref ref-type="table" rid="T5">Table 5</xref>). Only the results for the Human dataset are presented here; the results for the Yeast dataset are presented in <xref ref-type="supplementary-material" rid="DS3">Supplementary Tables 3</xref>, <xref ref-type="supplementary-material" rid="DS3">4</xref>.</p>
<table-wrap position="float" id="T4">
<label>TABLE 4</label>
<caption><p>AUCPR and AUC values of Human protein similarity prediction (the top 5% of the similarity network).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Method</td>
<td valign="top" align="center">AUC</td>
<td valign="top" align="center">AUCPR</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SDNE (cosine/DTW)</td>
<td valign="top" align="center">0.9699/<bold>0.9739</bold></td>
<td valign="top" align="center">0.9015/<bold>0.9105</bold></td>
</tr>
<tr>
<td valign="top" align="left">Node2vec (cosine/DTW)</td>
<td valign="top" align="center">0.9714/<bold>0.983</bold></td>
<td valign="top" align="center">0.8758/<bold>0.9115</bold></td>
</tr>
<tr>
<td valign="top" align="left">DeepWalk (cosine/DTW)</td>
<td valign="top" align="center"><bold>0.9925</bold>/0.9752</td>
<td valign="top" align="center"><bold>0.8719</bold>/0.8220</td>
</tr>
<tr>
<td valign="top" align="left">LINE (cosine/DTW)</td>
<td valign="top" align="center"><bold>0.9839</bold>/0.9716</td>
<td valign="top" align="center"><bold>0.8189</bold>/0.7117</td>
</tr>
<tr>
<td valign="top" align="left">Rel.</td>
<td valign="top" align="center">0.9067</td>
<td valign="top" align="center">0.1519</td>
</tr>
<tr>
<td valign="top" align="left">Jiang and Conrath</td>
<td valign="top" align="center">0.8409</td>
<td valign="top" align="center">0.0669</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p><italic>Bold means the best result in the comparative experiment.</italic></p></fn>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T5">
<label>TABLE 5</label>
<caption><p>Comparison of Human protein similarity network density between different methods.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Method</td>
<td valign="top" align="center">Nodes</td>
<td valign="top" align="center">Edges</td>
<td valign="top" align="center">Density</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SDNE (cosine/DTW)</td>
<td valign="top" align="center">4,797/2,024</td>
<td valign="top" align="center">1,183,801/713,961</td>
<td valign="top" align="center">0.1/<bold>0.3</bold></td>
</tr>
<tr>
<td valign="top" align="left">Node2vec (cosine/DTW)</td>
<td valign="top" align="center">6,882/2,807</td>
<td valign="top" align="center">2,841,303/1,183,762</td>
<td valign="top" align="center">0.12/<bold>0.3</bold></td>
</tr>
<tr>
<td valign="top" align="left">DeepWalk (cosine/DTW)</td>
<td valign="top" align="center">6,882/3,079</td>
<td valign="top" align="center">1,183,876/1,183,707</td>
<td valign="top" align="center">0.05/<bold>0.2</bold></td>
</tr>
<tr>
<td valign="top" align="left">LINE (cosine/DTW)</td>
<td valign="top" align="center">5,586/1,660</td>
<td valign="top" align="center">1,183,815/206,650</td>
<td valign="top" align="center">0.07/<bold>0.15</bold></td>
</tr>
<tr>
<td valign="top" align="left">Rel</td>
<td valign="top" align="center">5,902</td>
<td valign="top" align="center">870,987</td>
<td valign="top" align="center">0.05</td>
</tr>
<tr>
<td valign="top" align="left">Jiang and Conrath</td>
<td valign="top" align="center">5,883</td>
<td valign="top" align="center">870,986</td>
<td valign="top" align="center">0.05</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p><italic>Bold means the best result in the comparative experiment.</italic></p></fn>
</table-wrap-foot>
</table-wrap>
<p>The link prediction results of these methods are compared in <xref ref-type="table" rid="T4">Table 4</xref>, which shows that protein similarity calculated with the graph embedding methods is superior to that calculated with the IC-based methods. The same experiment on the Yeast dataset led to the same conclusion (as shown in <xref ref-type="supplementary-material" rid="DS3">Supplementary Table 3</xref>). The SDNE and Node2vec graph embedding methods perform well in the GO graph. In the top 5% of the Human protein similarity networks, the density of the networks calculated by the graph embedding methods is generally higher than that of the networks calculated by the IC-based methods. The protein similarity network calculated by the IC-based methods is therefore sparse, and the similarity of its proteins is not as high as that calculated by the graph embedding methods, so the AUCPR value obtained by the IC-based methods in link prediction is lower. We also verified this conclusion on the Yeast dataset (as shown in <xref ref-type="supplementary-material" rid="DS3">Supplementary Table 4</xref>).</p>
<p>With each graph embedding method, the features of the GO terms were learned as vector representations that incorporate the topology of the GO graph. The learned vectors thus capture global information, and the similarity between proteins can be calculated from them with the DTW distance. The link prediction results show that the GO(DTW) method performed better than the GOA(cosine) method, and most of the protein similarity networks calculated by the GO(DTW) method are denser than those calculated by the GOA(cosine) method.</p>
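<p>The density values can be checked with a one-line calculation. The sketch below assumes that the density in <xref ref-type="table" rid="T5">Table 5</xref> is the standard density of an undirected simple graph, 2E/(N(N - 1)); under that assumption, the node and edge counts reported for SDNE (cosine) and Rel give values that round to the reported 0.1 and 0.05.</p>
<preformat><![CDATA[
```python
def density(nodes, edges):
    """Density of an undirected simple graph: 2E / (N * (N - 1))."""
    return 2.0 * edges / (nodes * (nodes - 1))


# Node and edge counts taken from Table 5 (Human dataset).
sdne_cosine = density(4797, 1183801)
rel = density(5902, 870987)
```
]]></preformat>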
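<p>As an illustration of the DTW distance mentioned above, the following sketch aligns two sequences of vectors with the classic dynamic programming recurrence. It assumes that each protein is represented by a sequence of GO term vectors and that the local cost is the Euclidean distance; both choices, as well as the toy vectors, are assumptions and may differ from the formulation used in this study.</p>
<preformat><![CDATA[
```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic DTW distance between two sequences of equal-length vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean local cost
            cost[i][j] = local + min(cost[i - 1][j],       # advance in seq_a only
                                     cost[i][j - 1],       # advance in seq_b only
                                     cost[i - 1][j - 1])   # advance in both
    return cost[n][m]


# Toy GO term vectors for three hypothetical proteins.
protein_a = [(0.0, 0.0), (1.0, 0.0)]
protein_b = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
protein_c = [(0.0, 0.0), (0.0, 1.0)]
d_ab = dtw_distance(protein_a, protein_b)
d_ac = dtw_distance(protein_a, protein_c)
```
]]></preformat>
<p>Because DTW allows one vector to be matched to several vectors in the other sequence, proteins annotated with different numbers of GO terms can still be compared; here the repeated term in the second toy protein adds no cost.</p>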
</sec>
<sec id="S3.SS5">
<title>Link Prediction Results Under Different Similarity Indexes</title>
<p>We performed link prediction experiments with three different similarity indexes (CN, JC, and RA) on the top 5% of the protein similarity network and found that the AUC value differs only slightly between the similarity indexes, which indicates that the calculated protein similarity network structure has improved (as shown in <xref ref-type="table" rid="T6">Table 6</xref>). Only the results for the Human dataset are presented here; the results for the Yeast dataset are presented in <xref ref-type="supplementary-material" rid="DS3">Supplementary Table 5</xref>.</p>
<table-wrap position="float" id="T6">
<label>TABLE 6</label>
<caption><p>Prediction results under different similarity indexes (the top 5% of the Human protein similarity network).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Similarity index</td>
<td valign="top" align="center">CN</td>
<td valign="top" align="center">JC</td>
<td valign="top" align="center">RA</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SDNE (cosine/DTW)</td>
<td valign="top" align="center">0.9694/0.981</td>
<td valign="top" align="center">0.9739/0.9843</td>
<td valign="top" align="center"><bold>0.9818</bold>/<bold>0.9886</bold></td>
</tr>
<tr>
<td valign="top" align="left">Node2vec (cosine/DTW)</td>
<td valign="top" align="center">0.9598/0.9809</td>
<td valign="top" align="center">0.9714/0.9843</td>
<td valign="top" align="center"><bold>0.9856</bold>/<bold>0.9886</bold></td>
</tr>
<tr>
<td valign="top" align="left">DeepWalk (cosine/DTW)</td>
<td valign="top" align="center">0.9772/0.981</td>
<td valign="top" align="center">0.9856/0.9842</td>
<td valign="top" align="center"><bold>0.9885</bold>/<bold>0.9884</bold></td>
</tr>
<tr>
<td valign="top" align="left">LINE (cosine/DTW)</td>
<td valign="top" align="center">0.9703/0.9716</td>
<td valign="top" align="center">0.9716/0.9825</td>
<td valign="top" align="center"><bold>0.9874</bold>/<bold>0.9853</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p><italic>Bold means the best result in the comparative experiment.</italic></p></fn>
</table-wrap-foot>
</table-wrap>
<p>Among the three similarity indexes, the AUC value obtained with the RA index in link prediction is slightly higher than those obtained with the other two indexes. Furthermore, the top 5% of the protein similarity network had high AUC values under all three similarity indexes, indicating that the graph embedding method effectively calculated protein similarity. We obtained the same conclusion in the experiment with the Yeast dataset (as shown in <xref ref-type="supplementary-material" rid="DS3">Supplementary Table 5</xref>).</p>
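<p>The three indexes can be sketched as neighborhood-based scores for a candidate pair of nodes. The sketch assumes that CN, JC, and RA denote the common neighbors, Jaccard coefficient, and resource allocation indexes in their standard forms; the adjacency dictionary and toy graph are illustrative only.</p>
<preformat><![CDATA[
```python
def cn(graph, u, v):
    """Common neighbors: number of nodes adjacent to both u and v."""
    return len(graph[u] & graph[v])


def jc(graph, u, v):
    """Jaccard coefficient: shared neighbors over all neighbors of u or v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def ra(graph, u, v):
    """Resource allocation: each shared neighbor z contributes 1 / degree(z)."""
    return sum(1.0 / len(graph[z]) for z in graph[u] & graph[v])


# Toy undirected graph stored as node -> set of neighbors.
graph = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
scores = (cn(graph, "a", "b"), jc(graph, "a", "b"), ra(graph, "a", "b"))
```
]]></preformat>
<p>RA down-weights shared neighbors that have many connections, which is one reason it can rank candidate links differently from CN and JC even though all three rely on the same shared neighborhood.</p>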
</sec>
</sec>
<sec sec-type="discussion" id="S4">
<title>Discussion</title>
<p>Gene Ontology (GO) is one of many biological ontologies. Its emergence and development have reduced the confusion surrounding biological concepts and terms and provided a systematically defined three-part structure (BP, MF, and CC) for describing the functions of proteins. Describing protein similarity based on GO terms is therefore important for understanding protein function.</p>
<p>In this paper, by fusing the topology information of the GO terms, we learned the features of GO terms and proteins as vector representations in the GO and GOA graphs with different graph embedding methods. The similarity of proteins was then calculated from these vectors using DTW and cosine similarity. Finally, the protein similarity networks were screened at different percentages, and link prediction experiments were used to evaluate the prediction accuracy of the resulting networks. The experimental results indicate that the graph embedding method is better than the IC-based method in protein similarity calculation. Of the two graph embedding approaches, the GO(DTW) method performs better than the GOA(cosine) method. This is because the GO terms and proteins are treated equally in the GOA graph, so some information may be ignored when the low-dimensional embeddings of proteins are learned. As a result, the protein similarity network calculated by the GOA(cosine) method does not coincide with the actual PPI data as closely as the network calculated by the GO(DTW) method.</p>
<p>Our method has potential limitations. First, we transformed directed graphs into undirected graphs, which might result in a loss of structural information. Second, we treated the GO terms and the proteins equally in the GOA graph, which may ignore some information. In future work, we therefore plan to learn protein representations by incorporating the information in the directed graph and by considering representation learning on heterogeneous graphs that contain both GO terms and proteins.</p>
</sec>
<sec sec-type="data-availability" id="S5">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="DS1">Supplementary Material</xref>; further inquiries can be directed to the corresponding author/s.</p>
</sec>
<sec id="S6">
<title>Author Contributions</title>
<p>YZ conceived the idea and prepared the experimental data. ZW and YZ debugged the code, conducted the experiments, interpreted the results, and wrote and edited the manuscript. SW and JS advised the study and reviewed the manuscript. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec sec-type="funding-information" id="S7">
<title>Funding</title>
<p>This work was supported by the National Natural Science Foundation of China (Grant Nos. 61902430, 61873281, and 61972226).</p>
</sec>
<ack>
<p>We would like to thank LetPub (<ext-link ext-link-type="uri" xlink:href="http://www.letpub.com">www.letpub.com</ext-link>) for its linguistic assistance during the preparation of this manuscript.</p>
</ack>
<sec id="S9" sec-type="supplementary-material"><title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fgene.2021.744334/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fgene.2021.744334/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.ZIP" id="DS1" mimetype="application/zip" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Data_Sheet_2.ZIP" id="DS2" mimetype="application/zip" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Data_Sheet_3.docx" id="DS3" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Amos</surname> <given-names>B.</given-names></name> <name><surname>Brigitte</surname> <given-names>B.</given-names></name></person-group> (<year>1999</year>). <article-title>The SWISS-PROT protein sequence data bank.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>22</volume> <fpage>49</fpage>&#x2013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1093/nar/22.17.3626</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dianati</surname> <given-names>M.</given-names></name> <name><surname>Shen</surname> <given-names>X.</given-names></name> <name><surname>Naik</surname> <given-names>S.</given-names></name></person-group> (<year>2005</year>). &#x201C;<article-title>A new fairness index for radio resource allocation in wireless networks</article-title>,&#x201D; in <source><italic>Proceedings of the Wireless Communications &#x0026; Networking Conference</italic></source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>785</fpage>&#x2013;<lpage>890</lpage>.</citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goldberg</surname> <given-names>Y.</given-names></name> <name><surname>Levy</surname> <given-names>O.</given-names></name></person-group> (<year>2014</year>). <article-title>word2vec Explained: deriving Mikolov et al.&#x2019;s negative-sampling word-embedding method.</article-title> <source><italic>OALib J.</italic></source> <volume>14</volume> <fpage>144</fpage>&#x2013;<lpage>156</lpage>. <pub-id pub-id-type="doi">10.1017/S1351324916000334</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grover</surname> <given-names>A.</given-names></name> <name><surname>Leskovec</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>node2vec: scalable feature learning for networks</article-title>,&#x201D; in <source><italic>Proceedings of the 22nd ACM SIGKDD International Conference</italic></source>, (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>855</fpage>&#x2013;<lpage>864</lpage>.</citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harris</surname> <given-names>M. A.</given-names></name></person-group> (<year>2004</year>). <article-title>The gene ontology (GO) database and informatics resource.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>32</volume> <fpage>258</fpage>&#x2013;<lpage>261</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkh036</pub-id> <pub-id pub-id-type="pmid">14681407</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>L.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>Y. A.</given-names></name> <name><surname>Hu</surname> <given-names>P.</given-names></name> <name><surname>You</surname> <given-names>Z. H.</given-names></name></person-group> (<year>2021</year>). <article-title>A survey on computational models for predicting protein&#x2013;protein interactions.</article-title> <source><italic>Bioinformatics.</italic></source> <volume>05</volume> <fpage>77</fpage>&#x2013;<lpage>85</lpage>. <pub-id pub-id-type="doi">10.1093/bib/bbab036</pub-id> <pub-id pub-id-type="pmid">33693513</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>J. J.</given-names></name> <name><surname>Conrath</surname> <given-names>D. W.</given-names></name></person-group> (<year>1997</year>). &#x201C;<article-title>Semantic similarity based on corpus statistics and lexical taxonomy</article-title>,&#x201D; in <source><italic>Proceedings of the 10th Research on Computational Linguistics International Conference</italic></source>, <volume>Vol. 11</volume> (<publisher-loc>Taipei</publisher-loc>: <publisher-name>The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)</publisher-name>), <fpage>115</fpage>&#x2013;<lpage>123</lpage>.</citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name> <name><surname>Huang</surname> <given-names>T.</given-names></name> <name><surname>Chen</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>Similarity-based future common neighbors model for link prediction in complex networks.</article-title> <source><italic>Sci. Rep.</italic></source> <volume>19</volume> <fpage>518</fpage>&#x2013;<lpage>524</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-018-35423-2</pub-id> <pub-id pub-id-type="pmid">30451945</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lobo</surname> <given-names>J. M.</given-names></name></person-group> (<year>2010</year>). <article-title>AUC: a misleading measure of the performance of predictive distribution models.</article-title> <source><italic>Glob. Ecol.</italic></source> <volume>17</volume> <fpage>145</fpage>&#x2013;<lpage>151</lpage>. <pub-id pub-id-type="doi">10.1111/j.1466-8238.2007.00358</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lord</surname> <given-names>P. W.</given-names></name> <name><surname>Stevens</surname> <given-names>R. D.</given-names></name> <name><surname>Brass</surname> <given-names>A.</given-names></name> <name><surname>Goble</surname> <given-names>C. A.</given-names></name></person-group> (<year>2003</year>). <article-title>Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation.</article-title> <source><italic>Bioinformatics.</italic></source> <volume>19</volume> <fpage>1275</fpage>&#x2013;<lpage>1283</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btg153</pub-id> <pub-id pub-id-type="pmid">12835272</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lou</surname> <given-names>Y.</given-names></name> <name><surname>Ao</surname> <given-names>H.</given-names></name> <name><surname>Dong</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>Improvement of dynamic time warping (DTW) Algorithm</article-title>,&#x201D; in <source><italic>Proceedings of the 2015 14th International Symposium on Distributed Computing and Applications for Business Engineering and Science (DCABES)</italic></source>, <volume>14</volume> (<publisher-loc>Guiyang</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>18</fpage>&#x2013;<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1109/DCABES.2015.103</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paul</surname> <given-names>P.</given-names></name> <name><surname>Meeta</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>Gene Ontology term overlap as a measure of gene functional similarity.</article-title> <source><italic>BMC Bioinformatics.</italic></source> <volume>9</volume>:<issue>327</issue>. <pub-id pub-id-type="doi">10.1186/1471-2105-9-327</pub-id> <pub-id pub-id-type="pmid">18680592</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perozzi</surname> <given-names>B.</given-names></name> <name><surname>Al-Rfou</surname> <given-names>R.</given-names></name> <name><surname>Skiena</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). &#x201C;<article-title>DeepWalk: online learning of social representations</article-title>,&#x201D; in <source><italic>Proceedings of the 2014 ACM SIGKDD International Conference on Knowledge Discovery &#x0026; Data Mining</italic></source>, (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>701</fpage>&#x2013;<lpage>740</lpage>.</citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pesaranghader</surname> <given-names>A.</given-names></name> <name><surname>Rezaei</surname> <given-names>A.</given-names></name> <name><surname>Davoodi</surname> <given-names>D.</given-names></name></person-group> (<year>2014</year>). <article-title>Gene functional similarity analysis by definition-based semantic similarity measurement of GO terms.</article-title> <source><italic>Lecture Notes Bioinformatics.</italic></source> <volume>12</volume> <fpage>203</fpage>&#x2013;<lpage>214</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-06483-3_18</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ran</surname> <given-names>S.</given-names></name> <name><surname>Ngan</surname> <given-names>K. N.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). &#x201C;<article-title>Jaccard index compensation for object segmentation evaluation</article-title>,&#x201D; in <source><italic>Proceedings of the 2014 IEEE International Conference on Image Processing</italic></source>, (<publisher-loc>Paris</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>253</fpage>&#x2013;<lpage>259</lpage>.</citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Resnik</surname> <given-names>P.</given-names></name></person-group> (<year>1999</year>). <article-title>Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language.</article-title> <source><italic>J. Artif. Intell. Res.</italic></source> <volume>11</volume> <fpage>95</fpage>&#x2013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.1613/jair.514</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sevilla</surname> <given-names>J. L.</given-names></name> <name><surname>Segura</surname> <given-names>V.</given-names></name> <name><surname>Podhorski</surname> <given-names>A.</given-names></name> <name><surname>Guruceaga</surname> <given-names>E.</given-names></name> <name><surname>Mato</surname> <given-names>J.</given-names></name> <name><surname>Mart&#x00ED;nez-Cruz</surname> <given-names>L. A.</given-names></name><etal/></person-group> (<year>2005</year>). <article-title>Correlation between gene expression and GO semantic similarity.</article-title> <source><italic>IEEE/ACM Trans. Comput. Biol. Bioinformatics.</italic></source> <volume>24</volume> <fpage>330</fpage>&#x2013;<lpage>338</lpage>. <pub-id pub-id-type="doi">10.1109/TCBB.2005.50</pub-id> <pub-id pub-id-type="pmid">17044170</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Qu</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>M.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Mei</surname> <given-names>Q.</given-names></name></person-group> (<year>2015</year>). &#x201C;<article-title>LINE: large-scale information network embedding</article-title>,&#x201D; in <source><italic>Proceedings of the 24th International Conference on World Wide Web</italic></source>, (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1067</fpage>&#x2013;<lpage>1077</lpage>.</citation></ref>
<ref id="B19"><citation citation-type="journal"><collab>UniProt Consortium</collab> (<year>2015</year>). <article-title>UniProt: a hub for protein information.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>32</volume> <fpage>115</fpage>&#x2013;<lpage>119</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkh131</pub-id> <pub-id pub-id-type="pmid">14681372</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Cui</surname> <given-names>P.</given-names></name> <name><surname>Zhu</surname> <given-names>W.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>Structural deep network embedding</article-title>,&#x201D; in <source><italic>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</italic></source>, (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1225</fpage>&#x2013;<lpage>1234</lpage>.</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Shang</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>SINE: second-order information network embedding.</article-title> <source><italic>IEEE Access</italic></source> <volume>1</volume> <fpage>98</fpage>&#x2013;<lpage>110</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2020.3007886</pub-id></citation></ref>
<ref id="B100"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xi</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>A.</given-names></name> <name><surname>Wang</surname> <given-names>M.</given-names></name></person-group> (<year>2020a</year>). <article-title>HetRCNA: a novel method to identify recurrent copy number alternations from heterogeneous tumor samples based on matrix decomposition framework.</article-title> <source><italic>IEEE/ACM Trans. Comput. Biol. Bioinform.</italic></source> <volume>17</volume> <fpage>422</fpage>&#x2013;<lpage>434</lpage>. <pub-id pub-id-type="doi">10.1109/TCBB.2018.2846599</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xi</surname> <given-names>J.</given-names></name> <name><surname>Ye</surname> <given-names>L.</given-names></name> <name><surname>Huang</surname> <given-names>Q.</given-names></name></person-group> (<year>2021</year>). &#x201C;<article-title>Tolerating data missing in breast cancer diagnosis from clinical ultrasound reports via knowledge graph inference</article-title>,&#x201D; in <source><italic>Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD &#x2019;21)</italic></source>, (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1145/3447548.3467106</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xi</surname> <given-names>J.</given-names></name> <name><surname>Yuan</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>A.</given-names></name> <name><surname>Huang</surname> <given-names>Q.</given-names></name></person-group> (<year>2020b</year>). <article-title>Inferring subgroup-specific driver genes from heterogeneous cancer samples via subspace learning with subgroup indication.</article-title> <source><italic>Bioinformatics</italic></source> <volume>36</volume> <fpage>1855</fpage>&#x2013;<lpage>1863</lpage>.</citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>W.</given-names></name> <name><surname>Park</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>AucPR: an AUC-based approach using penalized regression for disease prediction with high-dimensional omics data.</article-title> <source><italic>BMC Genomics</italic></source> <volume>15</volume>:<issue>S1</issue>. <pub-id pub-id-type="doi">10.1186/1471-2164-15-S10-S1</pub-id> <pub-id pub-id-type="pmid">25559769</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhong</surname> <given-names>X.</given-names></name> <name><surname>Kaalia</surname> <given-names>R.</given-names></name> <name><surname>Rajapakse</surname> <given-names>J. C.</given-names></name></person-group> (<year>2019</year>). <article-title>GO2Vec: transforming GO terms and proteins to vector representations via graph embeddings.</article-title> <source><italic>BMC Genomics</italic></source> <volume>20</volume>:<issue>918</issue>. <pub-id pub-id-type="doi">10.1186/s12864-019-6272-2</pub-id> <pub-id pub-id-type="pmid">31874639</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="footnote1">
<label>1</label>
<p><ext-link ext-link-type="uri" xlink:href="http://geneontology.org/page/download-ontology">http://geneontology.org/page/download-ontology</ext-link></p></fn>
<fn id="footnote2">
<label>2</label>
<p><ext-link ext-link-type="uri" xlink:href="http://www.ebi.ac.uk/GOA">http://www.ebi.ac.uk/GOA</ext-link></p></fn>
</fn-group>
</back>
</article>