<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Microbiol.</journal-id>
<journal-title>Frontiers in Microbiology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Microbiol.</abbrev-journal-title>
<issn pub-type="epub">1664-302X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmicb.2023.1078760</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Microbiology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Evaluation of computational phage detection tools for metagenomic datasets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Schackart</surname> <given-names>Kenneth E.</given-names> <suffix>III</suffix></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Graham</surname> <given-names>Jessica B.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Ponsero</surname> <given-names>Alise J.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/655527/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Hurwitz</surname> <given-names>Bonnie L.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1564588/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Biosystems Engineering, The University of Arizona</institution>, <addr-line>Tucson, AZ</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>BIO5 Institute, The University of Arizona</institution>, <addr-line>Tucson, AZ</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Human Microbiome Research Program, Faculty of Medicine, University of Helsinki</institution>, <addr-line>Helsinki</addr-line>, <country>Finland</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Maria Dzunkova, University of Valencia, Spain</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Migun Shakya, Biosciences Division, United States; David Paez-Espino, Joint Genome Institute, United States</p></fn>
<corresp id="c001">&#x002A;Correspondence: Alise J. Ponsero, <email>alise.ponsero@helsinki.fi</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Phage Biology, a section of the journal Frontiers in Microbiology</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>25</day>
<month>01</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>14</volume>
<elocation-id>1078760</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>10</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>09</day>
<month>01</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2023 Schackart, Graham, Ponsero and Hurwitz.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Schackart, Graham, Ponsero and Hurwitz</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>As new computational tools for detecting phage in metagenomes are being rapidly developed, a critical need has emerged to develop systematic benchmarks.</p>
</sec>
<sec>
<title>Methods</title>
<p>In this study, we surveyed 19 metagenomic phage detection tools, 9 of which could be installed and run at scale. Those 9 tools were assessed on several benchmark challenges. Fragmented reference genomes were used to assess the effects of fragment length, low viral content, and phage taxonomy, as well as robustness to eukaryotic contamination and computational resource usage. Simulated metagenomes were used to assess the effects of sequencing and assembly quality on tool performance. Finally, real human gut metagenomes and viromes were used to assess the differences and similarities in the phage communities predicted by the tools.</p>
</sec>
<sec>
<title>Results</title>
<p>We find that the various tools yield strikingly different results. Generally, tools that use a homology approach (VirSorter, MARVEL, viralVerify, VIBRANT, and VirSorter2) demonstrate low false positive rates and robustness to eukaryotic contamination. Conversely, tools that use a sequence composition approach (VirFinder, DeepVirFinder, Seeker), and MetaPhinder, have higher sensitivity, including to phages with less representation in reference databases. These differences led to widely differing predicted phage communities in human gut metagenomes, with nearly 80% of contigs being marked as phage by at least one tool and a maximum overlap of 38.8% between any two tools. While the results were more consistent among the tools on viromes, the differences in results were still significant, with a maximum overlap of 60.65%.</p>
</sec>
<sec>
<title>Discussion</title>
<p>Importantly, the benchmark datasets developed in this study are publicly available and reusable, so that newly developed tools can be compared on the same data.</p>
</sec>
</abstract>
<kwd-group>
<kwd>microbiome</kwd>
<kwd>bacteriophage</kwd>
<kwd>computational biology</kwd>
<kwd>benchmark</kwd>
<kwd>metagenome</kwd>
<kwd>virome</kwd>
</kwd-group>
<counts>
<fig-count count="10"/>
<table-count count="1"/>
<equation-count count="6"/>
<ref-count count="76"/>
<page-count count="16"/>
<word-count count="11469"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>1. Introduction</title>
<p>Prokaryotic viruses called bacteriophages (phages) are the most abundant biological entities in most ecosystems (<xref ref-type="bibr" rid="B51">Ofir and Sorek, 2018</xref>) and profoundly impact the ecology of natural ecosystems (<xref ref-type="bibr" rid="B9">Breitbart and Rohwer, 2005</xref>; <xref ref-type="bibr" rid="B8">Blazanin and Turner, 2021</xref>). For example, marine viruses have massive effects on ocean biochemistry, influencing nutrient cycling and carbon sequestration by controlling bacterial population growth and altering host metabolic function (as reviewed in <xref ref-type="bibr" rid="B20">Fuhrman, 1999</xref>; <xref ref-type="bibr" rid="B30">Hurwitz and U&#x2019;Ren, 2016</xref>; <xref ref-type="bibr" rid="B10">Breitbart et al., 2018</xref>). Additionally, recent studies have demonstrated the importance of phages in shaping the human microbiota and influencing human health (as reviewed in <xref ref-type="bibr" rid="B17">Edlund et al., 2015</xref>; <xref ref-type="bibr" rid="B45">Manrique et al., 2017</xref>; <xref ref-type="bibr" rid="B65">Sharma et al., 2018</xref>; <xref ref-type="bibr" rid="B41">Li et al., 2021</xref>). Next-generation sequencing techniques have enabled the identification of a rapidly growing number of novel phages and allow for a better understanding of phage populations across ecosystems (<xref ref-type="bibr" rid="B11">Breitbart et al., 2002</xref>).</p>
<p>Despite the increasing number of virome studies, identifying viral sequences in metagenomic datasets is still computationally challenging. Since viruses lack a universal gene marker (e.g., 16S rRNA in prokaryotes), earlier bioinformatics methods to identify viruses from metagenomes often relied on sequence alignment against reference genome databases. Strikingly, in gut viromes, 75&#x2013;99% of viral reads do not produce significant alignments to any known viral genome. This wide range in the fraction of alignable sequences can be partially explained by the high diversity and fast evolution of phages, and by their ability to integrate into host genomes and be mistaken for bacterial sequence. Indeed, an inherent limitation of genome comparison approaches is their dependence on database completeness and on a clear separation between host and viral DNA. Moreover, these reference-based methods typically cannot identify novel phage sequences. To address these limitations, several dedicated computational tools and approaches were proposed. In 2015, the tool VirSorter (<xref ref-type="bibr" rid="B61">Roux et al., 2015</xref>) was released, identifying phage sequences by enrichment of viral hallmark genes, depletion of cellular genes indicated by reduced Pfam hits, and strand shifts. In 2016, MetaPhinder took into account the mosaicism of phage sequences by integrating hits to multiple genomes to classify a sequence as host or viral (<xref ref-type="bibr" rid="B34">Jurtz et al., 2016</xref>).</p>
<p>More recently, bioinformatic tools have leveraged machine learning algorithms to identify features of viral origin, and they typically achieve a broader recall of previously unknown sequences than alignment-based approaches. The features used include gene content and gene density (<xref ref-type="bibr" rid="B2">Amgarten et al., 2018</xref>; <xref ref-type="bibr" rid="B4">Antipov et al., 2020</xref>; <xref ref-type="bibr" rid="B37">Kieft et al., 2020</xref>; <xref ref-type="bibr" rid="B67">Tisza et al., 2020</xref>; <xref ref-type="bibr" rid="B25">Guo et al., 2021</xref>) and protein families (<xref ref-type="bibr" rid="B2">Amgarten et al., 2018</xref>), which are used to train classification models including random forests (<xref ref-type="bibr" rid="B2">Amgarten et al., 2018</xref>; <xref ref-type="bibr" rid="B25">Guo et al., 2021</xref>), naive Bayes classifiers (<xref ref-type="bibr" rid="B4">Antipov et al., 2020</xref>), and neural networks (<xref ref-type="bibr" rid="B37">Kieft et al., 2020</xref>). Interestingly, some authors also proposed using the differential <italic>k</italic>-mer (short sequences of length <italic>k</italic>) frequencies between phages and prokaryotes for sequence classification (<xref ref-type="bibr" rid="B15">Deaton et al., 2017</xref>; <xref ref-type="bibr" rid="B57">Ren et al., 2017</xref>). These methods allow the detection of shorter phage sequences, as they do not require multiple open reading frames (ORFs) for classification, which are difficult to obtain from fragmentary metagenomic data. However, the classification results and the rationale behind the classification are typically difficult to interpret.</p>
<p>All in all, we identified 19 tools published between 2015 and 2021 that were designed for detecting phage in metagenomes. This makes the development of benchmarking datasets critical, both for exploring the limitations and biases of the currently available tools and for facilitating future tool development. Similar benchmark efforts are currently available for other computational tasks such as metagenome assembly, binning, and taxonomic profiling (<xref ref-type="bibr" rid="B63">Sczyrba et al., 2017</xref>; <xref ref-type="bibr" rid="B48">Meyer et al., 2019</xref>). Recently, several efforts to benchmark phage detection tools have been published that explore the ability of these tools to correctly identify and classify dsDNA viruses and curate auxiliary metabolic genes (<xref ref-type="bibr" rid="B55">Pratama et al., 2021</xref>; <xref ref-type="bibr" rid="B28">Ho et al., 2022</xref>). However, the potential impact of parameters such as sequence length, sequencing error, eukaryotic contamination, quality of assembly, and phage taxonomy on the tools&#x2019; classification performance has not been explored.</p>
<p>In this study, we developed a series of benchmark datasets, each designed to assess a specific classification challenge in detecting phage in metagenomic datasets, and evaluated metagenomic phage detection tools published before July 2021. Notably, this work evaluates only self-contained computational tools and does not include more modular viral discovery pipelines such as the IMG/VR viral discovery pipelines (<xref ref-type="bibr" rid="B53">Paez-Espino et al., 2017</xref>). Additionally, this work does not include tools specifically intended to detect integrated prophage in complete bacterial genomes, for which prior benchmarking efforts are already available (<xref ref-type="bibr" rid="B60">Roach et al., 2022</xref>). Importantly, we ensured the availability and reusability of the developed benchmark datasets and described how researchers could use them for benchmarking new phage detection tools (<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.7194616">doi.org/10.5281/zenodo.7194616</ext-link>).</p>
</sec>
<sec id="S2" sec-type="materials|methods">
<title>2. Materials and methods</title>
<sec id="S2.SS1">
<title>2.1. Phage detection tools</title>
<p>Tools were categorized into two broad groups. Homology-based tools are those that utilize a reference database at the time of classification to search for homologues. Sequence-based tools are those that classify using a model trained on sequence features such as <italic>k</italic>-mer frequencies.</p>
<p>Each tool evaluated in this study was installed from the recommended source following the authors&#x2019; instructions. When tools were available from several sources, Bioconda was preferred due to simplified dependency management. Tools that could not be obtained through Bioconda were directly cloned from GitHub or SourceForge. <xref ref-type="table" rid="T1">Table 1</xref> summarizes each tool and its version number (when available), category, classification method, training set or reference database, and how the tool was obtained. Tools with &#x201C;&#x2212;&#x201D; under Distribution were not used for further benchmarking; the reasons are presented in Results.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Overview of published metagenomic phage detection tools.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Tool and version</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Category</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Method</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Training set/Reference database</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Distribution</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">References</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">DeepVirFinder (1.0)</td>
<td valign="top" align="center">Sequence</td>
<td valign="top" align="center"><italic>k</italic>-mer based deep learning neural net</td>
<td valign="top" align="center">NCBI RefSeq genomes before May 2015 and virome sequences</td>
<td valign="top" align="center">GitHub</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B58">Ren et al., 2020</xref></td>
</tr>
<tr>
<td valign="top" align="left">MARVEL (0.2)</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">Random forest utilizing gene density, strand shifts, and proteins.</td>
<td valign="top" align="center">NCBI RefSeq genomes before 2016</td>
<td valign="top" align="center">GitHub</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B2">Amgarten et al., 2018</xref></td>
</tr>
<tr>
<td valign="top" align="left">MetaPhinder</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">Integrated analysis of BLASTn hits to a phage database</td>
<td valign="top" align="center">Viral dataset from NCBI genomes, EMBL EBI genomes, phageDB, PhAnToMe/bacterial dataset from NCBI genomes. Downloaded before August 2014</td>
<td valign="top" align="center">GitHub</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B34">Jurtz et al., 2016</xref></td>
</tr>
<tr>
<td valign="top" align="left">PhaMers</td>
<td valign="top" align="center">Sequence</td>
<td valign="top" align="center"><italic>k</italic>-Nearest neighbors and centroid proximity metric of <italic>k</italic>-mers</td>
<td valign="top" align="center">NCBI RefSeq genomes before October 2015</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B15">Deaton et al., 2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">PPR-Meta</td>
<td valign="top" align="center">Sequence</td>
<td valign="top" align="center">Convolutional neural network (CNN) of one-hot encodings of nucleobases and codons</td>
<td valign="top" align="center">NCBI RefSeq genomes. Download date unknown.</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B18">Fang et al., 2019</xref></td>
</tr>
<tr>
<td valign="top" align="left">RNN-VirSeeker</td>
<td valign="top" align="center">Sequence</td>
<td valign="top" align="center">Long short-term memory (LSTM) of sequences</td>
<td valign="top" align="center">NCBI RefSeq genomes downloaded before January 2014.</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B42">Liu F. et al., 2022</xref></td>
</tr>
<tr>
<td valign="top" align="left">Seeker</td>
<td valign="top" align="center">Sequence</td>
<td valign="top" align="center">LSTM of sequences</td>
<td valign="top" align="center">NCBI genomes and EMBL EBI genomes. Download date unknown.</td>
<td valign="top" align="center">PyPi</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B5">Auslander et al., 2020</xref></td>
</tr>
<tr>
<td valign="top" align="left">Unlimited Breadsticks</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">HMM of virus hallmark genes</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center">GitHub</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B67">Tisza et al., 2020</xref></td>
</tr>
<tr>
<td valign="top" align="left">VIBRANT (1.0.1)</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">Neural network of protein signatures including ratios of KEGG, VOG, and PFAM hits, and presence of key viral-like genes.</td>
<td valign="top" align="center">NCBI RefSeq and Genbank before July 2019</td>
<td valign="top" align="center">Bioconda</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B37">Kieft et al., 2020</xref></td>
</tr>
<tr>
<td valign="top" align="left">viralVerify (1.1)</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">Naive Bayes classifier using an hmmsearch of genes predicted with Prodigal</td>
<td valign="top" align="center">NCBI RefSeq genomes. Download date unknown.</td>
<td valign="top" align="center">Bioconda</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B4">Antipov et al., 2020</xref></td>
</tr>
<tr>
<td valign="top" align="left">ViraMiner</td>
<td valign="top" align="center">Sequence</td>
<td valign="top" align="center">CNN of one-hot encoded nucleobases</td>
<td valign="top" align="center">Sequences from 19 WGS metagenomes from human microbiome samples.</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B66">Tampuu et al., 2019</xref></td>
</tr>
<tr>
<td valign="top" align="left">VirFinder (1.1)</td>
<td valign="top" align="center">Sequence</td>
<td valign="top" align="center">Logistic regression using <italic>k</italic>-mers</td>
<td valign="top" align="center">NCBI RefSeq genomes downloaded before January 2014.</td>
<td valign="top" align="center">Bioconda</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B57">Ren et al., 2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">VirMine</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">BLAST search of ORFs against viral and non-viral databases</td>
<td valign="top" align="center">Viral dataset: NCBI RefSeq viral genomes. Bacterial dataset: Bacterial COGs. Download date unknown.</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B21">Garretto et al., 2019</xref></td>
</tr>
<tr>
<td valign="top" align="left">VirMiner</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">Random forest (RF) based on functional profiling and protein homology</td>
<td valign="top" align="center">Viral dataset NCBI genomes and ACLAME database. Bacterial dataset: NCBI genomes. Downloaded October 2016.</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B76">Zheng et al., 2019</xref></td>
</tr>
<tr>
<td valign="top" align="left">VirNet</td>
<td valign="top" align="center">Sequence</td>
<td valign="top" align="center">Deep attention model of sequences</td>
<td valign="top" align="center">NCBI RefSeq genomes. Download date unknown.</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B1">Abdelkareem et al., 2018</xref></td>
</tr>
<tr>
<td valign="top" align="left">VIROME</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">Functional and taxonomic information based on ORF homology</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B69">Wommack et al., 2012</xref></td>
</tr>
<tr>
<td valign="top" align="left">VirSorter</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">Gene homology including enrichment of viral-like and short genes, and depletion of PFAM hits and strand shifts</td>
<td valign="top" align="center">NCBI RefSeq genomes before January 2014 and environmental viromes.</td>
<td valign="top" align="center">wget</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B61">Roux et al., 2015</xref></td>
</tr>
<tr>
<td valign="top" align="left">VirSorter2 (2.2)</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">Random forest classifiers using an hmmsearch of genes predicted with Prodigal</td>
<td valign="top" align="center">NCBI RefSeq genomes before January 2020 and high-quality genomes from the literature.</td>
<td valign="top" align="center">Bioconda</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B25">Guo et al., 2021</xref></td>
</tr>
<tr>
<td valign="top" align="left">VirusSeeker</td>
<td valign="top" align="center">Homology</td>
<td valign="top" align="center">BLAST search against virus database, followed by search against full NCBI database to remove false positives</td>
<td valign="top" align="center">Viral-only NCBI NT and NR database before August 2016</td>
<td valign="top" align="center">&#x2212;</td>
<td valign="top" align="center"><xref ref-type="bibr" rid="B74">Zhao G. et al., 2017</xref></td>
</tr>
</tbody>
</table></table-wrap>
</sec>
<sec id="S2.SS2">
<title>2.2. Datasets</title>
<p>This study leverages five datasets. Three were constructed for benchmarking: (1) the <italic>genome fragment set</italic>, used to assess the effects of contig length, low viral abundance, eukaryotic contamination, and potential bias toward certain groups of phages; (2) the <italic>simulated phageome set</italic>, used to explore the effect of sequencing error on classification; and (3) the <italic>simulated metagenome set</italic>, used to explore the effects of assembly quality and viral abundance in samples. Two are real datasets: (4) the <italic>CRC dataset</italic>, a gut metagenome dataset from colorectal cancer (CRC) patients and healthy controls, used to compare the results of the tools on real metagenomes; and (5) the <italic>gut virome dataset</italic>, comprising gut viromes from children with Crohn&#x2019;s disease or ulcerative colitis and from healthy controls.</p>
<sec id="S2.SS2.SSS1">
<title>2.2.1. Set 1: Genome fragment set</title>
<p>All complete archaeal, bacterial, fungal, and viral genomes were downloaded from the RefSeq database on 14 June 2021 (<xref ref-type="bibr" rid="B52">O&#x2019;Leary et al., 2016</xref>). These genomes were fragmented into non-overlapping adjacent fragments of lengths 500, 1,000, 3,000, and 5,000 nucleotides. In total, 379 archaeal, 21,788 bacterial, 18 fungal, and 11,156 viral genomes were obtained and fragmented, of which 1,483 were phage. From those fragmented genomes, 10,000 fragments were randomly selected for each length and each superkingdom. The resulting set includes four subsets (500, 1,000, 3,000, and 5,000 bp), each with 10k fragments from each of the four superkingdoms, for a total of 40k fragments per length subset. This collection of unmodified fragmented reference genomes is referred to as the <italic>genome fragment set</italic>.</p>
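<p>The fragmentation and sampling steps described above can be illustrated with a minimal Python sketch. This is not the code used in the study: the function names are hypothetical, and the sketch assumes that a trailing remainder shorter than the target length is discarded and that sampling is done without replacement.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random


def fragment_genome(sequence, length):
    """Split a sequence into non-overlapping adjacent fragments of `length` nt.

    Assumption (not stated in the source): a trailing remainder shorter
    than `length` is discarded.
    """
    last_start = len(sequence) - length
    return [sequence[i:i + length] for i in range(0, last_start + 1, length)]


def sample_fragments(fragments, n, seed=0):
    """Randomly select up to `n` fragments without replacement."""
    rng = random.Random(seed)
    if len(fragments) <= n:
        return list(fragments)
    return rng.sample(fragments, n)


# Toy 1,200 nt "genome" fragmented at the four lengths used in the study.
genome = "ACGT" * 300
counts = {}
for length in (500, 1000, 3000, 5000):
    counts[length] = len(fragment_genome(genome, length))
print(counts)
```
]]></preformat>
<p>In this toy example, the 1,200 nt sequence yields two 500 nt fragments, one 1,000 nt fragment, and no fragments at 3,000 or 5,000 nt, because the sequence is shorter than those lengths.</p>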
</sec>
<sec id="S2.SS2.SSS2">
<title>2.2.2. Set 2: Simulated phageome set</title>
<p>InSilicoSeq v1.5.4 (<xref ref-type="bibr" rid="B23">Gourl&#x00E9; et al., 2019</xref>) was used to create simulated reads from phage genomes. This tool builds an error model of per-base quality (Phred) scores using kernel density estimation, trained on real sequencing reads. InSilicoSeq was chosen due to its computational efficiency (<xref ref-type="bibr" rid="B47">McElroy et al., 2012</xref>), simplicity of use and documentation (<xref ref-type="bibr" rid="B72">Yu et al., 2020</xref>), and ability to simulate Illumina sequencing instead of 454 technology (<xref ref-type="bibr" rid="B59">Richter et al., 2008</xref>; <xref ref-type="bibr" rid="B75">Zhao M. et al., 2017</xref>). Additionally, this tool has been demonstrated to generate reads with realistic quality score distributions for several sequencing platforms, including MiSeq, HiSeq, and NovaSeq (<xref ref-type="bibr" rid="B23">Gourl&#x00E9; et al., 2019</xref>; <xref ref-type="bibr" rid="B72">Yu et al., 2020</xref>).</p>
<p>Three &#x201C;phageome&#x201D; profiles were created by randomly selecting 500 phage genomes per profile from the downloaded RefSeq database. Reads were simulated using InSilicoSeq, specifying 30x coverage of all genomes. For each profile, reads were simulated with each of the three built-in error models (HiSeq, MiSeq, and NovaSeq). Simulated reads from InSilicoSeq were assembled with MEGAHIT v1.2.9 (<xref ref-type="bibr" rid="B40">Li et al., 2015</xref>) and binned with MetaBAT 2 v2:2.15, using Bowtie2 v2.4.5 for indexing (<xref ref-type="bibr" rid="B39">Langmead and Salzberg, 2012</xref>; <xref ref-type="bibr" rid="B35">Kang et al., 2019</xref>).</p>
<p>To determine the genomic origin of each contig, BLAST v2.12.0+ was used for alignment. Since the genomes used for read simulation for each profile were known, a local BLAST database was created using those same genomes for each simulated phageome, reducing spurious hits. Alignment was done using the MEGABLAST mode of BLASTn, with an <italic>e</italic>-value threshold of 1e-20. Even with the limited BLAST databases, it was common for contigs to have significant hits to several genomes. To determine the &#x201C;true&#x201D; origin of the contigs, a basic decision tree was used (<xref ref-type="supplementary-material" rid="DS1">Supplementary Figure 3</xref>).</p>
<p>In total, the <italic>simulated phageome set</italic> comprises nine phageome assemblies and bin sets (3 profiles &#x00D7; 3 error models).</p>
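<p>The nine assemblies follow from crossing every profile with every error model. As a small illustration, the design can be enumerated as below; the profile labels are hypothetical placeholders, and only the error model names come from the text.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import product

# Hypothetical profile labels; the study does not name its profiles here.
profiles = ["profile_1", "profile_2", "profile_3"]
# The three InSilicoSeq built-in error models named in the text.
error_models = ["HiSeq", "MiSeq", "NovaSeq"]

# One simulated read set (and hence one assembly) per combination.
runs = [f"{profile}_{model}" for profile, model in product(profiles, error_models)]
print(len(runs))
```
]]></preformat>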
</sec>
<sec id="S2.SS2.SSS3">
<title>2.2.3. Set 3: Simulated metagenome set</title>
<p>The same simulation and binning steps used for Set 2 were used to generate a set of simulated metagenomes. Five marine samples were used as the basis of this dataset. Three were from the Hawaii Ocean Time-series (HOT) program (SRR5720259, SRR5720320, SRR6507280) (<xref ref-type="bibr" rid="B36">Karl and Lukas, 1996</xref>), and two were from the Amazon continuum dataset (SRR4831655, SRR4831664) (<xref ref-type="bibr" rid="B62">Satinsky et al., 2014</xref>). Raw sequencing data were downloaded from the Sequence Read Archive (SRA) and processed using fastqc v0.11.9 and trimGalore v0.6.6 (<xref ref-type="bibr" rid="B3">Andrews, 2010</xref>; <xref ref-type="bibr" rid="B6">Babraham Bioinformatics, 2022</xref>). Briefly, reads with an average base quality score below 20 were removed, and adapters and poly-G sequences were trimmed. After trimming, reads with a length &#x003C; 20 bp were filtered out. After quality control (QC), taxonomic abundance profiles of the bacterial and phage populations in each sample were obtained using Kraken2 (<xref ref-type="bibr" rid="B70">Wood et al., 2019</xref>) and Bracken (<xref ref-type="bibr" rid="B44">Lu et al., 2017</xref>) against the PlusPF database (version 5/17/2021 available at <ext-link ext-link-type="uri" xlink:href="https://benlangmead.github.io/aws-indexes/k2">https://benlangmead.github.io/aws-indexes/k2</ext-link>). The abundance profiles were used as input for InSilicoSeq, using reference genomes obtained from the RefSeq database. Additionally, any profile with a phage abundance below 5% of reads was supplemented with a minimum of 10 phages known to infect the top non-viral organisms in the profile. For each profile, 20M simulated reads were generated using each of the three built-in error models (HiSeq, MiSeq, and NovaSeq). The 15 resulting assemblies and bins are referred to as the <italic>simulated metagenome set</italic>.</p>
</sec>
<sec id="S2.SS2.SSS4">
<title>2.2.4. Set 4: CRC dataset and Set 5: Gut virome dataset</title>
<p>We also included a real-metagenomic dataset from a published study that used fecal shotgun metagenomics to characterize stool microbial populations from CRC patients compared to healthy controls with a total of 198 samples (<xref ref-type="bibr" rid="B73">Zeller et al., 2014</xref>). This dataset is referred to as the <italic>CRC dataset</italic> in this study. Additionally, we included a real virome dataset from a previously published study that used viral particle enrichment on fecal samples (<xref ref-type="bibr" rid="B19">Fernandes et al., 2019</xref>) from 24 healthy and IBD children. This second dataset is referred to as <italic>gut virome dataset</italic> in this study.</p>
<p>Raw sequencing data were downloaded from SRA (PRJEB6070 and PRJNA391511) and were quality filtered using fastqc v0.11.9 and trimGalore v0.6.6. Briefly, reads with an average base pair quality score below 20 were removed, and adapters and poly-G sequences were trimmed. After trimming, reads with a length &#x003C; 20 bp were filtered out. Quality-filtered sequences were screened to remove human sequences using bowtie2 v2.4.2 against a non-redundant version of the Genome Reference Consortium Human Build 38, patch release 7 (available at PRJNA31257 in NCBI).</p>
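<p>The two read-level filters described above (mean base quality of at least 20 and length of at least 20 bp) can be sketched in Python as follows. This only illustrates the stated criteria and does not reproduce how the QC tools implement them internally; the function names are hypothetical, and the sketch assumes Phred+33 quality encoding.</p>
<preformat preformat-type="code"><![CDATA[
```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality_string) / len(quality_string)


def passes_qc(sequence, quality_string, min_mean_quality=20, min_length=20):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if len(sequence) < min_length:
        return False
    return mean_phred(quality_string) >= min_mean_quality


# Toy reads as (sequence, quality string) pairs.
reads = [
    ("ACGT" * 10, "I" * 40),  # 40 nt, Phred 40 at every base: kept
    ("ACGT" * 10, "#" * 40),  # 40 nt, Phred 2 at every base: removed
    ("ACGTACGT", "I" * 8),    # 8 nt: removed for length
]
kept = [sequence for sequence, quality in reads if passes_qc(sequence, quality)]
print(len(kept))
```
]]></preformat>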
<p>After QC and human read filtering, the reads were assembled using Megahit v1.2.9. The code of the pipeline used for the assembly is available on GitHub.<sup><xref ref-type="fn" rid="footnote1">1</xref></sup> Megahit was run on the paired-end reads or single-end reads using the default parameters (referred to as the simple assembly). Additionally, a co-assembly of the multiple runs per BioSample was performed (referred to as the co-assembly). Assemblies were binned with MetaBAT 2 v2:2.15, using Bowtie2 v2.4.5 for indexing. CheckV v1.0.1 was run on all assemblies to assess viral and bacterial gene content.</p>
</sec>
</sec>
<sec id="S2.SS3">
<title>2.3. Classification of the datasets</title>
<p>Snakemake was used as a workflow manager for running the tools (<xref ref-type="bibr" rid="B38">K&#x00F6;ster and Rahmann, 2012</xref>). This pipeline was implemented on the Puma High-Performance Compute (HPC) cluster at the University of Arizona using SLURM (<xref ref-type="bibr" rid="B71">Yoo et al., 2003</xref>). While running the tools, the following metrics were collected by Snakemake: runtime and CPU time, peak memory usage, and file write operations.</p>
<p>The tools were run with their default parameters, modes, and databases to replicate the usage intended by their authors. DeepVirFinder was run without a length cutoff. MetaPhinder was run using the default database. Seeker was run using the command-line executable binary instead of the Python package. VIBRANT was run in standard (not virome) mode, with the default minimum length (1,000 bp) and number of ORFs (4). viralVerify was run with the default database. VirSorter was run using the default (RefSeq) database, in non-virome mode, with the default BLASTP rather than DIAMOND. VirSorter2 was run to identify only dsDNAphage and ssDNAphage, allowing for proviruses by not using the &#x201C;&#x2013;no-pro-virus&#x201D; flag and not limiting the number of ORFs.</p>
<p>All of the tools classified the fragments in the <italic>genome fragment set</italic> except for MARVEL, which requires bins as input. VIBRANT and VirSorter do not classify fragments shorter than 1,000 nucleotides (nt), so there are no data for these tools for the 500 nt fragments. Default parameters were used for cutoff thresholds when the option was provided. To simplify the comparison of classification performance, all predictions were binned into &#x201C;phage&#x201D; or &#x201C;non-phage&#x201D; classes. VirFinder and DeepVirFinder return a prediction score, and a score of 0.5 was used as a cutoff for classification. VIBRANT predicts both prophage and lytic virus labels, both of which were considered to be classified as phage. VirSorter also predicts prophage and lytic labels, and assigns a confidence category, all of which were considered to be classified as phage.</p>
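<p>As a minimal illustration of the binning of predictions described above, the following Python sketch collapses a continuous score or a categorical label into the two classes. The function names, the label strings, and the choice to treat a score exactly equal to the cutoff as phage are assumptions made for illustration; they are not taken from the tools or from the benchmarking pipeline.</p>
<preformat>
```python
def binarize_score(score, cutoff=0.5):
    """Map a continuous prediction score to 'phage' or 'non-phage'.

    Assumption: a score equal to the cutoff is counted as phage.
    """
    return "phage" if score >= cutoff else "non-phage"


# Hypothetical label strings; each real tool uses its own output vocabulary.
PHAGE_LABELS = {"prophage", "lytic"}


def binarize_label(label):
    """Treat any prophage or lytic call as 'phage', everything else as 'non-phage'."""
    return "phage" if label.lower() in PHAGE_LABELS else "non-phage"
```
</preformat>
<p>Under these assumptions, a hypothetical score of 0.73 and a hypothetical &#x201C;Prophage&#x201D; label would both be counted as phage predictions.</p>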
<p>The simulated datasets (<italic>simulated phageome set</italic> and <italic>simulated metagenome set</italic>), the <italic>CRC dataset</italic>, and the <italic>gut virome dataset</italic> were classified by all tools. All tools except MARVEL classified the assembled contigs directly, whereas MARVEL was given the binned assemblies. The resources required for binning were included in the resource usage benchmarking for MARVEL.</p>
</sec>
<sec id="S2.SS4">
<title>2.4. Performance assessment</title>
<p>Several performance metrics were assessed for each of the challenge datasets. These are precision, sensitivity (recall), specificity, <italic>F</italic>1, and AUPRC, as defined below. In these definitions, a &#x201C;positive&#x201D; is a phage sequence, while a &#x201C;negative&#x201D; is anything that is not phage. Accordingly, a true positive (TP) is a phage sequence that has been correctly labeled as phage, a true negative (TN) is a non-phage sequence correctly labeled as such, a false positive (FP) is a non-phage sequence labeled as phage, and a false negative (FN) is a phage sequence not labeled as phage. Precision is the proportion of all predicted phage sequences that are indeed phage (Eq. 1). Sensitivity, also known as recall, is the proportion of all true phage sequences that were correctly identified (Eq. 2). Specificity is the proportion of all non-phage sequences that were correctly labeled (Eq. 3). <italic>F</italic>1 score is the harmonic mean of precision and sensitivity (Eq. 4). The area under the precision-recall curve (AUPRC) is a measure of precision over the sensitivity range, given a varying classification threshold for a continuous predictive output value (Eq. 5).</p>
<disp-formula id="S2.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S2.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mpadded width="+5pt">
<mml:mi>y</mml:mi>
</mml:mpadded>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S2.E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S2.E4">
<label>(4)</label>
<mml:math id="M4">
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S2.E5">
<label>(5)</label>
<mml:math id="M5">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>U</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x222B;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mn>1</mml:mn>
</mml:msubsup>
<mml:mrow>
<mml:mpadded width="+1.7pt">
<mml:mi>p</mml:mi>
</mml:mpadded>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo mathvariant="italic" rspace="0pt">d</mml:mo>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="7.5pt">;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>o</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="7.5pt">,</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
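<p>For readers who wish to reproduce these metrics, the following Python sketch computes Eqs. 1 to 4 from confusion-matrix counts and approximates the integral in Eq. 5 by trapezoidal integration over the observed score thresholds. The function names, the handling of undefined ratios (returned as None), and the convention of anchoring the curve at recall 0 with the precision of the first threshold are assumptions of this sketch, not a description of the code used in this study.</p>
<preformat>
```python
def safe_div(num, den):
    """Return num / den, or None when the denominator is zero (metric undefined)."""
    return num / den if den else None


def classification_metrics(tp, tn, fp, fn):
    """Compute precision, sensitivity, specificity, and F1 (Eqs. 1 to 4)."""
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    if precision is None or sensitivity is None or precision + sensitivity == 0:
        f1 = None
    else:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


def auprc(scores, labels):
    """Approximate Eq. 5 with the trapezoidal rule.

    scores: continuous prediction scores (higher means more phage-like).
    labels: 1 for phage, 0 for non-phage, in the same order as scores.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return None  # recall is undefined without any true phage sequence
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []  # (recall, precision) at each distinct score threshold
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        last_of_tie = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((tp / n_pos, tp / (tp + fp)))
    # Assumed convention: start at recall 0 with the precision of the first point.
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area
```
</preformat>
<p>Other interpolation schemes for the precision-recall curve exist and can give slightly different AUPRC values, so results from this sketch should not be expected to match any particular library exactly.</p>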
</sec>
</sec>
<sec id="S3" sec-type="results">
<title>3. Results</title>
<sec id="S3.SS1">
<title>3.1. Installation of tools</title>
<p>A total of 19 tools were collected based on a survey of the literature as of July 2021. However, several tools were omitted from further investigation for the following reasons: (1) runtime exceptions (PhaMers and VirMine), (2) hard-coded paths that require the user to modify source code (RNN-VirSeeker and VirusSeeker), (3) lack of clear installation instructions and documentation (ViraMiner and VirNet), (4) lack of scalability due to web server usage (VirMiner and VIROME), and (5) inability to run instances of the tool on different cores from the same directory (PPR-Meta). Finally, Cenote Unlimited Breadsticks could be installed and run but did not classify any of the genome fragments as viral; it was therefore excluded from the benchmark analysis but included in the resource usage comparison.</p>
</sec>
<sec id="S3.SS2">
<title>3.2. Resource usage</title>
<p>Computational resource usage was benchmarked using the <italic>genome fragment set</italic> since the quantity and length of genomic fragments were known and balanced. Pre- and post-processing steps were excluded from these measurements. For tools that did not allow the user to specify the output directory (MARVEL and Seeker), we also included time to move output files to the correct output directory.</p>
<p>The total time (in CPU time) for each tool included: (1) CPU time to run the tool summed for user and system and (2) the amount of time to read and write data while classifying the <italic>genome fragment set</italic> (<xref ref-type="fig" rid="F1">Figure 1</xref>). For some tools, CPU time was highly variable (MetaPhinder, VIBRANT, viralVerify, and to a lesser extent VirSorter, VirSorter2, and Seeker). Seeker generally had the longest CPU times, even for shorter fragments. While DeepVirFinder was consistently fast in terms of CPU time, its real-world performance was hindered by its use of the Theano backend. Although multiple jobs can be submitted in parallel, the Theano backend can only process one dataset at a time, which forces serial processing. This led to long-running jobs for DeepVirFinder despite deceptively low CPU time measures.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Resource usage while classifying 10k genome fragments of various lengths (500, 1,000, 3,000, and 5,000 nt). <bold>(A)</bold> Read operations (MB), <bold>(B)</bold> write operations (MB), and <bold>(C)</bold> CPU time (h) summed for user and system. Sequence-based tools are in blue, homology-based tools are in yellow.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g001.tif"/>
</fig>
</sec>
<sec id="S3.SS3">
<title>3.3. Benchmark challenge 1: Classification of genome fragments</title>
<sec id="S3.SS3.SSS1">
<title>3.3.1. Effect of contig length</title>
<p>We first evaluated the effect of contig length on each tool&#x2019;s performance. To assess this effect, the <italic>genome fragment set (Set 1)</italic> was used as input. <xref ref-type="fig" rid="F2">Figure 2A</xref> shows <italic>F</italic>1 score for increasing fragment lengths (500, 1,000, 3,000, and 5,000 nt). As expected, homology-based tools such as VIBRANT, viralVerify, and VirSorter2 were strongly affected by fragment length, with performance increasing with length. VirSorter had the lowest <italic>F</italic>1 score for all lengths and had only a marginal increase in <italic>F</italic>1 with increasing length. The sequence-based tools (DeepVirFinder, VirFinder, and Seeker), as well as MetaPhinder, were largely unaffected by fragment length.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Effect of fragment length on classification performance. Only phage and bacterial sequence fragments are included. <bold>(A)</bold> Balanced <italic>F</italic>1 score plotted against fragment length (nt) and <bold>(B)</bold> balanced precision plotted against sensitivity for four fragment lengths. Top row of tools are sequence-based (in blue), bottom two rows are homology-based (in yellow).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g002.tif"/>
</fig>
<p>While <italic>F</italic>1 score illustrates overall changes in classification performance due to fragment length, each tool&#x2019;s performance is affected in different ways (<xref ref-type="fig" rid="F2">Figure 2B</xref>). VIBRANT, viralVerify, and VirSorter2 demonstrate fairly consistent and high precision but have length-dependent sensitivity, whereas DeepVirFinder, MetaPhinder, Seeker, and VirFinder demonstrate fairly consistent sensitivity and precision. Generally, length-dependent sensitivity is a property of homology-based tools where longer fragments are needed for classification. Interestingly, MetaPhinder (a homology-based tool) exhibits a pattern similar to the sequence-based tools for this property.</p>
<p>Several tools output a continuous classification score metric. For these tools (VirFinder, DeepVirFinder, Seeker, MetaPhinder, and viralVerify), the threshold used will affect precision and sensitivity. Although the default threshold was used, the effect of this threshold can be seen in the precision-recall curves (<xref ref-type="supplementary-material" rid="DS1">Supplementary Figure 1</xref>) and AUPRC (<xref ref-type="supplementary-material" rid="DS1">Supplementary Figure 2</xref>). These tools all showed lower AUPRC for shorter contigs, with DeepVirFinder outperforming the other tools even on shorter contigs.</p>
</sec>
<sec id="S3.SS3.SSS2">
<title>3.3.2. Low viral content</title>
<p>In the above section, precision is computed based on a balanced dataset (equal quantities of phage and bacteria). However, this gives a highly optimistic estimate of precision. For a given false positive rate (FPR), precision will drop significantly when phage content is low. To illustrate this, the FPR (Eq. 6) was taken from the classification of the <italic>genome fragment set</italic>, and precision was extrapolated to hypothetical community compositions ranging from 0 to 100% non-viral fragments (<xref ref-type="fig" rid="F3">Figure 3</xref>). A value of 50% represents a balanced dataset.</p>
<disp-formula id="S3.E6">
<label>(6)</label>
<mml:math id="M6">
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Extrapolated precision calculated from FPR of each tool at four fragment lengths (500, 1,000, 3,000, and 5,000 nt). Precision is calculated for communities composed of varying levels of non-viral fragments from 0% (all phage) to 100% (all non-phage). Top row of tools are sequence-based (in blue), bottom two rows are homology-based (in yellow).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g003.tif"/>
</fig>
<p>For communities with low viral content, precision decreases for nearly all tools. Notably, viralVerify did not falsely classify any 500 nt fragments as phage, thus its hypothetical precision remains perfect for that fragment length. VIBRANT, viralVerify, and VirSorter2 maintain fairly high precision but still drop below 0.5 for communities with low viral content.</p>
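<p>The dependence of precision on community composition can be sketched in a few lines of Python. The sketch assumes that a tool&#x2019;s sensitivity and FPR (Eq. 6) are independent of community composition, so that the expected true and false positives scale with the viral and non-viral fractions, respectively. The function name and the example values are hypothetical and are not measurements from this study.</p>
<preformat>
```python
def extrapolated_precision(sensitivity, fpr, nonviral_fraction):
    """Expected precision for a community with the given non-viral fraction.

    Assumption: sensitivity and FPR measured on a balanced set stay constant
    as the proportion of non-viral fragments changes.
    """
    viral_fraction = 1.0 - nonviral_fraction
    expected_tp = sensitivity * viral_fraction   # phage fragments called phage
    expected_fp = fpr * nonviral_fraction        # non-phage fragments called phage
    total_predicted = expected_tp + expected_fp
    if total_predicted == 0:
        return None  # nothing is predicted as phage, so precision is undefined
    return expected_tp / total_predicted


# Hypothetical tool with sensitivity 0.9 and FPR 0.05.
balanced = extrapolated_precision(0.9, 0.05, 0.50)
low_viral = extrapolated_precision(0.9, 0.05, 0.95)
```
</preformat>
<p>With these hypothetical values, precision is about 0.95 for a balanced dataset but falls to about 0.49 when 95% of fragments are non-viral, which mirrors the qualitative trend described above. A tool with an FPR of zero keeps a precision of 1 at every composition that still contains phage.</p>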
</sec>
<sec id="S3.SS3.SSS3">
<title>3.3.3. Effect of phage taxonomy</title>
<p>The lack of phage diversity and the bias toward certain phage groups in reference databases lead to challenges in training models and a propensity for tools to retrieve fewer phages from less represented phage groups. The majority (ca. 93%) of phages in RefSeq belong to the order Caudovirales (recently renamed as the class Caudoviricetes; <xref ref-type="bibr" rid="B68">Turner et al., 2021</xref>). Importantly, all tools were shown to have reduced sensitivity for non-Caudovirales sequences. In particular, homology-based tools showed a drastically reduced sensitivity toward these phages (<xref ref-type="fig" rid="F4">Figure 4</xref>).</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Effects of phage taxonomy on sensitivity. <bold>(A)</bold> Sensitivity plotted against fragment length, comparing bacteriophages in the order Caudovirales and those in other orders. <bold>(B)</bold> Sensitivity plotted against fragment length, comparing bacteriophages in the top three families of the order Caudovirales. Top row of tools are sequence-based (in blue), bottom two rows are homology-based (in yellow).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g004.tif"/>
</fig>
<p>While the limitation of database composition is mitigated by sequence-based compared to homology-based approaches, a slightly lower sensitivity for non-caudoviral phages was nonetheless observed.</p>
<p>Even within the caudoviral order, the three main families are detected unequally by the tools. <italic>Siphoviridae</italic> is the most represented phage family in RefSeq (69.9%), followed by <italic>Myoviridae</italic> (15.6%) and <italic>Podoviridae</italic> (7.28%). This is apparent in the <italic>genome fragment set</italic> due to random sampling from the fragmented genomes; of the caudoviruses, more fragments came from <italic>Siphoviridae</italic> and <italic>Myoviridae</italic> than <italic>Podoviridae</italic>. <xref ref-type="fig" rid="F4">Figure 4</xref> demonstrates how sensitivity is decreased for the retrieval of <italic>Siphoviridae</italic> and <italic>Podoviridae</italic> sequences compared to <italic>Myoviridae</italic>, in particular for MetaPhinder, VIBRANT, and VirSorter2.</p>
</sec>
<sec id="S3.SS3.SSS4">
<title>3.3.4. Eukaryotic contamination</title>
<p>A concern for sequence-based tools is specificity when faced with eukaryotic contamination, due to the lack of eukaryotic sequences in the training sets (<xref ref-type="bibr" rid="B54">Ponsero and Hurwitz, 2019</xref>). As part of the <italic>genome fragment set</italic>, the tools classified 10k eukaryotic genome fragments of four lengths from fungi in the phyla Ascomycota and Basidiomycota. The specificity on these eukaryotic fragments is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. All homology-based tools except MetaPhinder are extremely robust to eukaryotic contamination, even for short fragments. However, sequence-based tools and MetaPhinder show much lower specificity, frequently misclassifying eukaryotic fragments as viral, with an FPR around 0.5. Notably, Seeker shows a specificity that is worse than random-chance binary classification, classifying nearly all eukaryotic sequences as viral.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Classification specificity of eukaryotic genome fragments. Eukaryotic fragments were generated from the Ascomycota and Basidiomycota phyla (<italic>n</italic> = 10k) at different lengths. The specificity of each tool was measured for each sequence length. Top row of tools are sequence-based (in blue), bottom two rows are homology-based (in yellow).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g005.tif"/>
</fig>
</sec>
</sec>
<sec id="S3.SS4">
<title>3.4. Benchmark challenge 2: Classification of simulated metagenomic sequences</title>
<p>To compare the relative performance of the tools when faced with read errors from different sequencing technologies and potential assembly error, the <italic>simulated phageome set</italic> and <italic>simulated metagenome set</italic> were created from simulated reads generated using error models that represent 3 popular sequencing platforms: HiSeq, MiSeq, and NovaSeq. Each technology has a unique per-base error rate, as modeled by InSilicoSeq, as well as differing read lengths. We aimed to assess how these differences affect assembly and tool performance.</p>
<sec id="S3.SS4.SSS1">
<title>3.4.1. Simulated phageomes</title>
<p>To directly assess the effect of sequencing error and assembly on classification sensitivity, the <italic>simulated phageome set</italic> was classified by each tool. In this dataset, simulated reads were obtained from phage genomes and assembled into contigs. The assembled contigs&#x2019; length varied from 500 to 309,196 bp, with a median length of 949 bp. Unlike in the <italic>genome fragment set</italic>, simulated contigs could be used to assess MARVEL, which requires binned sequences for classification. Each tool&#x2019;s sensitivity was calculated using the assembled contigs grouped by length (<xref ref-type="fig" rid="F6">Figure 6</xref>).</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>Tool performance on a simulated phageome dataset. <bold>(A)</bold> Length distributions and numbers of assembled contigs constituting the nine phageomes generated using the three phageome profiles and three error models. <bold>(B)</bold> Sensitivity of classifying simulated phage contigs. Contigs were grouped by length (<italic>x</italic>-axis) for computation of sensitivity. Error models are ordered by increasing read length. Top row of tools are sequence-based (in blue), bottom two rows are homology-based (in yellow).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g006.tif"/>
</fig>
<p>The results are largely consistent with those obtained using the <italic>genome fragment set</italic>. DeepVirFinder and MetaPhinder, followed by VirFinder and Seeker, show the highest and most consistent sensitivity across all contig lengths. VIBRANT, VirSorter, and VirSorter2 performed well for longer contigs, but sensitivity suffers as contig length decreases. VirSorter&#x2019;s sensitivity begins to improve at about 10<sup>4</sup> bp and increases greatly for 10<sup>4.5</sup> and 10<sup>5</sup> bp. MARVEL, while better than VirSorter on short contigs, demonstrates the lowest sensitivity for long contigs. Importantly, MARVEL shows more variability in sensitivity for a given length. Indeed, the tool performs classification of contigs after binning, and since contigs of various lengths may be present in the same bin, we observe that the tool&#x2019;s performance is less tightly coupled to contig length.</p>
<p>The three error models produce reads of different lengths (HiSeq 125 bp, NovaSeq 150 bp, and MiSeq 300 bp). This led to a significant difference in contig lengths between MiSeq and the other two error models (based on Wilcoxon rank sum test with Bonferroni adjusted <italic>p</italic>-value, <italic>p</italic> &#x003C; 0.05). However, the sensitivity at a given length was similar across error models, suggesting that the difference in sequencing technologies is mitigated by the assembly process.</p>
</sec>
<sec id="S3.SS4.SSS2">
<title>3.4.2. Simulated metagenomes</title>
<p>The <italic>simulated metagenome set</italic> was produced from the bacterial and phage content of 5 metagenomic marine samples. This method allowed us to generate simulated metagenomes that are as close as possible to a real metagenome set while excluding the unknown fraction of the microbial population. The distance between the original taxonomic profile for the sample and the profile used for simulation was calculated as Bray-Curtis dissimilarity (<xref ref-type="supplementary-material" rid="DS1">Supplementary Figure 4</xref>). This computational method allowed us to generate simulated metagenomes containing 5% of phage sequence content and a realistic distribution of contig lengths.</p>
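<p>For reference, the Bray-Curtis dissimilarity between two abundance profiles can be computed as the sum of absolute abundance differences divided by the total abundance of both profiles. The Python sketch below assumes that each profile is a mapping from taxon name to a non-negative abundance; the function name and the example taxa are hypothetical and do not correspond to the profiles used in this study.</p>
<preformat>
```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two taxon-to-abundance mappings.

    Returns 0.0 for identical profiles and 1.0 for profiles sharing no taxa.
    Assumption: abundances are non-negative; missing taxa count as zero.
    """
    taxa = set(profile_a).union(profile_b)
    numerator = sum(
        abs(profile_a.get(taxon, 0.0) - profile_b.get(taxon, 0.0)) for taxon in taxa
    )
    denominator = sum(profile_a.values()) + sum(profile_b.values())
    if denominator == 0:
        return None  # both profiles are empty, so the dissimilarity is undefined
    return numerator / denominator


# Hypothetical relative abundances for an original and a simulation profile.
original = {"taxon_x": 0.6, "taxon_y": 0.4}
simulated = {"taxon_x": 0.5, "taxon_y": 0.3, "taxon_z": 0.2}
dissimilarity = bray_curtis(original, simulated)
```
</preformat>
<p>In this hypothetical example the dissimilarity is 0.2, reflecting the small shift introduced by one added taxon.</p>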
<p>The precision and sensitivity of the tools on the <italic>simulated metagenome set</italic> were assessed based on the contig length (<xref ref-type="fig" rid="F7">Figure 7</xref>, and <italic>F</italic>1 score in <xref ref-type="supplementary-material" rid="DS1">Supplementary Figure 4</xref>). The observed sensitivity (<xref ref-type="fig" rid="F7">Figure 7A</xref>) is consistent with the results from the <italic>simulated phageome set</italic>, although MARVEL displays a large variance in sensitivity across the different replicates, possibly reflecting the quality of binning.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption><p><bold>(A)</bold> Precision and <bold>(B)</bold> sensitivity on a simulated metagenome dataset. Each point represents the tool&#x2019;s performance on contigs within a given length group, assembled from reads from a specific abundance profile, using the indicated error model. Top row of tools are sequence-based (in blue), bottom two rows are homology-based (in yellow).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g007.tif"/>
</fig>
<p>Importantly, using this benchmark set, a decrease in precision due to low viral abundance is clearly seen for all tools (<xref ref-type="fig" rid="F7">Figure 7B</xref>). VirSorter, viralVerify, VIBRANT, and VirSorter2 perform slightly better than VirFinder, Seeker, and MetaPhinder. viralVerify performs exceptionally well for long contigs (&#x2265;10<sup>5</sup>), followed closely by VirFinder.</p>
</sec>
</sec>
<sec id="S3.SS5">
<title>3.5. Benchmark challenge 3: Comparison on a real-world dataset</title>
<p>We finally compared the tools on two real datasets, the <italic>CRC dataset</italic> and the <italic>gut virome dataset</italic>, for which the true phage/bacterial composition is unknown. We aimed here to assess the overlap in phage identification of the different tools and estimate the number of potential FP results.</p>
<sec id="S3.SS5.SSS1">
<title>3.5.1. CRC dataset</title>
<p>Importantly, the proportion of contigs predicted to be phage in the <italic>CRC dataset</italic> varies strongly by tool. VirSorter, MARVEL, viralVerify, VIBRANT, and VirSorter2 detect fewer phages (median less than 2,200 contigs per sample) than the sequence-based methods and MetaPhinder (median greater than 33,000 contigs per sample). This result is consistent with the high precision but lower sensitivity measured for homology-based tools.</p>
<p>We next assessed the overlap of the different tools in predicting the same sequence as phage (<xref ref-type="fig" rid="F8">Figure 8</xref>). Strikingly, there was very little overlap in phage communities predicted by the tools (<xref ref-type="fig" rid="F8">Figure 8B</xref>). The highest level of consistency between tools was seen for VirFinder and DeepVirFinder (38.8% of contigs identified by either tool were identified by both), and the highest level of consistency between homology-based tools was with VirSorter and VIBRANT (26.6% of predicted phages were in common). However, most contigs showed different levels of consistency between tools, where on average, 55,320 contigs were predicted to be phage by only one tool, 62,400 were predicted by 2 or more tools, and 29,900 by 3 or more tools.</p>
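<p>The pairwise consistency described above (the share of contigs identified by either tool that were identified by both) corresponds to the Jaccard index of the two prediction sets. The following Python sketch shows one way to compute it, together with the number of tools supporting each contig. The function names, tool names, and contig identifiers are hypothetical, and the sketch is not the code used to produce <xref ref-type="fig" rid="F8">Figure 8</xref>.</p>
<preformat>
```python
from collections import Counter


def pairwise_overlap(contigs_a, contigs_b):
    """Jaccard index: contigs predicted by both tools over contigs predicted by either."""
    a, b = set(contigs_a), set(contigs_b)
    union = a.union(b)
    if not union:
        return None  # neither tool predicted any phage contig
    return len(a.intersection(b)) / len(union)


def support_counts(predictions):
    """Count, for each contig, how many tools predicted it as phage.

    predictions: mapping from tool name to an iterable of contig identifiers.
    """
    counts = Counter()
    for contigs in predictions.values():
        counts.update(set(contigs))  # each tool contributes at most once per contig
    return counts


def contigs_with_support(predictions, min_tools):
    """Return the contigs predicted as phage by at least min_tools tools."""
    return {c for c, n in support_counts(predictions).items() if n >= min_tools}


# Hypothetical predictions from three tools.
example = {
    "tool_a": ["c1", "c2", "c3"],
    "tool_b": ["c2", "c3", "c4"],
    "tool_c": ["c3"],
}
overlap_ab = pairwise_overlap(example["tool_a"], example["tool_b"])
```
</preformat>
<p>In this hypothetical example, the first two tools share two of the four contigs predicted by either, giving an overlap of 0.5, and only one contig is supported by all three tools.</p>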
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption><p>Comparison of the tools&#x2019; classifications on a real-world metagenomic dataset. <bold>(A)</bold> Proportion of contigs predicted to be phage in each length group for each sample. <bold>(B)</bold> Upset plot showing intersection size, with the <italic>x</italic>-axis in order of decreasing set size, top 51 intersections shown. <bold>(C)</bold> CheckV assessment of predicted phage contigs from the <italic>CRC dataset</italic>. Predicted phage contigs from each tool are categorized by CheckV, and plotted as a stacked bar chart of the portion of predicted phages in each category. <bold>(D)</bold> Total number of contigs predicted by CheckV to have quality Medium or greater. Sequence-based tools are in blue, homology-based tools are in yellow.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g008.tif"/>
</fig>
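<p>The pairwise consistency values reported above (the share of contigs identified by either tool that were identified by both) correspond to an intersection-over-union of the two prediction sets. The following is a minimal sketch of that calculation, not the analysis code used in this study; the function names, tool labels, and contig identifiers are hypothetical and chosen only for illustration.</p>
<preformat><![CDATA[
```python
from itertools import combinations


def shared_fraction(calls_a, calls_b):
    """Fraction of contigs called by either tool that were called by both."""
    union = calls_a | calls_b
    if not union:
        return 0.0
    return len(calls_a & calls_b) / len(union)


def pairwise_overlap(predictions):
    """Map each pair of tool names to its shared fraction."""
    return {
        (a, b): shared_fraction(predictions[a], predictions[b])
        for a, b in combinations(sorted(predictions), 2)
    }


# Hypothetical contig identifiers, for illustration only.
toy_predictions = {
    "tool_A": {"c1", "c2", "c3", "c4"},
    "tool_B": {"c3", "c4", "c5"},
    "tool_C": {"c6"},
}
overlaps = pairwise_overlap(toy_predictions)
```
]]></preformat>
<p>In this toy example, tool_A and tool_B share 2 of the 5 contigs called by either tool, whereas tool_C shares none with the others.</p>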
<p>CheckV (<xref ref-type="bibr" rid="B49">Nayfach et al., 2021a</xref>) was used to evaluate the predicted phage contigs for viral genes. Potential viral contigs are categorized as &#x201C;Not Determined&#x201D; (without any detectable viral genes), &#x201C;Low Quality,&#x201D; &#x201C;Medium Quality,&#x201D; &#x201C;High Quality,&#x201D; or &#x201C;Complete.&#x201D; The proportions of contigs predicted by the tools falling into each category are shown in <xref ref-type="fig" rid="F8">Figure 8C</xref>. As expected, the predictions of the sequence-based tools and MetaPhinder contain a low proportion of contigs with Medium Quality or higher, suggesting a potentially high number of FPs for these tools, while the other homology-based tools have higher quality predictions. MARVEL predicted no contigs below Medium Quality. Additionally, the number of predicted phages labeled as Medium Quality, High Quality, or Complete by CheckV is shown in <xref ref-type="fig" rid="F8">Figure 8D</xref>.</p>
</sec>
<sec id="S3.SS5.SSS2">
<title>3.5.2. Gut virome dataset</title>
<p>When comparing the results obtained for each tool on the <italic>gut virome dataset</italic>, we observe that, similarly to the <italic>CRC dataset</italic>, the quantity of contigs predicted as viral varies strongly by tool. However, we observe an increased consistency among the tools, and the phage communities retrieved by the tools had a larger overlap (<xref ref-type="fig" rid="F9">Figure 9A</xref>). Similar to the <italic>CRC dataset</italic>, the highest level of consistency between tools was seen for VirFinder and DeepVirFinder (60.5% of contigs found in common), and the consistency between homology-based tools remained low, with the greatest agreement found between VIBRANT and viralVerify (22.8% of predicted phage sequences in common).</p>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption><p>Comparison of the tools&#x2019; classification on real-world viromes. <bold>(A)</bold> Upset plot showing intersection size, with the <italic>x</italic>-axis in order of decreasing set size, top 51 intersections shown. <bold>(B)</bold> CheckV assessment of predicted phage contigs from the Gut Virome dataset. Predicted phage contigs from each tool are categorized by CheckV, and plotted as a stacked bar chart of the portion of predicted phages in each category.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g009.tif"/>
</fig>
<p>Most sequences retrieved by sequence-based tools were still classified as &#x201C;Not Determined&#x201D; by CheckV, suggesting a potentially high number of FPs for these tools in these conditions (<xref ref-type="fig" rid="F9">Figure 9B</xref>). Interestingly, the number of contigs classified as &#x201C;Medium Quality&#x201D; or higher by CheckV was similar among tools, with the exception of MetaPhinder and VirSorter, which retrieved fewer contigs (<xref ref-type="supplementary-material" rid="DS1">Supplementary Figure 8</xref>).</p>
</sec>
</sec>
<sec id="S3.SS6">
<title>3.6. Reusable benchmark dataset</title>
<p>The five benchmark datasets are available on Zenodo (<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.7194616">doi.org/10.5281/zenodo.7194616</ext-link>). Files for the <italic>genome fragment set</italic> include a FASTA file of genome fragments, a CSV file of taxonomic information of each fragment, and a compiled and cleaned file of the classification results of the nine tools on the genome fragments. Files for the <italic>simulated metagenome set</italic> and <italic>simulated phageome set</italic> include the assembly files in FASTA format, the binned assemblies, a CSV file of the contig taxonomic origins resulting from BLAST, and a CSV file of the compiled contig classifications by the tools. For the <italic>CRC dataset</italic> and <italic>gut virome dataset</italic>, the assemblies and bins, as well as the compiled contig classifications, are available. For all datasets, the resource usage as recorded by Snakemake is also included.</p>
<p>In addition to README files for each dataset, datasheets based on Datasheets for Datasets (<xref ref-type="bibr" rid="B22">Gebru et al., 2021</xref>) are deposited in Zenodo. These provide details about how the datasets were generated, their composition, and the intended uses.</p>
</sec>
</sec>
<sec id="S4" sec-type="discussion">
<title>4. Discussion</title>
<p>This study aims to gain a better understanding of how metagenomic phage detection tools perform under a variety of conditions. Previous studies have explored the detection and classification of dsDNA viruses and auxiliary metabolic genes, but how these tools perform under a variety of challenges remains unclear. Given differences in metagenomes based on the taxonomic composition, sequencing quality, and computational methods for analyzing these metagenomes, we examined the effects of contig length, phage taxonomy, and sequencing and assembly error. We also sought to address the robustness of tools to eukaryotic contamination and low viral content. Finally, we wanted to address the different phage communities predicted by the tools when classifying real gut metagenomes and viromes.</p>
<sec id="S4.SS1">
<title>4.1. Tool installation, reusability, and computational requirements</title>
<p>One of the first barriers to the effectiveness of a phage detection tool, and of bioinformatics tools in general, is whether it can be installed and scaled to real data. Of the 19 tools identified for this study, only 9 (47%) could be installed and run at scale. This is corroborated by related studies, such as <xref ref-type="bibr" rid="B55">Pratama et al. (2021)</xref>, which excluded PHASTER and VirMiner, and <xref ref-type="bibr" rid="B27">Ho et al. (2021)</xref>, which excluded VIROME, VirMiner, ViraMiner, PhaMers, VirNet, and VirMine from their benchmarking efforts. Tools that were available on Bioconda and PyPI were easiest to install because the tool and all dependencies could be installed simultaneously, while those on GitHub required more effort. We acknowledge the extra effort required to develop tools and create releases using Bioconda and PyPI.</p>
<p>There have been efforts to make tools more accessible, even if the original version is not so straightforward to install. One solution is sharing containers, such as on DockerHub, Biocontainers on the AWS ECR public gallery, or CyVerse (<xref ref-type="bibr" rid="B14">da Veiga Leprevost et al., 2017</xref>; <xref ref-type="bibr" rid="B13">CyVerse, 2022</xref>; <xref ref-type="bibr" rid="B16">Docker Hub, 2022</xref>). One effort specific to phage-finding tools is What the Phage, which developed a Nextflow-based workflow for running several phage detection tools in a single container (<xref ref-type="bibr" rid="B46">Marquet et al., 2022</xref>).</p>
<p>In addition to the ability to install and run a tool, scalability and resource usage are significant factors considering the scale of metagenomic data. While web services provide a convenient interface, they are often not a viable option for classifying many samples. Similarly, tools that do not allow several samples to be processed at the same time are impractical in most cases. For the tools that could be installed and run at scale, computational resource usage varied widely. Compute times for homology-based tools generally varied with contig length. Seeker had the most consistently high compute times (greater than 30 h to classify 10k fragments). While DeepVirFinder had the most consistently low compute times, its inability to process several samples in parallel hindered its actual runtimes. However, this may be circumvented by running each process in its own container, each with its own Theano backend.</p>
</sec>
<sec id="S4.SS2">
<title>4.2. Performance on short sequences</title>
<p>In this study, we compared the tools on several challenges to identify the current limitations and strengths of tools and computational approaches. First, we evaluated the effect of sequence length on classification performance. Unsurprisingly, sequence-based tools were able to work with shorter contigs than homology-based tools, as they did not require the presence of multiple genes to classify a sequence. The precision of homology-based tools was largely independent of fragment length, since shorter contigs would not be expected to affect the FPR. Finally, the sensitivity of sequence-based tools such as VirFinder, DeepVirFinder, and Seeker was largely unaffected by sequence length, allowing the retrieval of sequences as short as 500 bp. However, a decreased precision for shorter contigs could be observed for these tools, suggesting that a higher number of FPs should be expected when using these tools on shorter contigs. On the other hand, homology-based tools such as VirSorter2, VIBRANT, and viralVerify showed a reduced sensitivity on short contigs but a consistent precision for all lengths. Interestingly, MetaPhinder, which leverages multiple hits against a genome database, showed a similar effect for contig length as the sequence-based tools, with an increased FPR for shorter contigs. MetaPhinder sums the regions of the fragment that have BLASTn hits to phage genomes in a reference database, even if the hits are from distinct phage groups. This, coupled with a fairly permissive <italic>e</italic>-value of 0.05, leads to a sensitive but less precise classification for short fragments.</p>
<p>The result of this reduced sensitivity on short contigs is seen in both the <italic>simulated phageome set</italic> and <italic>simulated metagenome set</italic>, where VirSorter, viralVerify, VIBRANT, and VirSorter2 recover less than 75% of phage contigs between 1,000 and &#x223C;3,200 bp long. Considering the difficulty of assembling short reads in real metagenomes, this poses a barrier to retrieving a large number of phage sequences present only as short contigs. For perspective, 93% of the contigs in the <italic>CRC dataset</italic> were shorter than 3 kbp.</p>
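<p>The hit-summing behavior described above for MetaPhinder can be illustrated with a generic interval-merging calculation. The sketch below is not MetaPhinder&#x2019;s implementation; it only shows how hit regions pooled from any number of reference genomes add up to a covered fraction of a fragment. The function name, the half-open (start, end) coordinate convention, and the example coordinates are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
def covered_fraction(hits, fragment_length):
    """Fraction of a fragment covered by at least one hit.

    hits: iterable of (start, end) half-open intervals on the fragment,
    pooled across all reference genomes (an illustrative assumption).
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(hits):
        if current_end is None or start > current_end:
            # Close the previous merged region before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current region.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / fragment_length


# Hypothetical hits on a 1,000 bp fragment; the first two overlap.
example = covered_fraction([(0, 100), (50, 200), (400, 500)], 1000)
```
]]></preformat>
<p>Because the hits are pooled regardless of their source genome, a few short and unrelated matches can cover a sizeable share of a short fragment, which is in line with the lower precision noted above for short fragments.</p>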
</sec>
<sec id="S4.SS3">
<title>4.3. Bias toward over-represented phage groups</title>
<p>The ability of these tools to identify novel phages or phages with lower database representation can be critical for exploring many natural viral populations. We assessed this by comparing the sensitivity of the tools on the well-represented Caudoviral group and other phage groups. Importantly, all tools showed a decreased sensitivity for non-caudoviral phages. This effect was also seen when comparing the sensitivity of tools on the more abundant Myoviridae phages compared to the Siphoviridae and Podoviridae phages. This result is particularly striking as it suggests a bias in sensitivity toward the over-represented phage groups in databases. Of the tools included in this benchmark, DeepVirFinder appears to be the most suited to detecting a wider diversity of phages and showed the most consistent sensitivity across phage groups. It is important to note here that our benchmark dataset relied on RefSeq genome sequences and is therefore limited to known phages.</p>
</sec>
<sec id="S4.SS4">
<title>4.4. Low viral content and eukaryotic sequences</title>
<p>Real metagenomes typically contain low levels of viral content and may also carry eukaryotic sequences from the host (e.g., human gut microbiome) or from micro-eukaryotes. While sequences from a eukaryotic host can typically be excluded from the metagenomic dataset before viral detection, this is particularly difficult for micro-eukaryotic sequences. Of the tools compared, none used eukaryotic genomes in their training set, leading to concerns about specificity when faced with eukaryotic contamination. In this benchmark, we showed that sequence-based tools and MetaPhinder exhibited low specificity on fungal sequence fragments, while the other homology-based tools were unaffected.</p>
<p>To understand how low viral content affects precision, the precision of tools was extrapolated to the full range of possible phage content. All tools had decreased precision when viral content was low, dropping sharply when viral content was below 20%. viralVerify was the most robust to low viral content, especially on shorter contigs. The consequences of this were seen in the classification of the <italic>simulated metagenome set</italic>, in which each metagenome had 5% phage. All tools had varying and often low (below 0.5) precision, although viralVerify and VirFinder had good precision for the longest contigs.</p>
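<p>The dependence of precision on viral content follows from the definitions of sensitivity and specificity. One standard way to perform such an extrapolation, shown below as a sketch rather than as the exact procedure used in this study, assumes that a tool&#x2019;s sensitivity and specificity stay fixed while the phage fraction of the dataset changes. The function name and the example values of 0.9 are hypothetical.</p>
<preformat><![CDATA[
```python
def expected_precision(sensitivity, specificity, phage_fraction):
    """Expected precision at a given phage fraction of the dataset.

    Assumes sensitivity and specificity do not change with composition.
    """
    true_pos = sensitivity * phage_fraction
    false_pos = (1.0 - specificity) * (1.0 - phage_fraction)
    if true_pos + false_pos == 0:
        return float("nan")  # nothing is predicted as phage
    return true_pos / (true_pos + false_pos)


# Hypothetical tool with sensitivity 0.9 and specificity 0.9.
balanced = expected_precision(0.9, 0.9, 0.50)   # half of the contigs are phage
low_viral = expected_precision(0.9, 0.9, 0.05)  # 5% of the contigs are phage
```
]]></preformat>
<p>Under these assumptions, the same hypothetical tool goes from a precision of 0.9 on a balanced dataset to roughly 0.32 at 5% phage content, because FPs from the much larger non-phage fraction outnumber the TPs.</p>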
</sec>
<sec id="S4.SS5">
<title>4.5. Sequencing error and assembly quality</title>
<p>Using simulated datasets, we next aimed to assess the effect of sequencing error and assembly on each tool&#x2019;s performance. This method allowed us to develop more realistic benchmark sets while retaining the possibility to assess the true composition of the set.</p>
<p>First, we evaluated the tools&#x2019; sensitivity on a <italic>simulated phageome set</italic> composed only of simulated phage contigs, to assess the potential effect of sequencing error and misassembly. We showed the global sensitivity to be very similar to that obtained when classifying unmodified genome fragments. This indicates that sequencing error, sequencing technology, and misassembly do not hinder sensitivity significantly, at least with sufficient sequencing depth (phageomes were simulated at 30x coverage).</p>
<p>The <italic>simulated metagenome set</italic> aimed to give the most realistic estimate of real-world performance. The sensitivity of all tools again closely reflected previous results. In particular, contig length and low viral content drove the observed performance of the tools. DeepVirFinder, Seeker, and, at shorter contig lengths, MetaPhinder had precision typically below 0.5. VirSorter, viralVerify, VIBRANT, and VirSorter2 had slightly higher precision, although with high variance, often falling below 0.5. viralVerify, however, was extremely precise for long contigs. Once again, this result suggested a limited effect of sequencing error on the tools&#x2019; performance.</p>
</sec>
<sec id="S4.SS6">
<title>4.6. Comparing overlap in viral predictions</title>
<p>The previous benchmark challenges revealed vastly different properties among the tools, which affect the final results that users obtain when applying them to real-world datasets. This was further demonstrated here on the <italic>CRC dataset</italic>. The predicted phage quantity and composition are strikingly different between tools. As expected from the previous benchmarks, the homology-based tools, excluding MetaPhinder, predict far fewer phages than sequence-based tools. But most strikingly, the overlap of sequences found by several tools is surprisingly low. Of all sequences predicted to be phage by at least one tool, only 53% were predicted by two or more tools, and 25% were found by three or more tools, on average. The dissimilarity of contigs predicted as phage by the tools is so great that approximately 80% of contigs are predicted to be phage by at least one tool. Consequently, when applying these tools to real datasets, the choice of tool would strongly affect the predicted phage community.</p>
<p>The use of CheckV can help reach a larger level of agreement between tools, when used to filter out potential FPs (<xref ref-type="fig" rid="F8">Figure 8D</xref> and <xref ref-type="supplementary-material" rid="DS1">Supplementary Figure 7</xref>). In CheckV, genes are first annotated as either viral or microbial based on comparison to a large database of hidden Markov models (HMMs), and the absence of detectable viral genes in the sequence leads to a classification of the contig as &#x201C;Not determined.&#x201D; Interestingly, when contigs with Low Quality or Not Determined status are removed, then greater than 50% of contigs from the <italic>CRC dataset</italic> found by at least one tool are found by 3 or more tools; 26% of contigs found by at least one tool are found by 6 or more tools. However, CheckV is stringent, and additional contigs can be recovered by supplementing the dataset with contigs that lack genes of cellular origin, especially for metagenomes with highly novel phage.</p>
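<p>The agreement figures quoted above (the share of contigs found by at least one tool that are found by a given number of tools or more, with or without quality filtering) can be computed as sketched below. This is an illustrative sketch, not the code used in this study; the function name, the input data structures, and the quality labels (written as they appear in the text rather than as they may appear in a CheckV output file) are assumptions.</p>
<preformat><![CDATA[
```python
from collections import Counter

# Categories kept after filtering; labels follow the text, which is an assumption.
KEPT_QUALITIES = {"Medium Quality", "High Quality", "Complete"}


def support_fractions(predictions, quality=None):
    """Fraction of contigs called by at least one tool that are called by k or more.

    predictions: dict mapping tool name to a set of contig identifiers.
    quality: optional dict mapping contig identifier to a quality label;
        when given, only contigs with a label in KEPT_QUALITIES are counted.
    Returns a dict mapping k to the fraction of retained contigs with k or more calls.
    """
    counts = Counter()
    for contigs in predictions.values():
        for contig in contigs:
            if quality is None or quality.get(contig) in KEPT_QUALITIES:
                counts[contig] += 1
    total = len(counts)
    if total == 0:
        return {}
    return {
        k: sum(1 for n in counts.values() if n >= k) / total
        for k in range(1, len(predictions) + 1)
    }


# Hypothetical predictions and quality labels, for illustration only.
toy_calls = {
    "tool_A": {"c1", "c2", "c3"},
    "tool_B": {"c2", "c3", "c4"},
    "tool_C": {"c3", "c5"},
}
toy_quality = {
    "c1": "Not Determined",
    "c2": "Medium Quality",
    "c3": "Complete",
    "c4": "Low Quality",
    "c5": "High Quality",
}
unfiltered = support_fractions(toy_calls)
filtered = support_fractions(toy_calls, toy_quality)
```
]]></preformat>
<p>In this toy example, removing the Not Determined and Low Quality contigs raises the share of contigs supported by 2 or more tools from 2 of 5 to 2 of 3, mirroring the direction of the effect reported for the <italic>CRC dataset</italic>.</p>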
<p>For viral particle-enriched metagenomes (<italic>gut virome dataset</italic>), the results obtained from the different tools were more consistent. With fewer non-viral sequences, the number of FPs should be reduced, in particular for the sequence-based tools, which explains the more consistent results among the tools.</p>
</sec>
<sec id="S4.SS7">
<title>4.7. Toward a reusable benchmark dataset for viral tool assessment</title>
<p>To facilitate further benchmarking of newer tools, we have made all benchmark datasets available. This includes all input files, such as the genome fragments and the assembled simulated metagenomes. Files giving the taxonomic origins of all fragments and contigs are also available, to serve as an answer key when benchmarking new tools. The resulting classifications have been cleaned and compiled into a consistent format, such that classification results can be compared at the fragment/contig level without having to reclassify the input data. Additionally, all code used to analyze the data and generate figures is available for reference on GitHub,<sup><xref ref-type="fn" rid="footnote2">2</xref></sup> although modifications would have to be made to incorporate new tools, due to differences in output format, etc.</p>
</sec>
</sec>
<sec id="S5" sec-type="conclusion">
<title>5. Conclusion and recommendations</title>
<p>We summarized each tool&#x2019;s performance for each benchmark challenge in <xref ref-type="fig" rid="F10">Figure 10</xref>. Given these insights, the remaining question is &#x201C;What is the best solution to phage detection and prediction?&#x201D; Unfortunately, answering this important question is not straightforward, especially given the tradeoffs between precision and sensitivity. However, some general guidelines can be used to decide which tools to use. For physically purified viromes (viral particle-enriched metagenomes), precision is less of a concern, so one can prioritize sensitivity and may choose DeepVirFinder, which also has the best sensitivity to non-caudoviral phages. For metagenomes where phages are actively infecting their bacterial hosts, the research question at hand should be the main driver in deciding. To identify novel phages, DeepVirFinder or MetaPhinder may be a good choice, although the results should be further confirmed to avoid FPs, such as through the use of CheckV (<xref ref-type="bibr" rid="B49">Nayfach et al., 2021a</xref>), and when possible, host sequences should be removed prior to classification. However, if one wants to study the dominant phages present in an environment and maintain high confidence in phage calling, VirSorter2 or viralVerify may be good choices.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption><p>Breakdown of each evaluated tool&#x2019;s performance. Tools are colored such that homology-based methods are yellow and sequence-based tools are blue. Colors are scaled from the minimum to maximum value in each column. Scales are linear, except for speed which is scaled to log 10 of speed. Speed is the average number of genome contigs from the <italic>simulated metagenome set</italic> classified per second. Sensitivity and precision are averages from classifying the simulated metagenomes set. Diverse phages is the average ratio of sensitivity on non-caudoviral phages vs. caudoviral phages. Eukaryote specificity is the average specificity when classifying eukaryotic genome fragments.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1078760-g010.tif"/>
</fig>
<p>Given the high FPR for sequence-based tools, using several of these tools together to increase sensitivity would lead to decreased precision. Instead, a more robust multi-tool approach may combine sequence-based and homology-based tools to find consensus predictions.</p>
<sec id="S5.SS1">
<title>5.1. Limitations and future directions</title>
<p>This study aimed to set the basis for the development of a fair and reusable benchmark for viral detection tools. However, we want to highlight some limitations of the current study. First, we used the default parameters, reference databases, and trained models for all tools. While admittedly these may not be optimal, we believe that this is likely how a large proportion of researchers would use the tools. In particular, some tools allow users to choose a more stringent threshold to reduce false positives. The effect of such a change in parameters can be investigated in this study by looking at the precision-recall curves provided in <xref ref-type="supplementary-material" rid="DS1">Supplementary material</xref>. However, the effect of further parameter tuning and database updates was not evaluated in this benchmark and would be a valuable future effort.</p>
<p>Additionally, this benchmark did not investigate the tools&#x2019; ability to detect integrated prophage sequences or plasmid sequences. Indeed, several tools such as VirSorter, VirSorter2, and VIBRANT are able to classify prophage sequences, and PPR-Meta is able to classify plasmid sequences. Future work may in particular evaluate the accuracy of the tools in detecting prophage boundaries and the effect of prophage inactivation and genetic degradation.</p>
<p>Second, all genomes used in the <italic>genome fragment set</italic> and simulated datasets were retrieved from RefSeq. This has a significant overlap with the tools&#x2019; reference databases and training sets (<xref ref-type="table" rid="T1">Table 1</xref>); therefore, this study likely underestimates how much performance diminishes on broader phage diversity.</p>
<p>Despite these limitations, we hope the developed benchmark will be informative to users and will be further developed to include new computational challenges. It should be noted that the results presented here are limited to those tools that could be installed and run by July 2021, and since then many more tools have been published [we are aware of 3CAC (<xref ref-type="bibr" rid="B56">Pu and Shamir, 2022</xref>), DeepMicrobeFinder (<xref ref-type="bibr" rid="B29">Hou et al., 2021</xref>), INHERIT (<xref ref-type="bibr" rid="B7">Bai et al., 2022</xref>), PHAMB (<xref ref-type="bibr" rid="B32">Johansen et al., 2022</xref>), PhaMer (<xref ref-type="bibr" rid="B64">Shang et al., 2022</xref>), VirMine 2.0 (<xref ref-type="bibr" rid="B33">Johnson and Putonti, 2022</xref>), and virSearcher (<xref ref-type="bibr" rid="B43">Liu Q. et al., 2022</xref>)]. Additionally, modular pipelines such as the IMG/VR viral discovery pipeline (<xref ref-type="bibr" rid="B53">Paez-Espino et al., 2017</xref>) and computational pipelines combining several of the tools presented here were not evaluated in this work but could be assessed using the same benchmark datasets developed here. Importantly, the combination of several tools using different computational approaches could enable researchers to leverage the strength of each approach. This type of hybrid pipeline has previously been used by several large-scale viral discovery studies for human-associated metagenomes (<xref ref-type="bibr" rid="B24">Gregory et al., 2020</xref>; <xref ref-type="bibr" rid="B12">Camarillo-Guerrero et al., 2021</xref>; <xref ref-type="bibr" rid="B50">Nayfach et al., 2021b</xref>) and environmental metagenomes (<xref ref-type="bibr" rid="B31">Jian et al., 2021</xref>; <xref ref-type="bibr" rid="B26">Hegarty et al., 2022</xref>). Because one of these tools or pipelines may perform better than those studied, it would be useful to benchmark them using the data and methods developed here.</p>
</sec>
</sec>
<sec id="S6" sec-type="data-availability">
<title>Data availability statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.7194616">doi.org/10.5281/zenodo.7194616</ext-link>.</p>
</sec>
<sec id="S7" sec-type="author-contributions">
<title>Author contributions</title>
<p>KS: conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing&#x2014;original draft, writing&#x2014;review and editing, and visualization. JG: investigation, data curation, and writing&#x2014;review and editing. AP: conceptualization, methodology, software, validation, investigation, resources, data curation, writing&#x2014;review and editing, supervision, project administration, and funding acquisition. BH: conceptualization, resources, writing&#x2014;review and editing, supervision, project administration, and funding acquisition. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec id="S8" sec-type="funding-information">
<title>Funding</title>
<p>This work was supported by the grants from Academy of Finland (339172 to AP) and Gordon and Betty Moore Foundation (GBMF 8751 to BH). KS acknowledged funding from a PhD training grant through the U.S. National Institutes of Health T32GM132008. JG acknowledged funding from the MARC undergraduate training grant through the U.S. National Institutes of Health T34 GM 8718.</p>
</sec>
<ack><p>We thank Dr. Tim Secomb for financial and scientific support for this study. We thank Drs. Heidi Imker, Charles Cook, and the Global Biodata Coalition for their support of Kenneth Schackart. We thank all the members of the Hurwitz Lab for fruitful discussions and their scientific support. We also thank Dr. Kattika Kaarj for fruitful discussions and help in reviewing this manuscript.</p>
</ack>
<sec id="S9" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>BH holds concurrent appointments as an Associate Professor of Biosystems Engineering at the University of Arizona and as an Amazon Scholar. This publication describes work performed at the University of Arizona and is not associated with Amazon. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="S10" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="S11" sec-type="supplementary-material">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fmicb.2023.1078760/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fmicb.2023.1078760/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="DS1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<fn-group>
<fn id="footnote1">
<label>1</label>
<p><ext-link ext-link-type="uri" xlink:href="https://github.com/aponsero/Assembly_metagenomes">https://github.com/aponsero/Assembly_metagenomes</ext-link></p></fn>
<fn id="footnote2">
<label>2</label>
<p><ext-link ext-link-type="uri" xlink:href="https://github.com/hurwitzlab/phage_detection_benchmarks">https://github.com/hurwitzlab/phage_detection_benchmarks</ext-link></p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abdelkareem</surname> <given-names>A. O.</given-names></name> <name><surname>Khalil</surname> <given-names>M. I.</given-names></name> <name><surname>Elaraby</surname> <given-names>M.</given-names></name> <name><surname>Abbas</surname> <given-names>H.</given-names></name> <name><surname>Elbehery</surname> <given-names>A. H. A.</given-names></name></person-group> (<year>2018</year>). &#x201C;<article-title>VirNet: Deep attention model for viral reads identification</article-title>,&#x201D; in <source><italic>Proceedings of the 2018 13th international conference on computer engineering and systems (ICCES)</italic></source> (<publisher-loc>Cairo</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>623</fpage>&#x2013;<lpage>626</lpage>. <pub-id pub-id-type="doi">10.1109/ICCES.2018.8639400</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Amgarten</surname> <given-names>D.</given-names></name> <name><surname>Braga</surname> <given-names>L.</given-names></name> <name><surname>da Silva</surname> <given-names>A.</given-names></name> <name><surname>Setubal</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins.</article-title> <source><italic>Front. Genet.</italic></source> <volume>9</volume>:<issue>304</issue>. <pub-id pub-id-type="doi">10.3389/fgene.2018.00304</pub-id> <pub-id pub-id-type="pmid">30131825</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Andrews</surname> <given-names>S.</given-names></name></person-group> (<year>2010</year>). <source><italic>FastQC: A quality control tool for high throughput sequence data.</italic></source> <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Babraham Institute</publisher-name>.</citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Antipov</surname> <given-names>D.</given-names></name> <name><surname>Raiko</surname> <given-names>M.</given-names></name> <name><surname>Lapidus</surname> <given-names>A.</given-names></name> <name><surname>Pevzner</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>Metaviral SPAdes: Assembly of viruses from metagenomic data.</article-title> <source><italic>Bioinformatics</italic></source> <volume>36</volume> <fpage>4126</fpage>&#x2013;<lpage>4129</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btaa490</pub-id> <pub-id pub-id-type="pmid">32413137</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Auslander</surname> <given-names>N.</given-names></name> <name><surname>Gussow</surname> <given-names>A.</given-names></name> <name><surname>Benler</surname> <given-names>S.</given-names></name> <name><surname>Wolf</surname> <given-names>Y.</given-names></name> <name><surname>Koonin</surname> <given-names>E.</given-names></name></person-group> (<year>2020</year>). <article-title>Seeker: Alignment-free identification of bacteriophage genomes by deep learning.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>48</volume>:<issue>e121</issue>. <pub-id pub-id-type="doi">10.1093/nar/gkaa856</pub-id> <pub-id pub-id-type="pmid">33045744</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><collab>Babraham Bioinformatics</collab> (<year>2022</year>). <source><italic>&#x201C;Trim Galore!&#x201D;.</italic></source> Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/">https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/</ext-link> <comment>(accessed September 20, 2022)</comment>.</citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bai</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Miyano</surname> <given-names>S.</given-names></name> <name><surname>Yamaguchi</surname> <given-names>R.</given-names></name> <name><surname>Fujimoto</surname> <given-names>K.</given-names></name> <name><surname>Uematsu</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2022</year>). <article-title>Identification of bacteriophage genome sequences with representation learning.</article-title> <source><italic>Bioinformatics</italic></source> <volume>38</volume> <fpage>4264</fpage>&#x2013;<lpage>4270</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btac509</pub-id> <pub-id pub-id-type="pmid">35920769</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blazanin</surname> <given-names>M.</given-names></name> <name><surname>Turner</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>Community context matters for bacteria-phage ecology and evolution.</article-title> <source><italic>ISME J.</italic></source> <volume>15</volume> <fpage>3119</fpage>&#x2013;<lpage>3128</lpage>. <pub-id pub-id-type="doi">10.1038/s41396-021-01012-x</pub-id> <pub-id pub-id-type="pmid">34127803</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breitbart</surname> <given-names>M.</given-names></name> <name><surname>Rohwer</surname> <given-names>F.</given-names></name></person-group> (<year>2005</year>). <article-title>Here a virus, there a virus, everywhere the same virus?</article-title> <source><italic>Trends Microbiol.</italic></source> <volume>13</volume> <fpage>278</fpage>&#x2013;<lpage>284</lpage>. <pub-id pub-id-type="doi">10.1016/j.tim.2005.04.003</pub-id> <pub-id pub-id-type="pmid">15936660</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breitbart</surname> <given-names>M.</given-names></name> <name><surname>Bonnain</surname> <given-names>C.</given-names></name> <name><surname>Malki</surname> <given-names>K.</given-names></name> <name><surname>Sawaya</surname> <given-names>N.</given-names></name></person-group> (<year>2018</year>). <article-title>Phage puppet masters of the marine microbial realm.</article-title> <source><italic>Nat. Microbiol.</italic></source> <volume>3</volume> <fpage>754</fpage>&#x2013;<lpage>766</lpage>. <pub-id pub-id-type="doi">10.1038/s41564-018-0166-y</pub-id> <pub-id pub-id-type="pmid">29867096</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breitbart</surname> <given-names>M.</given-names></name> <name><surname>Salamon</surname> <given-names>P.</given-names></name> <name><surname>Andresen</surname> <given-names>B.</given-names></name> <name><surname>Mahaffy</surname> <given-names>J.</given-names></name> <name><surname>Segall</surname> <given-names>A.</given-names></name> <name><surname>Mead</surname> <given-names>D.</given-names></name><etal/></person-group> (<year>2002</year>). <article-title>Genomic analysis of uncultured marine viral communities.</article-title> <source><italic>Proc. Natl. Acad. Sci. U.S.A.</italic></source> <volume>99</volume> <fpage>14250</fpage>&#x2013;<lpage>14255</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.202488399</pub-id> <pub-id pub-id-type="pmid">12384570</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Camarillo-Guerrero</surname> <given-names>L.</given-names></name> <name><surname>Almeida</surname> <given-names>A.</given-names></name> <name><surname>Rangel-Pineros</surname> <given-names>G.</given-names></name> <name><surname>Finn</surname> <given-names>R.</given-names></name> <name><surname>Lawley</surname> <given-names>T.</given-names></name></person-group> (<year>2021</year>). <article-title>Massive expansion of human gut bacteriophage diversity.</article-title> <source><italic>Cell</italic></source> <volume>184</volume> <fpage>1098</fpage>&#x2013;<lpage>1109.e9</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2021.01.029</pub-id> <pub-id pub-id-type="pmid">33606979</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><collab>CyVerse</collab> (<year>2022</year>). <source><italic>CyVerse the open science workspace for collaborative data-driven discovery.</italic></source> Available online at: <ext-link ext-link-type="uri" xlink:href="https://cyverse.org/">https://cyverse.org/</ext-link> <comment>(accessed September 23, 2022)</comment>.</citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>da Veiga Leprevost</surname> <given-names>F.</given-names></name> <name><surname>Gr&#x00FC;ning</surname> <given-names>B.</given-names></name> <name><surname>Alves Aflitos</surname> <given-names>S.</given-names></name> <name><surname>R&#x00F6;st</surname> <given-names>H.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Barsnes</surname> <given-names>H.</given-names></name><etal/></person-group> (<year>2017</year>). <article-title>BioContainers: An open-source and community-driven framework for software standardization.</article-title> <source><italic>Bioinformatics</italic></source> <volume>33</volume> <fpage>2580</fpage>&#x2013;<lpage>2582</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btx192</pub-id> <pub-id pub-id-type="pmid">28379341</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deaton</surname> <given-names>J.</given-names></name> <name><surname>Yu</surname> <given-names>F. B.</given-names></name> <name><surname>Quake</surname> <given-names>S. R.</given-names></name></person-group> (<year>2017</year>). <article-title>PhaMers identifies novel bacteriophage sequences from thermophilic hot springs.</article-title> <source><italic>bioRxiv</italic></source> [<comment>Preprint</comment>]. 169672. <pub-id pub-id-type="doi">10.1101/169672</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><collab>Docker Hub</collab> (<year>2022</year>). <source><italic>Docker hub container image library | app containerization.</italic></source> Available online at: <ext-link ext-link-type="uri" xlink:href="https://hub.docker.com/">https://hub.docker.com/</ext-link> <comment>(accessed September 23, 2022)</comment>.</citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Edlund</surname> <given-names>A.</given-names></name> <name><surname>Santiago-Rodriguez</surname> <given-names>T.</given-names></name> <name><surname>Boehm</surname> <given-names>T.</given-names></name> <name><surname>Pride</surname> <given-names>D.</given-names></name></person-group> (<year>2015</year>). <article-title>Bacteriophage and their potential roles in the human oral cavity.</article-title> <source><italic>J. Oral Microbiol.</italic></source> <volume>7</volume>:<issue>27423</issue>. <pub-id pub-id-type="doi">10.3402/jom.v7.27423</pub-id> <pub-id pub-id-type="pmid">25861745</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fang</surname> <given-names>Z.</given-names></name> <name><surname>Tan</surname> <given-names>J.</given-names></name> <name><surname>Wu</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Xie</surname> <given-names>Z.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>PPR-Meta: A tool for identifying phages and plasmids from metagenomic fragments using deep learning.</article-title> <source><italic>Gigascience</italic></source> <volume>8</volume>:<issue>giz066</issue>. <pub-id pub-id-type="doi">10.1093/gigascience/giz066</pub-id> <pub-id pub-id-type="pmid">31220250</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fernandes</surname> <given-names>M.</given-names></name> <name><surname>Verstraete</surname> <given-names>S.</given-names></name> <name><surname>Phan</surname> <given-names>T.</given-names></name> <name><surname>Deng</surname> <given-names>X.</given-names></name> <name><surname>Stekol</surname> <given-names>E.</given-names></name> <name><surname>LaMere</surname> <given-names>B.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>Enteric virome and bacterial microbiota in children with ulcerative colitis and Crohn disease.</article-title> <source><italic>J. Pediatr. Gastroenterol. Nutr.</italic></source> <volume>68</volume> <fpage>30</fpage>&#x2013;<lpage>36</lpage>. <pub-id pub-id-type="doi">10.1097/MPG.0000000000002140</pub-id> <pub-id pub-id-type="pmid">30169455</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fuhrman</surname> <given-names>J.</given-names></name></person-group> (<year>1999</year>). <article-title>Marine viruses and their biogeochemical and ecological effects.</article-title> <source><italic>Nature</italic></source> <volume>399</volume> <fpage>541</fpage>&#x2013;<lpage>548</lpage>. <pub-id pub-id-type="doi">10.1038/21119</pub-id> <pub-id pub-id-type="pmid">10376593</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garretto</surname> <given-names>A.</given-names></name> <name><surname>Hatzopoulos</surname> <given-names>T.</given-names></name> <name><surname>Putonti</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>virMine: Automated detection of viral sequences from complex metagenomic samples.</article-title> <source><italic>PeerJ</italic></source> <volume>7</volume>:<issue>e6695</issue>. <pub-id pub-id-type="doi">10.7717/peerj.6695</pub-id> <pub-id pub-id-type="pmid">30993039</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gebru</surname> <given-names>T.</given-names></name> <name><surname>Morgenstern</surname> <given-names>J.</given-names></name> <name><surname>Vecchione</surname> <given-names>B.</given-names></name> <name><surname>Vaughan</surname> <given-names>J.</given-names></name> <name><surname>Wallach</surname> <given-names>H.</given-names></name> <name><surname>Daum&#x00E9;</surname> <given-names>H.</given-names> <suffix>III</suffix></name><etal/></person-group> (<year>2021</year>). <article-title>Datasheets for datasets.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. arXiv:1803.09010.</citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gourl&#x00E9;</surname> <given-names>H.</given-names></name> <name><surname>Karlsson-Lindsj&#x00F6;</surname> <given-names>O.</given-names></name> <name><surname>Hayer</surname> <given-names>J.</given-names></name> <name><surname>Bongcam-Rudloff</surname> <given-names>E.</given-names></name></person-group> (<year>2019</year>). <article-title>Simulating Illumina metagenomic data with InSilicoSeq.</article-title> <source><italic>Bioinformatics</italic></source> <volume>35</volume> <fpage>521</fpage>&#x2013;<lpage>522</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty630</pub-id> <pub-id pub-id-type="pmid">30016412</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gregory</surname> <given-names>A. C.</given-names></name> <name><surname>Zablocki</surname> <given-names>O.</given-names></name> <name><surname>Zayed</surname> <given-names>A. A.</given-names></name> <name><surname>Howell</surname> <given-names>A.</given-names></name> <name><surname>Bolduc</surname> <given-names>B.</given-names></name> <name><surname>Sullivan</surname> <given-names>M. B.</given-names></name></person-group> (<year>2020</year>). <article-title>The gut virome database reveals age-dependent patterns of virome diversity in the human gut.</article-title> <source><italic>Cell Host Microbe</italic></source> <volume>28</volume> <fpage>724</fpage>&#x2013;<lpage>740</lpage>.</citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>J.</given-names></name> <name><surname>Bolduc</surname> <given-names>B.</given-names></name> <name><surname>Zayed</surname> <given-names>A.</given-names></name> <name><surname>Varsani</surname> <given-names>A.</given-names></name> <name><surname>Dominguez-Huerta</surname> <given-names>G.</given-names></name> <name><surname>Delmont</surname> <given-names>T.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>VirSorter2: A multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses.</article-title> <source><italic>Microbiome</italic></source> <volume>9</volume>:<issue>37</issue>. <pub-id pub-id-type="doi">10.1186/s40168-020-00990-y</pub-id> <pub-id pub-id-type="pmid">33522966</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hegarty</surname> <given-names>B.</given-names></name> <name><surname>Dai</surname> <given-names>Z.</given-names></name> <name><surname>Raskin</surname> <given-names>L.</given-names></name> <name><surname>Pinto</surname> <given-names>A.</given-names></name> <name><surname>Wigginton</surname> <given-names>K.</given-names></name> <name><surname>Duhaime</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>A snapshot of the global drinking water virome: Diversity and metabolic potential vary with residual disinfectant use.</article-title> <source><italic>Water Res.</italic></source> <volume>218</volume>:<issue>118484</issue>. <pub-id pub-id-type="doi">10.1016/j.watres.2022.118484</pub-id> <pub-id pub-id-type="pmid">35504157</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ho</surname> <given-names>S. F.</given-names></name> <name><surname>Millard</surname> <given-names>A. D.</given-names></name> <name><surname>van Schaik</surname> <given-names>W.</given-names></name></person-group> (<year>2021</year>). <article-title>Comprehensive benchmarking of tools to identify phages in metagenomic shotgun sequencing data.</article-title> <source><italic>bioRxiv</italic></source> [<comment>Preprint</comment>].</citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ho</surname> <given-names>S. F.</given-names></name> <name><surname>Wheeler</surname> <given-names>N.</given-names></name> <name><surname>Millard</surname> <given-names>A. D.</given-names></name> <name><surname>van Schaik</surname> <given-names>W.</given-names></name></person-group> (<year>2022</year>). <article-title>Gauge your phage: Benchmarking of bacteriophage identification tools in metagenomic sequencing data.</article-title> <source><italic>bioRxiv</italic></source> [<comment>Preprint</comment>]. <pub-id pub-id-type="doi">10.1101/2021.04.12.438782</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hou</surname> <given-names>S.</given-names></name> <name><surname>Cheng</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Fuhrman</surname> <given-names>J. A.</given-names></name> <name><surname>Sun</surname> <given-names>F.</given-names></name></person-group> (<year>2021</year>). <article-title>DeepMicrobeFinder sorts metagenomes into prokaryotes, eukaryotes and viruses, with marine applications.</article-title> <source><italic>bioRxiv</italic></source> [<comment>Preprint</comment>]. <pub-id pub-id-type="doi">10.1101/2021.10.26.466018</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hurwitz</surname> <given-names>B.</given-names></name> <name><surname>U&#x2019;Ren</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Viral metabolic reprogramming in marine ecosystems.</article-title> <source><italic>Curr. Opin. Microbiol.</italic></source> <volume>31</volume> <fpage>161</fpage>&#x2013;<lpage>168</lpage>. <pub-id pub-id-type="doi">10.1016/j.mib.2016.04.002</pub-id> <pub-id pub-id-type="pmid">27088500</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jian</surname> <given-names>H.</given-names></name> <name><surname>Yi</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Hao</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>Diversity and distribution of viruses inhabiting the deepest ocean on Earth.</article-title> <source><italic>ISME J.</italic></source> <volume>15</volume> <fpage>3094</fpage>&#x2013;<lpage>3110</lpage>. <pub-id pub-id-type="doi">10.1038/s41396-021-00994-y</pub-id> <pub-id pub-id-type="pmid">33972725</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johansen</surname> <given-names>J.</given-names></name> <name><surname>Plichta</surname> <given-names>D.</given-names></name> <name><surname>Nissen</surname> <given-names>J.</given-names></name> <name><surname>Jespersen</surname> <given-names>M.</given-names></name> <name><surname>Shah</surname> <given-names>S.</given-names></name> <name><surname>Deng</surname> <given-names>L.</given-names></name><etal/></person-group> (<year>2022</year>). <article-title>Genome binning of viral entities from bulk metagenomics data.</article-title> <source><italic>Nat. Commun.</italic></source> <volume>13</volume>:<issue>965</issue>. <pub-id pub-id-type="doi">10.1038/s41467-022-28581-5</pub-id> <pub-id pub-id-type="pmid">35181661</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>G.</given-names></name> <name><surname>Putonti</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>virMine 2.0: Identifying viral sequences in microbial communities.</article-title> <source><italic>Microbiol. Resour. Announc.</italic></source> <volume>11</volume>:<issue>e0010722</issue>. <pub-id pub-id-type="doi">10.1128/mra.00107-22</pub-id> <pub-id pub-id-type="pmid">35499341</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jurtz</surname> <given-names>V.</given-names></name> <name><surname>Villarroel</surname> <given-names>J.</given-names></name> <name><surname>Lund</surname> <given-names>O.</given-names></name> <name><surname>Voldby Larsen</surname> <given-names>M.</given-names></name> <name><surname>Nielsen</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>MetaPhinder-identifying bacteriophage sequences in metagenomic data sets.</article-title> <source><italic>PLoS One</italic></source> <volume>11</volume>:<issue>e0163111</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0163111</pub-id> <pub-id pub-id-type="pmid">27684958</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kang</surname> <given-names>D.</given-names></name> <name><surname>Li</surname> <given-names>F.</given-names></name> <name><surname>Kirton</surname> <given-names>E.</given-names></name> <name><surname>Thomas</surname> <given-names>A.</given-names></name> <name><surname>Egan</surname> <given-names>R.</given-names></name> <name><surname>An</surname> <given-names>H.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies.</article-title> <source><italic>PeerJ</italic></source> <volume>7</volume>:<issue>e7359</issue>. <pub-id pub-id-type="doi">10.7717/peerj.7359</pub-id> <pub-id pub-id-type="pmid">31388474</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karl</surname> <given-names>D. M.</given-names></name> <name><surname>Lukas</surname> <given-names>R.</given-names></name></person-group> (<year>1996</year>). <article-title>The Hawaii ocean time-series (HOT) program: Background, rationale and field implementation.</article-title> <source><italic>Deep Sea Res. II Top. Stud. Oceanogr.</italic></source> <volume>43</volume> <fpage>129</fpage>&#x2013;<lpage>156</lpage>. <pub-id pub-id-type="doi">10.1016/0967-0645(96)00005-7</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kieft</surname> <given-names>K.</given-names></name> <name><surname>Zhou</surname> <given-names>Z.</given-names></name> <name><surname>Anantharaman</surname> <given-names>K.</given-names></name></person-group> (<year>2020</year>). <article-title>VIBRANT: Automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences.</article-title> <source><italic>Microbiome</italic></source> <volume>8</volume>:<issue>90</issue>. <pub-id pub-id-type="doi">10.1186/s40168-020-00867-0</pub-id> <pub-id pub-id-type="pmid">32522236</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>K&#x00F6;ster</surname> <given-names>J.</given-names></name> <name><surname>Rahmann</surname> <given-names>S.</given-names></name></person-group> (<year>2012</year>). <article-title>Snakemake&#x2013;a scalable bioinformatics workflow engine.</article-title> <source><italic>Bioinformatics</italic></source> <volume>28</volume> <fpage>2520</fpage>&#x2013;<lpage>2522</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bts480</pub-id> <pub-id pub-id-type="pmid">22908215</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Langmead</surname> <given-names>B.</given-names></name> <name><surname>Salzberg</surname> <given-names>S.</given-names></name></person-group> (<year>2012</year>). <article-title>Fast gapped-read alignment with Bowtie 2.</article-title> <source><italic>Nat. Methods</italic></source> <volume>9</volume> <fpage>357</fpage>&#x2013;<lpage>359</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.1923</pub-id> <pub-id pub-id-type="pmid">22388286</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>D.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Luo</surname> <given-names>R.</given-names></name> <name><surname>Sadakane</surname> <given-names>K.</given-names></name> <name><surname>Lam</surname> <given-names>T. W.</given-names></name></person-group> (<year>2015</year>). <article-title>MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.</article-title> <source><italic>Bioinformatics</italic></source> <volume>31</volume> <fpage>1674</fpage>&#x2013;<lpage>1676</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv033</pub-id> <pub-id pub-id-type="pmid">25609793</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Handley</surname> <given-names>S.</given-names></name> <name><surname>Baldridge</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>The dark side of the gut: Virome-host interactions in intestinal homeostasis and disease.</article-title> <source><italic>J. Exp. Med.</italic></source> <volume>218</volume>:<issue>e20201044</issue>. <pub-id pub-id-type="doi">10.1084/jem.20201044</pub-id> <pub-id pub-id-type="pmid">33760921</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>F.</given-names></name> <name><surname>Miao</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Hou</surname> <given-names>T.</given-names></name></person-group> (<year>2022</year>). <article-title>RNN-VirSeeker: A deep learning method for identification of short viral sequences from metagenomes.</article-title> <source><italic>IEEE/ACM Trans. Comput. Biol. Bioinform.</italic></source> <volume>19</volume> <fpage>1840</fpage>&#x2013;<lpage>1849</lpage>. <pub-id pub-id-type="doi">10.1109/TCBB.2020.3044575</pub-id> <pub-id pub-id-type="pmid">33315571</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Q.</given-names></name> <name><surname>Liu</surname> <given-names>F.</given-names></name> <name><surname>Miao</surname> <given-names>Y.</given-names></name> <name><surname>He</surname> <given-names>J.</given-names></name> <name><surname>Dong</surname> <given-names>T.</given-names></name> <name><surname>Hou</surname> <given-names>T.</given-names></name><etal/></person-group> (<year>2022</year>). <article-title>virSearcher: Identifying bacteriophages from metagenomes by combining convolutional neural network and gene information.</article-title> <source><italic>IEEE/ACM Trans. Comput. Biol. Bioinform.</italic></source> <pub-id pub-id-type="doi">10.1109/TCBB.2022.3161135</pub-id> <pub-id pub-id-type="pmid">35316191</pub-id></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Breitwieser</surname> <given-names>F. P.</given-names></name> <name><surname>Thielen</surname> <given-names>P.</given-names></name> <name><surname>Salzberg</surname> <given-names>S. L.</given-names></name></person-group> (<year>2017</year>). <article-title>Bracken: Estimating species abundance in metagenomics data.</article-title> <source><italic>PeerJ Comput. Sci.</italic></source> <volume>3</volume>:<issue>e104</issue>. <pub-id pub-id-type="doi">10.7717/peerj-cs.104</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Manrique</surname> <given-names>P.</given-names></name> <name><surname>Dills</surname> <given-names>M.</given-names></name> <name><surname>Young</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>The human gut phage community and its implications for health and disease.</article-title> <source><italic>Viruses</italic></source> <volume>9</volume>:<issue>141</issue>. <pub-id pub-id-type="doi">10.3390/v9060141</pub-id> <pub-id pub-id-type="pmid">28594392</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marquet</surname> <given-names>M.</given-names></name> <name><surname>H&#x00F6;lzer</surname> <given-names>M.</given-names></name> <name><surname>Pletz</surname> <given-names>M.</given-names></name> <name><surname>Viehweger</surname> <given-names>A.</given-names></name> <name><surname>Makarewicz</surname> <given-names>O.</given-names></name> <name><surname>Ehricht</surname> <given-names>R.</given-names></name><etal/></person-group> (<year>2022</year>). <article-title>What the phage: A scalable workflow for the identification and analysis of phage sequences.</article-title> <source><italic>Gigascience</italic></source> <volume>11</volume>:<issue>giac110</issue>. <pub-id pub-id-type="doi">10.1093/gigascience/giac110</pub-id> <pub-id pub-id-type="pmid">36399058</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McElroy</surname> <given-names>K.</given-names></name> <name><surname>Luciani</surname> <given-names>F.</given-names></name> <name><surname>Thomas</surname> <given-names>T.</given-names></name></person-group> (<year>2012</year>). <article-title>GemSIM: General, error-model based simulator of next-generation sequencing data.</article-title> <source><italic>BMC Genomics</italic></source> <volume>13</volume>:<issue>74</issue>. <pub-id pub-id-type="doi">10.1186/1471-2164-13-74</pub-id> <pub-id pub-id-type="pmid">22336055</pub-id></citation></ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meyer</surname> <given-names>F.</given-names></name> <name><surname>Bremges</surname> <given-names>A.</given-names></name> <name><surname>Belmann</surname> <given-names>P.</given-names></name> <name><surname>Janssen</surname> <given-names>S.</given-names></name> <name><surname>McHardy</surname> <given-names>A.</given-names></name> <name><surname>Koslicki</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>Assessing taxonomic metagenome profilers with OPAL.</article-title> <source><italic>Genome Biol.</italic></source> <volume>20</volume>:<issue>51</issue>. <pub-id pub-id-type="doi">10.1186/s13059-019-1646-y</pub-id> <pub-id pub-id-type="pmid">30832730</pub-id></citation></ref>
<ref id="B49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nayfach</surname> <given-names>S.</given-names></name> <name><surname>Camargo</surname> <given-names>A.</given-names></name> <name><surname>Schulz</surname> <given-names>F.</given-names></name> <name><surname>Eloe-Fadrosh</surname> <given-names>E.</given-names></name> <name><surname>Roux</surname> <given-names>S.</given-names></name> <name><surname>Kyrpides</surname> <given-names>N.</given-names></name></person-group> (<year>2021a</year>). <article-title>CheckV assesses the quality and completeness of metagenome-assembled viral genomes.</article-title> <source><italic>Nat. Biotechnol.</italic></source> <volume>39</volume> <fpage>578</fpage>&#x2013;<lpage>585</lpage>. <pub-id pub-id-type="doi">10.1038/s41587-020-00774-7</pub-id> <pub-id pub-id-type="pmid">33349699</pub-id></citation></ref>
<ref id="B50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nayfach</surname> <given-names>S.</given-names></name> <name><surname>P&#x00E1;ez-Espino</surname> <given-names>D.</given-names></name> <name><surname>Call</surname> <given-names>L.</given-names></name> <name><surname>Low</surname> <given-names>S.</given-names></name> <name><surname>Sberro</surname> <given-names>H.</given-names></name> <name><surname>Ivanova</surname> <given-names>N.</given-names></name><etal/></person-group> (<year>2021b</year>). <article-title>Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome.</article-title> <source><italic>Nat. Microbiol.</italic></source> <volume>6</volume> <fpage>960</fpage>&#x2013;<lpage>970</lpage>. <pub-id pub-id-type="doi">10.1038/s41564-021-00928-6</pub-id> <pub-id pub-id-type="pmid">34168315</pub-id></citation></ref>
<ref id="B51"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ofir</surname> <given-names>G.</given-names></name> <name><surname>Sorek</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>Contemporary phage biology: From classic models to new insights.</article-title> <source><italic>Cell</italic></source> <volume>172</volume> <fpage>1260</fpage>&#x2013;<lpage>1270</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2017.10.045</pub-id> <pub-id pub-id-type="pmid">29522746</pub-id></citation></ref>
<ref id="B52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x2019;Leary</surname> <given-names>N.</given-names></name> <name><surname>Wright</surname> <given-names>M.</given-names></name> <name><surname>Brister</surname> <given-names>J.</given-names></name> <name><surname>Ciufo</surname> <given-names>S.</given-names></name> <name><surname>Haddad</surname> <given-names>D.</given-names></name> <name><surname>McVeigh</surname> <given-names>R.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Reference sequence (RefSeq) database at NCBI: Current status, taxonomic expansion, and functional annotation.</article-title> <source><italic>Nucleic Acids Res.</italic></source> <volume>44</volume> <fpage>D733</fpage>&#x2013;<lpage>D745</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkv1189</pub-id> <pub-id pub-id-type="pmid">26553804</pub-id></citation></ref>
<ref id="B53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paez-Espino</surname> <given-names>D.</given-names></name> <name><surname>Pavlopoulos</surname> <given-names>G. A.</given-names></name> <name><surname>Ivanova</surname> <given-names>N. N.</given-names></name> <name><surname>Kyrpides</surname> <given-names>N. C.</given-names></name></person-group> (<year>2017</year>). <article-title>Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data.</article-title> <source><italic>Nat. Protoc.</italic></source> <volume>12</volume> <fpage>1673</fpage>&#x2013;<lpage>1682</lpage>. <pub-id pub-id-type="doi">10.1038/nprot.2017.063</pub-id> <pub-id pub-id-type="pmid">28749930</pub-id></citation></ref>
<ref id="B54"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ponsero</surname> <given-names>A.</given-names></name> <name><surname>Hurwitz</surname> <given-names>B.</given-names></name></person-group> (<year>2019</year>). <article-title>The promises and pitfalls of machine learning for detecting viruses in aquatic metagenomes.</article-title> <source><italic>Front. Microbiol.</italic></source> <volume>10</volume>:<issue>806</issue>. <pub-id pub-id-type="doi">10.3389/fmicb.2019.00806</pub-id> <pub-id pub-id-type="pmid">31057513</pub-id></citation></ref>
<ref id="B55"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pratama</surname> <given-names>A.</given-names></name> <name><surname>Bolduc</surname> <given-names>B.</given-names></name> <name><surname>Zayed</surname> <given-names>A.</given-names></name> <name><surname>Zhong</surname> <given-names>Z.</given-names></name> <name><surname>Guo</surname> <given-names>J.</given-names></name> <name><surname>Vik</surname> <given-names>D.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>Expanding standards in viromics: In silico evaluation of dsDNA viral genome identification, classification, and auxiliary metabolic gene curation.</article-title> <source><italic>PeerJ</italic></source> <volume>9</volume>:<issue>e11447</issue>. <pub-id pub-id-type="doi">10.7717/peerj.11447</pub-id> <pub-id pub-id-type="pmid">34178438</pub-id></citation></ref>
<ref id="B56"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pu</surname> <given-names>L.</given-names></name> <name><surname>Shamir</surname> <given-names>R.</given-names></name></person-group> (<year>2022</year>). <article-title>3CAC: Improving the classification of phages and plasmids in metagenomic assemblies using assembly graphs.</article-title> <source><italic>Bioinformatics</italic></source> <volume>38</volume>(<issue>Suppl. 2</issue>) <fpage>ii56</fpage>&#x2013;<lpage>ii61</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btac468</pub-id> <pub-id pub-id-type="pmid">36124804</pub-id></citation></ref>
<ref id="B57"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>J.</given-names></name> <name><surname>Ahlgren</surname> <given-names>N.</given-names></name> <name><surname>Lu</surname> <given-names>Y.</given-names></name> <name><surname>Fuhrman</surname> <given-names>J.</given-names></name> <name><surname>Sun</surname> <given-names>F.</given-names></name></person-group> (<year>2017</year>). <article-title>VirFinder: A novel k-mer based tool for identifying viral sequences from assembled metagenomic data.</article-title> <source><italic>Microbiome</italic></source> <volume>5</volume>:<issue>69</issue>. <pub-id pub-id-type="doi">10.1186/s40168-017-0283-5</pub-id> <pub-id pub-id-type="pmid">28683828</pub-id></citation></ref>
<ref id="B58"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>J.</given-names></name> <name><surname>Song</surname> <given-names>K.</given-names></name> <name><surname>Deng</surname> <given-names>C.</given-names></name> <name><surname>Ahlgren</surname> <given-names>N.</given-names></name> <name><surname>Fuhrman</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name><etal/></person-group> (<year>2020</year>). <article-title>Identifying viruses from metagenomic data using deep learning.</article-title> <source><italic>Quant. Biol.</italic></source> <volume>8</volume> <fpage>64</fpage>&#x2013;<lpage>77</lpage>. <pub-id pub-id-type="doi">10.1007/s40484-019-0187-4</pub-id> <pub-id pub-id-type="pmid">34084563</pub-id></citation></ref>
<ref id="B59"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Richter</surname> <given-names>D.</given-names></name> <name><surname>Ott</surname> <given-names>F.</given-names></name> <name><surname>Auch</surname> <given-names>A.</given-names></name> <name><surname>Schmid</surname> <given-names>R.</given-names></name> <name><surname>Huson</surname> <given-names>D.</given-names></name></person-group> (<year>2008</year>). <article-title>MetaSim: A sequencing simulator for genomics and metagenomics.</article-title> <source><italic>PLoS One</italic></source> <volume>3</volume>:<issue>e3373</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0003373</pub-id> <pub-id pub-id-type="pmid">18841204</pub-id></citation></ref>
<ref id="B60"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roach</surname> <given-names>M. J.</given-names></name> <name><surname>McNair</surname> <given-names>K.</given-names></name> <name><surname>Michalczyk</surname> <given-names>M.</given-names></name> <name><surname>Giles</surname> <given-names>S. K.</given-names></name> <name><surname>Inglis</surname> <given-names>L. K.</given-names></name> <name><surname>Pargin</surname> <given-names>E.</given-names></name><etal/></person-group> (<year>2022</year>). <article-title>Philympics 2021: Prophage predictions perplex programs.</article-title> <source><italic>F1000Res.</italic></source> <volume>10</volume>. <pub-id pub-id-type="doi">10.12688/f1000research.54449.2</pub-id></citation></ref>
<ref id="B61"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roux</surname> <given-names>S.</given-names></name> <name><surname>Enault</surname> <given-names>F.</given-names></name> <name><surname>Hurwitz</surname> <given-names>B.</given-names></name> <name><surname>Sullivan</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>VirSorter: Mining viral signal from microbial genomic data.</article-title> <source><italic>PeerJ</italic></source> <volume>3</volume>:<issue>e985</issue>. <pub-id pub-id-type="doi">10.7717/peerj.985</pub-id> <pub-id pub-id-type="pmid">26038737</pub-id></citation></ref>
<ref id="B62"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Satinsky</surname> <given-names>B.</given-names></name> <name><surname>Zielinski</surname> <given-names>B.</given-names></name> <name><surname>Doherty</surname> <given-names>M.</given-names></name> <name><surname>Smith</surname> <given-names>C.</given-names></name> <name><surname>Sharma</surname> <given-names>S.</given-names></name> <name><surname>Paul</surname> <given-names>J.</given-names></name><etal/></person-group> (<year>2014</year>). <article-title>The Amazon continuum dataset: Quantitative metagenomic and metatranscriptomic inventories of the Amazon River plume, June 2010.</article-title> <source><italic>Microbiome</italic></source> <volume>2</volume>:<issue>17</issue>. <pub-id pub-id-type="doi">10.1186/2049-2618-2-17</pub-id> <pub-id pub-id-type="pmid">24883185</pub-id></citation></ref>
<ref id="B63"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sczyrba</surname> <given-names>A.</given-names></name> <name><surname>Hofmann</surname> <given-names>P.</given-names></name> <name><surname>Belmann</surname> <given-names>P.</given-names></name> <name><surname>Koslicki</surname> <given-names>D.</given-names></name> <name><surname>Janssen</surname> <given-names>S.</given-names></name> <name><surname>Dr&#x00F6;ge</surname> <given-names>J.</given-names></name><etal/></person-group> (<year>2017</year>). <article-title>Critical assessment of metagenome interpretation &#x2013; a benchmark of metagenomics software.</article-title> <source><italic>Nat. Methods</italic></source> <volume>14</volume> <fpage>1063</fpage>&#x2013;<lpage>1071</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.4458</pub-id> <pub-id pub-id-type="pmid">28967888</pub-id></citation></ref>
<ref id="B64"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shang</surname> <given-names>J.</given-names></name> <name><surname>Tang</surname> <given-names>X.</given-names></name> <name><surname>Guo</surname> <given-names>R.</given-names></name> <name><surname>Sun</surname> <given-names>Y.</given-names></name></person-group> (<year>2022</year>). <article-title>Accurate identification of bacteriophages from metagenomic data using transformer.</article-title> <source><italic>Brief. Bioinform.</italic></source> <volume>23</volume>:<issue>bbac258</issue>. <pub-id pub-id-type="doi">10.1093/bib/bbac258</pub-id> <pub-id pub-id-type="pmid">35769000</pub-id></citation></ref>
<ref id="B65"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>N.</given-names></name> <name><surname>Bhatia</surname> <given-names>S.</given-names></name> <name><surname>Sodhi</surname> <given-names>A.</given-names></name> <name><surname>Batra</surname> <given-names>N.</given-names></name></person-group> (<year>2018</year>). <article-title>Oral microbiome and health.</article-title> <source><italic>AIMS Microbiol.</italic></source> <volume>4</volume> <fpage>42</fpage>&#x2013;<lpage>66</lpage>. <pub-id pub-id-type="doi">10.3934/microbiol.2018.1.42</pub-id> <pub-id pub-id-type="pmid">31294203</pub-id></citation></ref>
<ref id="B66"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tampuu</surname> <given-names>A.</given-names></name> <name><surname>Bzhalava</surname> <given-names>Z.</given-names></name> <name><surname>Dillner</surname> <given-names>J.</given-names></name> <name><surname>Vicente</surname> <given-names>R.</given-names></name></person-group> (<year>2019</year>). <article-title>ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples.</article-title> <source><italic>PLoS One</italic></source> <volume>14</volume>:<issue>e0222271</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0222271</pub-id> <pub-id pub-id-type="pmid">31509583</pub-id></citation></ref>
<ref id="B67"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tisza</surname> <given-names>M.</given-names></name> <name><surname>Belford</surname> <given-names>A.</given-names></name> <name><surname>Dom&#x00ED;nguez-Huerta</surname> <given-names>G.</given-names></name> <name><surname>Bolduc</surname> <given-names>B.</given-names></name> <name><surname>Buck</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). <article-title>Cenote-taker 2 democratizes virus discovery and sequence annotation.</article-title> <source><italic>Virus Evol.</italic></source> <volume>7</volume>:<issue>veaa100</issue>. <pub-id pub-id-type="doi">10.1093/ve/veaa100</pub-id> <pub-id pub-id-type="pmid">33505708</pub-id></citation></ref>
<ref id="B68"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Turner</surname> <given-names>D.</given-names></name> <name><surname>Kropinski</surname> <given-names>A.</given-names></name> <name><surname>Adriaenssens</surname> <given-names>E. M. A.</given-names></name></person-group> (<year>2021</year>). <article-title>Roadmap for genome-based phage taxonomy.</article-title> <source><italic>Viruses</italic></source> <volume>13</volume>:<issue>506</issue>. <pub-id pub-id-type="doi">10.3390/v13030506</pub-id> <pub-id pub-id-type="pmid">33803862</pub-id></citation></ref>
<ref id="B69"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wommack</surname> <given-names>K.</given-names></name> <name><surname>Bhavsar</surname> <given-names>J.</given-names></name> <name><surname>Polson</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Dumas</surname> <given-names>M.</given-names></name> <name><surname>Srinivasiah</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2012</year>). <article-title>VIROME: A standard operating procedure for analysis of viral metagenome sequences.</article-title> <source><italic>Stand. Genomic Sci.</italic></source> <volume>6</volume> <fpage>427</fpage>&#x2013;<lpage>439</lpage>. <pub-id pub-id-type="doi">10.4056/sigs.2945050</pub-id> <pub-id pub-id-type="pmid">23407591</pub-id></citation></ref>
<ref id="B70"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wood</surname> <given-names>D.</given-names></name> <name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Langmead</surname> <given-names>B.</given-names></name></person-group> (<year>2019</year>). <article-title>Improved metagenomic analysis with Kraken 2.</article-title> <source><italic>Genome Biol.</italic></source> <volume>20</volume>:<issue>257</issue>. <pub-id pub-id-type="doi">10.1186/s13059-019-1891-0</pub-id> <pub-id pub-id-type="pmid">31779668</pub-id></citation></ref>
<ref id="B71"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yoo</surname> <given-names>A. B.</given-names></name> <name><surname>Jette</surname> <given-names>M. A.</given-names></name> <name><surname>Grondona</surname> <given-names>M.</given-names></name></person-group> (<year>2003</year>). &#x201C;<article-title>SLURM: Simple Linux utility for resource management</article-title>,&#x201D; in <source><italic>Job scheduling strategies for parallel processing</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Feitelson</surname> <given-names>D.</given-names></name> <name><surname>Rudolph</surname> <given-names>L.</given-names></name> <name><surname>Schwiegelshohn</surname> <given-names>U.</given-names></name></person-group> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>44</fpage>&#x2013;<lpage>60</lpage>. <pub-id pub-id-type="doi">10.1007/10968987_3</pub-id></citation></ref>
<ref id="B72"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Du</surname> <given-names>F.</given-names></name> <name><surname>Ban</surname> <given-names>R.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>SimuSCoP: Reliably simulate Illumina sequencing data based on position and context dependent profiles.</article-title> <source><italic>BMC Bioinformatics</italic></source> <volume>21</volume>:<issue>331</issue>. <pub-id pub-id-type="doi">10.1186/s12859-020-03665-5</pub-id> <pub-id pub-id-type="pmid">32703148</pub-id></citation></ref>
<ref id="B73"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zeller</surname> <given-names>G.</given-names></name> <name><surname>Tap</surname> <given-names>J.</given-names></name> <name><surname>Voigt</surname> <given-names>A.</given-names></name> <name><surname>Sunagawa</surname> <given-names>S.</given-names></name> <name><surname>Kultima</surname> <given-names>J.</given-names></name> <name><surname>Costea</surname> <given-names>P.</given-names></name><etal/></person-group> (<year>2014</year>). <article-title>Potential of fecal microbiota for early-stage detection of colorectal cancer.</article-title> <source><italic>Mol. Syst. Biol.</italic></source> <volume>10</volume>:<issue>766</issue>. <pub-id pub-id-type="doi">10.15252/msb.20145645</pub-id> <pub-id pub-id-type="pmid">25432777</pub-id></citation></ref>
<ref id="B74"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>G.</given-names></name> <name><surname>Wu</surname> <given-names>G.</given-names></name> <name><surname>Lim</surname> <given-names>E.</given-names></name> <name><surname>Droit</surname> <given-names>L.</given-names></name> <name><surname>Krishnamurthy</surname> <given-names>S.</given-names></name> <name><surname>Barouch</surname> <given-names>D.</given-names></name><etal/></person-group> (<year>2017</year>). <article-title>VirusSeeker, a computational pipeline for virus discovery and virome composition analysis.</article-title> <source><italic>Virology</italic></source> <volume>503</volume> <fpage>21</fpage>&#x2013;<lpage>30</lpage>. <pub-id pub-id-type="doi">10.1016/j.virol.2017.01.005</pub-id> <pub-id pub-id-type="pmid">28110145</pub-id></citation></ref>
<ref id="B75"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>M.</given-names></name> <name><surname>Liu</surname> <given-names>D.</given-names></name> <name><surname>Qu</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>Systematic review of next-generation sequencing simulators: Computational tools, features and perspectives.</article-title> <source><italic>Brief. Funct. Genomics</italic></source> <volume>16</volume> <fpage>121</fpage>&#x2013;<lpage>128</lpage>. <pub-id pub-id-type="doi">10.1093/bfgp/elw012</pub-id> <pub-id pub-id-type="pmid">27069250</pub-id></citation></ref>
<ref id="B76"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>T.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Ni</surname> <given-names>Y.</given-names></name> <name><surname>Kang</surname> <given-names>K.</given-names></name> <name><surname>Misiakou</surname> <given-names>M.</given-names></name> <name><surname>Imamovic</surname> <given-names>L.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>Mining, analyzing, and integrating viral signals from metagenomic data.</article-title> <source><italic>Microbiome</italic></source> <volume>7</volume>:<issue>42</issue>. <pub-id pub-id-type="doi">10.1186/s40168-019-0657-y</pub-id> <pub-id pub-id-type="pmid">30890181</pub-id></citation></ref>
</ref-list>
</back>
</article>