<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Microbiol.</journal-id>
<journal-title>Frontiers in Microbiology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Microbiol.</abbrev-journal-title>
<issn pub-type="epub">1664-302X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmicb.2023.1197329</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Microbiology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Identification of microbial metabolic functional guilds from large genomic datasets</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Reynolds</surname> <given-names>Ryan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2253957/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Hyun</surname> <given-names>Sangwon</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2334259/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Tully</surname> <given-names>Benjamin</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/43379/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Bien</surname> <given-names>Jacob</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Levine</surname> <given-names>Naomi M.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/465358/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Marine and Environmental Biology, University of Southern California</institution>, <addr-line>Los Angeles, CA</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Data Sciences and Operations, University of Southern California</institution>, <addr-line>Los Angeles, CA</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Wrigley Institute for Environmental Studies, University of Southern California</institution>, <addr-line>Los Angeles, CA</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Jackie L. Collier, Stony Brook University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Jeffrey A. Kimbrel, Lawrence Livermore National Laboratory (DOE), United States; Sixing Huang, German Collection of Microorganisms and Cell Cultures GmbH (DSMZ), Germany</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Naomi M. Levine <email>n.levine&#x00040;usc.edu</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>06</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>14</volume>
<elocation-id>1197329</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>03</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>05</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Reynolds, Hyun, Tully, Bien and Levine.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Reynolds, Hyun, Tully, Bien and Levine</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Heterotrophic microbes play an important role in the Earth System as key drivers of major biogeochemical cycles. Specifically, the consumption rate of organic matter is set by the interaction between diverse microbial communities and the chemical and physical environment in which they reside. Modeling these dynamics requires reducing the complexity of microbial communities and linking directly with biogeochemical functions. Microbial metabolic functional guilds provide one approach for reducing microbial complexity and incorporating microbial biogeochemical functions into models. However, we lack a way to identify these guilds. In this study, we present a method for defining metabolic functional guilds from annotated genomes, which are derived from both uncultured and cultured organisms. This method utilizes an Aspect Bernoulli (AB) model and was tested on three large genomic datasets with 1,733&#x02013;3,840 genomes each. Ecologically relevant microbial metabolic functional guilds were identified including guilds related to DMSP degradation, dissimilatory nitrate reduction to ammonia, and motile copiotrophy. This method presents a way to generate hypotheses about functions co-occurring within individual microbes without relying on cultured representatives. Applying the concept of metabolic functional guilds to environmental samples will provide new insight into the role that heterotrophic microbial communities play in setting rates of carbon cycling.</p></abstract>
<kwd-group>
<kwd>modeling</kwd>
<kwd>community assembly</kwd>
<kwd>biogeochemical cycling</kwd>
<kwd>marine microbiology</kwd>
<kwd>microbial metabolisms</kwd>
<kwd>functional guilds</kwd>
</kwd-group>
<contract-sponsor id="cn001">Simons Foundation<named-content content-type="fundref-id">10.13039/100000893</named-content></contract-sponsor>
<counts>
<fig-count count="7"/>
<table-count count="1"/>
<equation-count count="6"/>
<ref-count count="93"/>
<page-count count="15"/>
<word-count count="12821"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Aquatic Microbiology</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1. Introduction</title>
<p>Microbes are the engines that drive many global processes critical for maintaining Earth as a habitable planet, including the cycling of carbon and nitrogen. In particular, heterotrophic microbes (bacteria and archaea) control the rate at which organic compounds are cycled (Pomeroy, <xref ref-type="bibr" rid="B55">1974</xref>; Fuhrman and Azam, <xref ref-type="bibr" rid="B17">1980</xref>, <xref ref-type="bibr" rid="B18">1982</xref>; Falkowski et al., <xref ref-type="bibr" rid="B15">2008</xref>), which has important implications for atmospheric CO<sub>2</sub> concentrations and thus climate. However, we currently have limited knowledge of what sets the rate of organic matter cycling (Dittmar et al., <xref ref-type="bibr" rid="B12">2021</xref>; Zakem et al., <xref ref-type="bibr" rid="B91">2021</xref>) and how these rates vary as a function of microbial community composition.</p>
<p>Global ecological models, which are used to study large-scale carbon cycling, typically consider the impact of microbial heterotrophy to be a constant or a bulk approximation acting on a generic organic carbon pool (Aumont and Bopp, <xref ref-type="bibr" rid="B2">2006</xref>; S&#x000E9;f&#x000E9;rian et al., <xref ref-type="bibr" rid="B67">2013</xref>). Thus, these models are unable to capture variations in rates of biogeochemical cycling driven by dynamic and diverse microbial communities. This is partially due to the lack of a tractable framework for explicitly modeling complex heterotrophic microbial communities, their biogeochemical function, and how these functions vary both temporally and spatially. Such a framework requires an understanding of organismal-level metabolic potential (i.e., which metabolic pathways co-occur within individual cells) and how microbes are assembled to form communities. While such a framework exists for phytoplankton (Quere et al., <xref ref-type="bibr" rid="B58">2005</xref>; Raitsos et al., <xref ref-type="bibr" rid="B59">2008</xref>), we lack a similar framework for defining meaningful heterotrophic functional types or metabolic functional guilds. Metabolic functional guilds are defined here as groups of organisms that are capable of the same biogeochemical or ecological function (e.g., nitrogen fixation or chitin degradation) in an ecosystem.</p>
<p>Microbial communities have primarily been characterized using the amplification of marker genes (e.g., 16S small subunit rRNA gene). Analysis of functional diversity has either relied upon &#x02018;omics analyses (Venter, <xref ref-type="bibr" rid="B83">2004</xref>; Yooseph et al., <xref ref-type="bibr" rid="B90">2007</xref>; Larkin et al., <xref ref-type="bibr" rid="B33">2021</xref>; Ustick et al., <xref ref-type="bibr" rid="B82">2021</xref>) or closest cultured representatives (Staley et al., <xref ref-type="bibr" rid="B71">2014</xref>; Hornick and Buschmann, <xref ref-type="bibr" rid="B21">2018</xref>; Roth Rosenberg et al., <xref ref-type="bibr" rid="B64">2021</xref>). The former provides an account of which genes are present but does not provide insight into which functions are co-occurring within individual organisms. The latter extends phylogenetic analyses to gain insight into function by using genomic data from the closest cultured representative via tools such as PICRUSt or Tax4Fun2 (Langille et al., <xref ref-type="bibr" rid="B32">2013</xref>; Wemheuer et al., <xref ref-type="bibr" rid="B84">2020</xref>). While this provides insights into the metabolic potential of the community, it relies on having a cultured representative, whereas the vast majority of organisms in the ocean do not have such representatives (Sogin et al., <xref ref-type="bibr" rid="B69">2006</xref>; Parks et al., <xref ref-type="bibr" rid="B54">2017</xref>). In addition, the cultured representative approach relies on the assumption that biogeochemically relevant functions are highly phylogenetically conserved, which may not always hold due to high rates of horizontal gene transfer (McDaniel et al., <xref ref-type="bibr" rid="B43">2010</xref>). 
Several experimental and observational studies have demonstrated that function and phylogeny are often decoupled in a variety of environments (Louca et al., <xref ref-type="bibr" rid="B39">2016</xref>, <xref ref-type="bibr" rid="B38">2017</xref>, <xref ref-type="bibr" rid="B40">2018</xref>; Tully et al., <xref ref-type="bibr" rid="B80">2018a</xref>). Pangenomics has revealed microdiversity within individual species that results in genetically distinct species sub-groups or sub-clades (Delmont and Eren, <xref ref-type="bibr" rid="B11">2018</xref>) further complicating the link between function and phylogeny.</p>
<p>Recent advances in bioinformatic techniques have allowed for the high throughput assembly of organismal genomes from metagenomes, termed metagenome assembled genomes (MAGs) (Strous et al., <xref ref-type="bibr" rid="B74">2012</xref>; Imelfort et al., <xref ref-type="bibr" rid="B23">2014</xref>; MetaHIT Consortium et al., <xref ref-type="bibr" rid="B44">2014</xref>; Kang et al., <xref ref-type="bibr" rid="B26">2015</xref>, <xref ref-type="bibr" rid="B27">2019</xref>; Lu et al., <xref ref-type="bibr" rid="B41">2016</xref>; Wu et al., <xref ref-type="bibr" rid="B87">2016</xref>; Graham et al., <xref ref-type="bibr" rid="B19">2017</xref>). In addition, microfluidics techniques have enabled the sequencing of single cells [single-cell amplified genomes (SAGs)] (Stepanauskas and Sieracki, <xref ref-type="bibr" rid="B73">2007</xref>; Swan et al., <xref ref-type="bibr" rid="B77">2011</xref>, <xref ref-type="bibr" rid="B76">2013</xref>; Martinez-Garcia et al., <xref ref-type="bibr" rid="B42">2012</xref>; Pachiadaki et al., <xref ref-type="bibr" rid="B50">2019</xref>; Sieracki et al., <xref ref-type="bibr" rid="B68">2019</xref>). Combined, these innovations have led to large datasets of publicly available annotated MAGs and SAGs (Klemetsen et al., <xref ref-type="bibr" rid="B29">2018</xref>; Pachiadaki et al., <xref ref-type="bibr" rid="B50">2019</xref>; Paoli et al., <xref ref-type="bibr" rid="B51">2021</xref>), thus significantly increasing our knowledge of microbial diversity. 
Most notable is the <italic>Tara</italic> Oceans circumnavigation expedition (Sunagawa et al., <xref ref-type="bibr" rid="B75">2015</xref>), which collected metagenomes from a global set of sampling stations that have been subsequently assembled into thousands of MAGs (Lombard et al., <xref ref-type="bibr" rid="B37">2014</xref>; Baker et al., <xref ref-type="bibr" rid="B3">2015</xref>; Graham et al., <xref ref-type="bibr" rid="B20">2018</xref>; Rawlings et al., <xref ref-type="bibr" rid="B61">2018</xref>; Zhang et al., <xref ref-type="bibr" rid="B92">2018</xref>; Zhou et al., <xref ref-type="bibr" rid="B93">2019</xref>). These large, well-annotated datasets provide an unprecedented opportunity to assess co-occurring functions within a cell for uncultured organisms.</p>
<p>In this study, we present a new statistical approach for defining microbial metabolic functional guilds and show that the guilds we identify are specific and ecologically relevant. This approach also establishes a framework that can be used to generate new hypotheses for co-occurring functions. As our approach is agnostic to phylogeny with no <italic>a priori</italic> phylogenetic data provided, this framework provides an excellent tool for interrogating the metabolic potential of uncultured organisms. This study lays the foundation for defining microbial communities in terms of metabolic functional guilds that will allow us to better understand the role that dynamic microbes play in determining the rates of biogeochemical cycles.</p>
</sec>
<sec id="s2">
<title>2. Materials and methods</title>
<sec>
<title>2.1. Dataset</title>
<p>Three different sources of genomes were used for this analysis: MAGs, isolate genomes (i.e., from cultures), and SAGs. Specifically, we used 1,859 MAGs (Tully et al., <xref ref-type="bibr" rid="B81">2018b</xref>) assembled from the <italic>Tara</italic> Oceans metagenomes (Sunagawa et al., <xref ref-type="bibr" rid="B75">2015</xref>) using the BinSanity v0.2.6.1 technique and assembly pipeline (Graham et al., <xref ref-type="bibr" rid="B19">2017</xref>). Only bins that met the following minimum requirements were assigned as draft genomes and included as MAGs: &#x0003E;90% complete with &#x0003C;10% contamination, 80&#x02013;90% complete with &#x0003C;5% contamination, or 50&#x02013;80% complete with &#x0003C;2% contamination. These genomes can be found at NCBI under BioProject ID PRJNA391943. A total of 6,872 SAGs were obtained from the GORG-Tropics database (Pachiadaki et al., <xref ref-type="bibr" rid="B50">2019</xref>), which can be found at NCBI under BioProject ID PRJEB33281 and at Open Science Framework under DOI 10.17605/OSF.IO/PCWJ9. Only SAGs with at least 70% completeness were included in our analysis (<italic>N</italic> = 1,733). In addition, 967 isolate genomes and 980 genomes with unresolved provenance (i.e., unclear from the metadata whether MAGs or isolates) were obtained from MarDB (Klemetsen et al., <xref ref-type="bibr" rid="B29">2018</xref>) (<ext-link ext-link-type="uri" xlink:href="https://mmp.sfb.uit.no/databases/">https://mmp.sfb.uit.no/databases/</ext-link>) (accessed 31 May 2018). A composite genomic dataset was generated using the <italic>Tara</italic> Oceans MAGs, isolates, and MarDB genomes (<italic>N</italic> = 3,840). To compare and contrast the guilds derived from different methods of genome reconstruction, two additional datasets were used. 
The 1,859 known MAGs from the composite dataset were separated out into a second dataset, and the 1,733 high-quality SAGs from the GORG-Tropics database were separated out into a third dataset.</p>
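<p>The inclusion criteria described above can be expressed as a simple filter. The following is a minimal sketch in Python, not the pipeline used in this study. The function names, the example bins, and the convention that completeness and contamination are supplied as percentages are assumptions made for illustration, as is the choice to assign boundary values (e.g., exactly 90% completeness) to the lower tier.</p>
<preformat preformat-type="code"><![CDATA[
```python
def passes_mag_quality(completeness, contamination):
    """Return True if a bin meets the tiered draft-genome criteria.

    Both arguments are percentages (assumed convention). Boundary values
    such as exactly 90% completeness fall into the lower tier (assumption).
    """
    if completeness > 90:
        return contamination < 10
    if completeness >= 80:
        return contamination < 5
    if completeness >= 50:
        return contamination < 2
    return False


def passes_sag_quality(completeness):
    """Return True if a SAG has at least 70% estimated completeness."""
    return completeness >= 70


# Hypothetical bins: (name, completeness %, contamination %)
bins = [("bin_a", 95.0, 8.0), ("bin_b", 85.0, 6.0), ("bin_c", 60.0, 1.0)]
kept = [name for name, comp, cont in bins if passes_mag_quality(comp, cont)]
print(kept)
```
]]></preformat>
<p>In this toy example, the second bin is rejected because its contamination exceeds the 5% limit that applies to bins that are 80&#x02013;90% complete.</p>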
<p>Genomes from the composite and SAG datasets were classified with the GTDB taxonomy toolkit (GTDB-Tk) (Chaumeil et al., <xref ref-type="bibr" rid="B8">2022</xref>) using r207 of the Genome Taxonomy Database (Parks et al., <xref ref-type="bibr" rid="B52">2018</xref>). GTDB-Tk v2.1.0 utilized Prodigal v2.6.3 (Hyatt et al., <xref ref-type="bibr" rid="B22">2010</xref>) to predict genes on the 3,840 input genomes provided as FASTA nucleotide sequence files. The set of 120 bacterial and 53 archaeal target marker genes used in GTDB-Tk was identified with HMMER 3 v3.1b2 (Eddy, <xref ref-type="bibr" rid="B13">2011</xref>). Phylogenetic estimation was performed with FastTree2 v2.1.11 (Price et al., <xref ref-type="bibr" rid="B56">2010</xref>), and then FastANI v1.32 (Jain et al., <xref ref-type="bibr" rid="B25">2018</xref>) and Mash v2.3 (Ondov et al., <xref ref-type="bibr" rid="B49">2016</xref>) were used to confirm phylogenetic groups with ANI measures. Quality analysis of the genomes in both datasets was performed using CheckM v1.2.1 (Parks et al., <xref ref-type="bibr" rid="B53">2015</xref>). The average completeness for the composite dataset was 90.8% with an average contamination of 1.5%, and the average completeness for the SAG dataset was 80.6% with an average contamination of 0.15%. Phylogenomic trees were constructed for the full set of genomes using GToTree v1.7.05 (Lee, <xref ref-type="bibr" rid="B35">2019</xref>), as well as for the guilds shown in <xref ref-type="supplementary-material" rid="SM1">Supplementary Table 2</xref> using the taxonomic classifications from GTDB-Tk to annotate each tree. Similar to GTDB-Tk, GToTree utilized Prodigal v2.6.3 (Hyatt et al., <xref ref-type="bibr" rid="B22">2010</xref>) to predict functional genes for the 3,840 input genomes provided as FASTA sequence files. 
Target genes from the pre-built Archaea_and_Bacteria gene set (25 genes) were identified with HMMER 3 v3.3.2 (Eddy, <xref ref-type="bibr" rid="B13">2011</xref>), aligned with muscle v5.1 (Edgar, <xref ref-type="bibr" rid="B14">2021</xref>), trimmed with TrimAl v1.4 (Capella-Gutierrez et al., <xref ref-type="bibr" rid="B7">2009</xref>), and concatenated before phylogenetic estimation was performed using FastTree 2 v2.1.11 (Price et al., <xref ref-type="bibr" rid="B56">2010</xref>).</p>
<p>To further assess the phylogenetic diversity of the composite dataset, we also computed the average nucleotide identity (ANI) and average amino acid identity (AAI). ANI values were computed on the whole genomes using fastANI v1.33 (Jain et al., <xref ref-type="bibr" rid="B25">2018</xref>) while AAI values were computed using fastAAI v0.1.20 (<ext-link ext-link-type="uri" xlink:href="https://github.com/cruizperez/FastAAI">https://github.com/cruizperez/FastAAI</ext-link>). fastAAI also used Pyrodigal (Larralde, <xref ref-type="bibr" rid="B34">2022</xref>), a Python library binding to Prodigal (Hyatt et al., <xref ref-type="bibr" rid="B22">2010</xref>), to predict genes, as well as PyHMMER (Larralde, <xref ref-type="bibr" rid="B34">2022</xref>) to perform the alignments to fastAAI&#x00027;s single-copy protein (SCP) datasets. A full breakdown of this pipeline is presented in <xref ref-type="supplementary-material" rid="SM1">Supplementary material S1</xref>.</p>
<p>We selected 212 experimentally verified and well-characterized metabolic pathways from the KEGG database (Ogata et al., <xref ref-type="bibr" rid="B47">1999</xref>) (<xref ref-type="supplementary-material" rid="SM1">Supplementary Table 1</xref>). These functions were chosen due to their biogeochemical (e.g., nitrogen fixation and methanogenesis) and ecological (e.g., motility and chemotaxis) relevance. All genomes were then analyzed using KEGG-Decoder v0.6sbp and KEGG-Expander v0.5 (Graham et al., <xref ref-type="bibr" rid="B20">2018</xref>) to identify the presence or absence of the 212 pathways. KEGG-Decoder is informed by KEGG pathways/modules; however, specific steps and key biogeochemical reactions are broken down to reflect essential steps. Several different thresholds were used to determine whether a pathway was present in a given genome. KEGG-Decoder assumes that core metabolisms must be present for normal cellular functioning in most organisms, so a fragmentary core pathway is unlikely to be non-functional. For core metabolisms (e.g., glycolysis, gluconeogenesis, ATP synthase, etc.), a low threshold of 25% total gene presence was therefore used. Conversely, KEGG-Decoder does not make this assumption for complex or geochemically relevant pathways, and a higher threshold is implemented to ensure that it is tracking actual functionality rather than misannotation. For pathways that were either complex (e.g., multiple branching options), geochemically relevant (e.g., thiosulfate oxidation), or both (e.g., secretion pathways), a total gene presence between 50 and 75% was required. An intermediate threshold of 33&#x02013;40% total gene presence was used for simple pathways consisting of 3 to 4 genes. For &#x0201C;pathways&#x0201D; that possess only a single reaction, presence/absence was directly determined.</p>
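<p>The threshold logic described above can be illustrated with a short Python sketch. This is not KEGG-Decoder code and does not use its interface. The function <italic>pathway_present</italic>, the gene identifiers, and the choice to treat a pathway as present when the fraction of genes found is greater than or equal to the threshold are all assumptions made for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
def pathway_present(required_genes, genome_genes, threshold):
    """Call a pathway present when the fraction of its genes found in the
    genome meets the threshold (a fraction between 0 and 1)."""
    required = set(required_genes)
    if not required:
        raise ValueError("pathway must list at least one gene")
    found = len(required & set(genome_genes))
    return found / len(required) >= threshold


# Hypothetical gene identifiers, used only for illustration.
genome = {"g1", "g2", "g5", "g9"}
core_pathway = ["g1", "g2", "g3", "g4", "g6", "g7", "g8", "g10"]  # 2 of 8 found
complex_pathway = ["g5", "g9", "g11", "g12"]                      # 2 of 4 found
single_step = ["g13"]                                             # 0 of 1 found

calls = {
    "core (threshold 0.25)": pathway_present(core_pathway, genome, 0.25),
    "complex (threshold 0.75)": pathway_present(complex_pathway, genome, 0.75),
    "single reaction (threshold 1.0)": pathway_present(single_step, genome, 1.0),
}
print(calls)
```
]]></preformat>
<p>Here the core pathway is called present with only a quarter of its genes detected, whereas the complex pathway is called absent even though half of its genes are detected. Applying such calls to every genome and pathway yields the binary genome-by-function matrix used as input below.</p>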
<p>This large binary dataset was used as input for metabolic guild identification both using classical methods and our new Aspect Bernoulli (AB)-based method (<italic>see below</italic>). It is important to note that the AB method presented here is not restricted to this number of functions and can be extended to include as many functions or hypothetical proteins as the user desires. Furthermore, genome annotations can be performed in any manner the user desires so long as the resulting data matrix is binary. However, we emphasize that the choice of annotations is paramount in determining the types of metabolic signals the user can receive when running this method. This is a discovery-based dimension reduction method and as such can only directly identify patterns based on the data presented to it.</p>
</sec>
<sec>
<title>2.2. Classic methods</title>
<p>We tested several clustering and dimensionality reduction methods to attempt to identify microbial metabolic guilds including Non-metric Multidimensional Scaling (NMDS) (Kruskal, <xref ref-type="bibr" rid="B30">1964</xref>) of the functions and complete linkage hierarchical clustering of both the genomes and functions concurrently. NMDS was performed using the <italic>metaMDS</italic> function from the vegan package v2.6.4 (Oksanen et al., <xref ref-type="bibr" rid="B48">2019</xref>) in R v4.2.3 with two dimensions, Bray-Curtis dissimilarity (Bray and Curtis, <xref ref-type="bibr" rid="B6">1957</xref>) and a maximum of 50 iterations. We also analyzed our composite dataset using an agglomerative hierarchical clustering method using the <italic>clustergram</italic> function from the Statistics and Machine Learning toolbox v12.1 from MATLAB R2021a (The Math Works, <xref ref-type="bibr" rid="B78">2021</xref>). We applied these two statistical methods to our composite dataset of 3,840 genomes and assessed their ability to extract a low-dimensional structure of co-occurring functions in the form of guilds.</p>
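<p>For binary presence/absence data, the Bray-Curtis dissimilarity between two functions depends only on how many genomes carry one function, the other, or both. The sketch below computes it directly in Python for a small hypothetical matrix. It is meant only to illustrate the distance measure and is not the vegan or MATLAB implementation used in this study.</p>
<preformat preformat-type="code"><![CDATA[
```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("both vectors are all zero")
    return numerator / denominator


# Rows are hypothetical genomes, columns are hypothetical functions (1 = present).
Y = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
]

# Compare functions, so work with the columns of Y.
columns = [list(col) for col in zip(*Y)]
d01 = bray_curtis(columns[0], columns[1])
d12 = bray_curtis(columns[1], columns[2])
print(d01, d12)
```
]]></preformat>
<p>In this toy matrix the first function is present in every genome, so it sits at a moderate distance (one third) from the second function, while the second and third functions never co-occur and are maximally dissimilar (distance of one).</p>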
<p>Finally, we sought a method that could reduce our data to a lower number of dimensions with clear separation into clusters of functions that represent metabolic guilds. In particular, it was essential that the method could identify signals of metabolic guilds driven by relatively rare functions even in the presence of high-abundance functions such as core carbon metabolism or housekeeping genes. This was important because we expected many of these core metabolisms to strongly co-occur due to their essential nature, which could limit our ability to define more biogeochemically relevant metabolic functional guilds. We found that an augmented AB model best accommodated all of these requirements. We present this model and the underlying statistical method in the following section.</p>
</sec>
<sec>
<title>2.3. Aspect Bernoulli</title>
<p>We used the AB model (Bingham et al., <xref ref-type="bibr" rid="B4">2009</xref>) to perform a statistical matrix decomposition of our binary data matrix <italic>Y</italic>&#x02208;<italic>R</italic><sup><italic>G</italic> &#x000D7; <italic>F</italic></sup>. The AB model was selected as it is designed for sparse matrices of binary data. AB is similar to Latent Dirichlet Allocation (LDA), which has been applied to similar problems [e.g., topic modeling, population structure (Pritchard et al., <xref ref-type="bibr" rid="B57">2000</xref>; Blei, <xref ref-type="bibr" rid="B5">2003</xref>)]; unlike AB, however, LDA is not designed to handle binary data. The AB model assumes that each entry <italic>Y</italic><sub><italic>g, f</italic></sub> in the data matrix <italic>Y</italic> is a random Bernoulli realization of an underlying scalar probability <italic>V</italic><sub><italic>g, f</italic></sub>&#x02208;[0, 1]. Here, <italic>g</italic> denotes genome, and <italic>f</italic> denotes function. In other words, the AB method assumes that the observed pattern in the data is the result of a Bernoulli coin flip based on the probability of a specific function occurring in a specific genome. Thus, we can define another matrix {<italic>V</italic><sub><italic>gf</italic></sub>}<sub><italic>g</italic> &#x0003D; 1, &#x02026;, <italic>G</italic>, <italic>f</italic> &#x0003D; 1, &#x02026;, <italic>F</italic></sub> with the same dimensions as the data matrix that represents these underlying probabilities.</p>
<p>We then assume that this matrix of probabilities {<italic>V</italic><sub><italic>gf</italic></sub>}<sub><italic>g</italic> &#x0003D; 1, &#x02026;, <italic>G</italic>, <italic>f</italic> &#x0003D; 1, &#x02026;, <italic>F</italic></sub> can be defined as the product of two additional matrices &#x003B2; and &#x00393; such that</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>&#x00393;</mml:mtext></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mo>&#x000B7;</mml:mo></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>for each probability <italic>V</italic><sub><italic>gf</italic></sub> in the matrix. The &#x00393; and &#x003B2; matrices are of size <italic>G</italic> by <italic>K</italic> and <italic>K</italic> by <italic>F</italic>, respectively, where <italic>G</italic> is the total number of genomes in the dataset, <italic>F</italic> is the total number of functions, and <italic>K</italic> is the number of aspects. These two matrices allow us to identify <italic>K</italic> groups or aspects in our dataset (see <xref ref-type="boxed-text" rid="Box1">Box 1</xref> for definition). Aspects are distinct from guilds in that they are defined on the entire set of functions, rather than a co-occurring subset of functions (guilds). The term aspect is used to describe the direct output of the AB method. As we describe below, we can then define metabolic functional guilds based on the &#x003B2; matrix, which provides the probability that function <italic>f</italic> is present in a given genome if that genome is associated with the <italic>k</italic><sup><italic>th</italic></sup> aspect. In particular, if &#x003B2;<sub><italic>kf</italic></sub> is close to 1, then function <italic>f</italic> is highly associated with aspect <italic>k</italic>. The &#x00393; matrix quantifies how strongly each genome <italic>g</italic> is associated with the <italic>k</italic><sup><italic>th</italic></sup> aspect. Specifically, if &#x00393;<sub><italic>gk</italic></sub> is close to 1, then genome <italic>g</italic> is strongly associated with aspect <italic>k</italic>. &#x003B2; and &#x00393; are then optimized using an iterative Expectation Maximization (EM) algorithm as described in Bingham et al. (<xref ref-type="bibr" rid="B4">2009</xref>). For a detailed, rigorous description of the methods, please see <xref ref-type="supplementary-material" rid="SM1">Supplementary material 1.2</xref>.</p>
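<p>To make the decomposition concrete, the following Python sketch fits &#x00393; (one row per genome, one column per aspect) and &#x003B2; (one row per aspect, one column per function) to a small binary matrix using the textbook EM updates for a Bernoulli admixture model. It is a minimal illustration under stated assumptions, not the augmented model or the code used in this study. The function name, the random initialization, the fixed number of iterations, and the toy data are all hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
import random


def fit_aspect_bernoulli(Y, K, n_iter=50, seed=0):
    """Fit Gamma (G x K) and beta (K x F) to a binary matrix Y by EM.

    Returns (Gamma, beta, history), with one log-likelihood per iteration.
    """
    rng = random.Random(seed)
    G, F = len(Y), len(Y[0])
    eps = 1e-9  # keeps beta away from exactly 0 and 1

    # Random initialization: each row of Gamma sums to 1.
    Gamma = []
    for _ in range(G):
        row = [rng.random() + 0.1 for _ in range(K)]
        total = sum(row)
        Gamma.append([x / total for x in row])
    beta = [[rng.uniform(0.25, 0.75) for _ in range(F)] for _ in range(K)]

    history = []
    for _ in range(n_iter):
        new_gamma = [[0.0] * K for _ in range(G)]
        num = [[0.0] * F for _ in range(K)]  # responsibility-weighted presences
        den = [[0.0] * F for _ in range(K)]  # total responsibility
        loglik = 0.0
        for g in range(G):
            for f in range(F):
                # E-step: responsibility of each aspect for entry (g, f).
                joint = [
                    Gamma[g][k] * (beta[k][f] if Y[g][f] else 1.0 - beta[k][f])
                    for k in range(K)
                ]
                total = sum(joint)
                loglik += math.log(total)
                for k in range(K):
                    q = joint[k] / total
                    new_gamma[g][k] += q / F
                    num[k][f] += q * Y[g][f]
                    den[k][f] += q
        # M-step: closed-form updates from the accumulated responsibilities.
        Gamma = new_gamma
        for k in range(K):
            for f in range(F):
                ratio = num[k][f] / max(den[k][f], eps)
                beta[k][f] = min(1.0 - eps, max(eps, ratio))
        history.append(loglik)
    return Gamma, beta, history


# Toy data: 6 hypothetical genomes x 4 hypothetical functions in two blocks.
Y = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
Gamma, beta, history = fit_aspect_bernoulli(Y, K=2)
# Reconstructed probabilities V = Gamma * beta, as in Equation 1.
V = [[sum(Gamma[g][k] * beta[k][f] for k in range(2)) for f in range(4)]
     for g in range(6)]
print(round(history[0], 3), round(history[-1], 3))
```
]]></preformat>
<p>Each row of the fitted &#x00393; sums to one, and the log-likelihood recorded in <italic>history</italic> should not decrease between iterations, which is a useful sanity check on any EM implementation. In practice, several random restarts and a convergence criterion would likely be preferable to a fixed number of iterations.</p>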
<boxed-text id="Box1">
<label>Box 1</label>
<title>Terminology Box.</title>
<p><inline-graphic xlink:href="fmicb-14-1197329-i0001.tif"/></p>
</boxed-text>
<p>One key advantage of the AB method is that the use of the matrix of probabilities {<italic>V</italic><sub><italic>gf</italic></sub>}<sub><italic>g</italic> &#x0003D; 1, &#x02026;, <italic>G</italic>, <italic>f</italic> &#x0003D; 1, &#x02026;, <italic>F</italic></sub> allows the method to deal with inaccuracies in the data (e.g., false absences or presences) as detailed in the study by Bingham et al. (<xref ref-type="bibr" rid="B4">2009</xref>). Specifically, the AB method can accommodate instances where the presence or absence of a function in a genome is otherwise inconsistent with the main aspects associated with that genome.</p>
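<p>A small worked example, with hypothetical parameter values, may help to show why the probabilistic formulation tolerates such inconsistencies. Suppose a genome is mostly associated with one aspect and a function is highly probable in that aspect. The Python sketch below evaluates Equation 1 for that single entry and the log-likelihood of observing the function as present or absent.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math

gamma_g = [0.9, 0.1]    # hypothetical aspect weights for one genome
beta_f = [0.95, 0.05]   # hypothetical probabilities of one function in each aspect

# Equation 1 for a single entry: V_gf is the dot product of the two vectors.
v_gf = sum(w * b for w, b in zip(gamma_g, beta_f))

# Log-likelihood contribution of each possible observation of Y_gf.
loglik_present = math.log(v_gf)
loglik_absent = math.log(1.0 - v_gf)
print(round(v_gf, 3), round(loglik_present, 3), round(loglik_absent, 3))
```
]]></preformat>
<p>With these values <italic>V</italic><sub><italic>gf</italic></sub> is approximately 0.86, so an observed absence still has a probability of about 0.14. A false absence therefore lowers the likelihood by a finite amount rather than ruling out the association between the genome and the aspect.</p>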
</sec>
<sec>
<title>2.4. Scoring</title>
<p>In order to define metabolic functional guilds (see <xref ref-type="boxed-text" rid="Box1">Box 1</xref> for definition) from the AB model output, we needed a way to quantify the relative importance of functions within an aspect. To this end, we introduced a post-processing score to order the functions within each aspect such that two conditions were met: (1) functions that were strong indicators of membership in that aspect were highly scored (i.e., if that function was present in a genome, then it was likely that the aspect <italic>k</italic> was present); (2) genomes that were identified as being associated with the aspect <italic>k</italic> were likely to contain functions at the top of aspect <italic>k</italic>&#x00027;s list (i.e., if genome <italic>g</italic> was associated with the aspect <italic>k</italic>, it was likely to have function A which was at the top of aspect <italic>k</italic>&#x00027;s list). The functions that combined to define a metabolic functional guild could then be identified based on high-ranking functions in the aspect lists.</p>
<p>To meet the first condition, we posed the following question: having observed a function <italic>f</italic> to be present in a randomly chosen genome <italic>g</italic>, how likely was it that the function was present due to aspect <italic>k</italic>? We could quantify this likelihood by calculating</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mo>|</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Using Bayes&#x00027; rule, we computed the above conditional probability in terms of the AB parameters:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x00393;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mtext>&#x000A0;</mml:mtext><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Next, we identified the genomes that were most strongly associated with each aspect (i.e., having large &#x00393; values). We hereafter refer to this set of genomes <italic>A</italic><sub><italic>k</italic></sub>&#x02286;{1, &#x02026;, <italic>G</italic>} as aspect <italic>k</italic>&#x00027;s &#x0201C;probabilistic representatives.&#x0201D; We filtered {1, &#x022EF;&#x02009;, <italic>G</italic>} into <italic>K</italic> non-overlapping sets <italic>A</italic><sub>1</sub>, &#x022EF;&#x02009;, <italic>A</italic><sub><italic>K</italic></sub>, where each set <italic>A</italic><sub><italic>k</italic></sub> consisted of the genomes <italic>g</italic> whose vector &#x00393;<sub><italic>g</italic></sub> placed its highest value on aspect <italic>k</italic> and for which &#x00393;<sub><italic>g, k</italic></sub> &#x0003D; <italic>P</italic>(<italic>Z</italic><sub><italic>gfk</italic></sub> &#x0003D; 1) was large enough (specifically, &#x00393;<sub><italic>g, k</italic></sub>&#x0003E;2/<italic>K</italic>). This 2/<italic>K</italic> threshold ensured that we excluded genomes that had nearly uniform &#x00393; vectors. For our composite dataset, this threshold did not exclude any genomes.</p>
<p>From <italic>A</italic><sub><italic>k</italic></sub>, we calculated <italic>q</italic><sub><italic>fk</italic></sub>:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M4"><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mi>Y</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>F</mml:mi></mml:mfrac><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>F</mml:mi></mml:msubsup><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:msub><mml:mi>Y</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula>
<p>which is the ratio of the abundance of each function within <italic>A</italic><sub><italic>k</italic></sub> to the mean abundance across all functions within <italic>A</italic><sub><italic>k</italic></sub>. Finally, we multiplied the marginal probability <italic>r</italic><sub><italic>fk</italic></sub> (Equation 2) by the adjustment factor <italic>q</italic><sub><italic>fk</italic></sub> (Equation 4). This gave us the score metric <italic>s</italic><sub><italic>fk</italic></sub> that we used to identify our guilds:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In this score, <italic>q</italic><sub><italic>fk</italic></sub> upweights functions <italic>f</italic> that are more abundant among probabilistic representatives of aspect <italic>k</italic> than average (<xref ref-type="fig" rid="F1">Figure 1</xref>) and makes the score (Equation 5) more comparable across aspects. Since a function that is highly specific to aspect <italic>k</italic> is highly scored, top-scoring functions are attractive candidates for forming metabolic functional guilds from aspects. Next, we describe how to choose a small set of functions to form such guilds. The full algorithm for the AB procedure can be found in the extended methods (<xref ref-type="supplementary-material" rid="SM1">Supplementary material S1</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Abundances of functions within an example aspect&#x00027;s probabilistic representatives, <italic>A</italic><sub><italic>k</italic></sub>, compared to their score rank before (<italic>r</italic><sub><italic>fk</italic></sub>, cyan) and after (<italic>s</italic><sub><italic>fk</italic></sub>, orange) applying the score adjustment <italic>q</italic><sub><italic>fk</italic></sub> (step 2). After the adjustment, a large density of points is observed in the upper left quadrant, indicating that the highest-ranked functions using <italic>s</italic><sub><italic>fk</italic></sub> are also found within a large number of probabilistic representative genomes.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1197329-g0001.tif"/>
</fig>
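<p>As a rough illustration of Equations 2&#x02013;5, the sketch below computes <italic>r</italic><sub><italic>fk</italic></sub>, <italic>q</italic><sub><italic>fk</italic></sub>, and <italic>s</italic><sub><italic>fk</italic></sub> for a toy dataset. It assumes that <italic>V</italic><sub><italic>gf</italic></sub> is the model probability that function <italic>f</italic> is present in genome <italic>g</italic>, i.e., the sum over aspects of &#x00393;<sub><italic>gk</italic></sub>&#x003B2;<sub><italic>kf</italic></sub>. The function name <italic>guild_scores</italic> and all numerical values are hypothetical, and the sets of probabilistic representatives are supplied directly rather than derived.</p>
<preformat>
```python
def guild_scores(Y, gamma, beta, representatives):
    """Return s[k][f] = r_fk * q_fk for every aspect k and function f.

    Y:               G x F binary presence/absence matrix (list of lists)
    gamma:           G x K aspect weights per genome
    beta:            K x F function probabilities per aspect
    representatives: list of K sets of genome indices (the sets A_k)
    """
    n_genomes, n_functions, n_aspects = len(Y), len(Y[0]), len(beta)
    scores = []
    for k in range(n_aspects):
        # Equation 4: abundance of each function among the representatives
        # of aspect k, divided by the mean abundance over all functions.
        counts = [sum(Y[g][f] for g in representatives[k])
                  for f in range(n_functions)]
        mean_count = sum(counts) / n_functions
        row = []
        for f in range(n_functions):
            # Equations 2 and 3: average over genomes of
            # P(Z_gfk = 1 | Y_gf = 1) = gamma_gk * beta_kf / V_gf,
            # with V_gf assumed to be sum_j gamma_gj * beta_jf.
            r = 0.0
            for g in range(n_genomes):
                v = sum(gamma[g][j] * beta[j][f] for j in range(n_aspects))
                if v > 0:
                    r += gamma[g][k] * beta[k][f] / v
            r /= n_genomes
            q = counts[f] / mean_count if mean_count > 0 else 0.0
            row.append(r * q)  # Equation 5
        scores.append(row)
    return scores


# Toy example (illustrative values): G = 4 genomes, F = 3 functions, K = 2.
Y = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 1]]
gamma = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
beta = [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5]]
representatives = [{0, 1}, {2, 3}]

scores = guild_scores(Y, gamma, beta, representatives)
# Functions in decreasing score order for each aspect.
ranked = [sorted(range(len(row)), key=lambda f: row[f], reverse=True)
          for row in scores]
print(ranked)
```
</preformat>
<p>In this example, the function that is both specific to an aspect and common among its representatives ranks first in that aspect, while a function absent from the representatives receives a score of zero.</p>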
</sec>
<sec>
<title>2.5. Guild identification and mapback genomes</title>
<p>After identifying the probabilistic representatives <italic>A</italic><sub><italic>k</italic></sub> based on our pipeline, we further narrowed each aspect down to metabolic functional guilds <italic>F</italic><sub><italic>k</italic></sub> according to the scores <italic>s</italic><sub><italic>fk</italic></sub>. Then, we obtained the mapback genomes <italic>B</italic><sub><italic>k</italic></sub> (see <xref ref-type="boxed-text" rid="Box1">Box 1</xref>) for guild <italic>F</italic><sub><italic>k</italic></sub> as the set of genomes possessing all of the functions in <italic>F</italic><sub><italic>k</italic></sub>. We used two alternative approaches to identify the set of functions that comprise metabolic functional guilds: (1) using a fixed number of functions, five functions in this case (Option 1 in <xref ref-type="supplementary-material" rid="SM1">Supplementary material S1</xref>) or (2) requiring a minimum number of genomes in the dataset to be associated with a given guild (Option 2 in <xref ref-type="supplementary-material" rid="SM1">Supplementary material S1</xref>). The number of mapback genomes is an important criterion in our pipeline, as it quantifies how strongly the original data support the proposed metabolic functional guilds. For instance, if we found many mapback genomes for a fixed-size functional guild, we would be more confident in the validity of that guild.</p>
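<p>The two guild-definition options and the mapback step can be sketched as follows. This is only one plausible reading of the text: for the second option, we assume that functions are added in decreasing score order until adding the next function would reduce the number of mapback genomes below the minimum. The exact rule is given in Supplementary material S1, and the function names and toy data here are hypothetical.</p>
<preformat>
```python
def mapback_genomes(Y, functions):
    """Indices of genomes that possess every function in `functions`."""
    return {g for g, row in enumerate(Y)
            if all(row[f] == 1 for f in functions)}


def score_order(scores_k):
    """Function indices sorted by decreasing score within one aspect."""
    return sorted(range(len(scores_k)), key=lambda f: scores_k[f],
                  reverse=True)


def guild_fixed_size(scores_k, size=5):
    """Option 1: keep the `size` top-scoring functions."""
    return score_order(scores_k)[:size]


def guild_min_mapback(Y, scores_k, min_genomes=100):
    """Option 2 (assumed rule): grow the guild in score order while at
    least `min_genomes` genomes still possess all of its functions."""
    guild = []
    for f in score_order(scores_k):
        if len(mapback_genomes(Y, guild + [f])) >= min_genomes:
            guild.append(f)
        else:
            break
    return guild


# Toy example (illustrative values): 6 genomes, 4 functions, one aspect.
Y = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
]
scores_k = [0.9, 0.8, 0.1, 0.7]

guild_1 = guild_fixed_size(scores_k, size=2)
guild_2 = guild_min_mapback(Y, scores_k, min_genomes=3)
print(guild_1, sorted(mapback_genomes(Y, guild_1)))
print(guild_2, sorted(mapback_genomes(Y, guild_2)))
```
</preformat>
<p>The toy output shows the trade-off described above: enlarging the guild from two to three functions reduces the number of mapback genomes from four to three.</p>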
</sec>
<sec>
<title>2.6. Guild specificity</title>
<p>A key objective of the pipeline was to identify functions co-occurring within individual genomes that were meaningfully associated. Ideally, for a guild <italic>k</italic> containing functions A and B, the presence of function A in a genome would indicate both that the genome was a member of guild <italic>k</italic> and that the genome would also contain function B. To test the association between pairs of functions within our guilds, we calculated the confidence (Agrawal et al., <xref ref-type="bibr" rid="B1">1993</xref>) of seeing B given A (<italic>A</italic>&#x02192;<italic>B</italic>) as</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M6"><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>B</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>G</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>Y</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>Y</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>B</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>G</mml:mi></mml:msubsup><mml:mrow><mml:msub><mml:mi>Y</mml:mi><mml:mrow><mml:mi>g</mml:mi><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula>
<p>where A and B are functions from our dataset and <italic>Y</italic><sub><italic>gA</italic></sub> and <italic>Y</italic><sub><italic>gB</italic></sub> are the presence or absence of A and B in genome <italic>g</italic>. High confidence values suggested that the presence of function B was highly conserved with that of function A. We computed the forward and reverse confidence values for every pair of functions in the guilds identified from our data. Because of the way we defined mapback genomes, these confidence values were all 1 within our mapback genomes and ranged between 0 and 1 for our &#x02018;outgroup&#x00027; genomes (i.e., the rest of the dataset).</p>
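<p>Equation 6 can be written as a short function. The sketch below also restricts the calculation to a subset of genomes, which is how the mapback and outgroup comparisons could be made; the optional <italic>genomes</italic> argument and the choice to return <italic>None</italic> when function A is absent from every genome are our assumptions, and the toy matrix is illustrative.</p>
<preformat>
```python
def confidence(Y, a, b, genomes=None):
    """Confidence of the rule a -> b (Equation 6).

    Y: binary presence/absence matrix (rows = genomes, columns = functions)
    genomes: optional iterable of row indices to restrict the calculation
    Returns None when function a is absent from all selected genomes.
    """
    rows = Y if genomes is None else [Y[g] for g in genomes]
    with_a = sum(row[a] for row in rows)
    if with_a == 0:
        return None
    both = sum(row[a] * row[b] for row in rows)
    return both / with_a


# Toy example (illustrative values): 5 genomes, 2 functions.
Y = [[1, 1], [1, 1], [1, 0], [1, 0], [0, 0]]

forward = confidence(Y, 0, 1)  # function 0 -> function 1
reverse = confidence(Y, 1, 0)  # function 1 -> function 0

# Mapback genomes hold both functions; the outgroup is everything else.
mapback = [g for g, row in enumerate(Y) if row[0] == 1 and row[1] == 1]
outgroup = [g for g in range(len(Y)) if g not in mapback]
inside = confidence(Y, 0, 1, mapback)
outside = confidence(Y, 0, 1, outgroup)
print(forward, reverse, inside, outside)
```
</preformat>
<p>The forward and reverse values differ (0.5 and 1.0 here) because the denominator changes with the direction of the rule, and the value within the mapback genomes is 1 by construction.</p>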
</sec>
<sec>
<title>2.7. Artificial datasets</title>
<p>The number of aspects, <italic>K</italic>, is a free parameter in the AB model that determines the maximum number of guilds that can be identified. The ideal choice of <italic>K</italic> is dataset specific and is a function of the underlying structure of the data matrix. To test the impact of this choice on the resulting guilds identified by our method, we constructed a large collection of synthetic datasets composed of either one or three artificial guilds appended to our original composite dataset of 3,840 genomes and 212 functions. These guilds were defined to be &#x0201C;perfect&#x0201D; guilds, meaning that genomes either had all the artificial guild functions or none of them. For example, an artificial guild with 5 functions and 2% total abundance in the dataset would have all 5 functions perfectly co-occurring in 77 genomes, while the remaining 3,763 genomes would not possess any of these artificial functions (all zeros). Guild parameters were drawn from three possible abundances (2%, 5%, or 10% of the genomes containing the artificial guild) and three possible sizes (guilds consisting of 5, 7, or 9 functions) with all unique combinations tested (<xref ref-type="supplementary-material" rid="SM1">Supplementary Table 2</xref>). Each artificial guild was inserted in a non-overlapping manner such that each genome could only belong to a maximum of one artificial guild. For each combination, we created 100 replicates of our synthetic data. Additional sensitivity analyses were conducted where we assigned guilds randomly, allowing some genomes to belong to multiple artificial guilds (<xref ref-type="supplementary-material" rid="SM1">Supplementary material S2</xref>).</p>
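<p>A minimal sketch of how non-overlapping &#x0201C;perfect&#x0201D; guilds could be appended to a presence/absence matrix is given below. It assumes that the number of member genomes is obtained by rounding the abundance times the number of genomes to the nearest integer, which reproduces the 77 genomes quoted above for 2% of 3,840. The function name, the seed handling, and the all-zero background matrix are placeholders rather than the procedure used in the study.</p>
<preformat>
```python
import random


def append_perfect_guilds(Y, n_guilds, size, abundance, seed=0):
    """Append `n_guilds` perfect artificial guilds as new columns of Y.

    Each guild adds `size` columns that are all 1 for its member genomes
    and all 0 elsewhere. Members are sampled without replacement, so no
    genome belongs to more than one artificial guild.
    Returns the augmented matrix and the list of member sets.
    """
    rng = random.Random(seed)
    n_genomes = len(Y)
    n_members = round(abundance * n_genomes)  # assumed rounding rule
    chosen = rng.sample(range(n_genomes), n_members * n_guilds)
    augmented = [list(row) for row in Y]
    members = []
    for i in range(n_guilds):
        group = set(chosen[i * n_members:(i + 1) * n_members])
        members.append(group)
        for g, row in enumerate(augmented):
            row.extend([1 if g in group else 0] * size)
    return augmented, members


# Placeholder background: 3,840 genomes with 3 all-zero functions
# (the real composite dataset has 212 functions).
background = [[0, 0, 0] for _ in range(3840)]
augmented, members = append_perfect_guilds(
    background, n_guilds=3, size=5, abundance=0.02, seed=1)
print([len(m) for m in members], len(augmented[0]))
```
</preformat>
<p>Sampling all member genomes in a single draw is one simple way to guarantee the non-overlapping insertion described above; the randomly assigned, overlapping variant would instead draw members independently for each guild.</p>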
</sec>
<sec>
<title>2.8. Data visualization</title>
<p>All data visualizations in MATLAB were performed using the Statistics and Machine Learning Toolbox v12.1 from MATLAB R2021a (The Math Works, <xref ref-type="bibr" rid="B78">2021</xref>). Data visualizations in R v4.2.3 were performed using the ggplot2 v3.4.2 and ggbreak v0.1.1 packages (Wickham, <xref ref-type="bibr" rid="B86">2009</xref>; Xu et al., <xref ref-type="bibr" rid="B88">2021</xref>), as well as the lattice v0.21.8 package (Sarkar, <xref ref-type="bibr" rid="B66">2008</xref>).</p>
</sec>
</sec>
<sec id="s3">
<title>3. Results</title>
<sec>
<title>3.1. Phylogeny of datasets</title>
<p>The phylogeny of our composite dataset of 3,840 genomes was assessed using GToTree and GTDB-Tk. From this large dataset, 65 genomes (60 archaeal and 5 bacterial) were excluded due to insufficient marker gene coverage. Another 39 genomes that were included in the tree were flagged during the quality assessment step for high redundancy estimates (an average of 16.7% redundancy) but were still highly complete (an average of 95.7% completeness). Of the 3,775 high-quality genomes, there were 3,529 bacterial genomes representing 51 unique bacterial phyla. Among these phyla were the key marine superphylum Proteobacteria (Yarza et al., <xref ref-type="bibr" rid="B89">2014</xref>) with 1,774 genomic representatives, as well as other notable phyla such as the Cyanobacteria (108 genomes), Bacteroidota (545 genomes), Firmicutes (111 genomes), Desulfobacterota (55 genomes), and the Verrucomicrobiota (91 genomes). In addition, there were 246 archaeal genomes representing 2 unique archaeal phyla, Thermoplasmatota and Thermoproteota. <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 1</xref> shows the full phylogenomic tree visualized in the iTOL web application (Letunic and Bork, <xref ref-type="bibr" rid="B36">2021</xref>), which is colored by individual bacterial phylum identity.</p>
<p>We passed our high-quality SAG dataset of 1,733 genomes through GToTree and GTDB-Tk and determined the phylogeny for 1,415 genomes (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 2</xref>). In total, 318 genomes (301 bacterial and 17 archaeal) were excluded for insufficient marker gene coverage, while three of the included genomes were flagged during the quality assessment step for high redundancy estimates (an average of 14% redundancy). Of the 1,415 high-quality genomes, there were 1,409 bacterial genomes representing 9 unique bacterial phyla and 6 archaeal genomes representing 2 unique archaeal phyla. Like the composite dataset, many of the bacterial genomes were classified in the phylum Proteobacteria (1,158 genomes). The next two largest phyla were Bacteroidota (103) and Cyanobacteria (83). Collectively, these three phyla accounted for 95.4% of all SAGs with an ascribed bacterial phylogeny.</p>
</sec>
<sec>
<title>3.2. Classic methods</title>
<p>We applied two classic statistical methods (NMDS and <italic>clustergram</italic>) to our dataset and assessed their ability to extract the low-dimensional structure of co-occurring functions in the form of guilds. The results of the NMDS are shown in <xref ref-type="fig" rid="F2">Figure 2</xref> where each point in the NMDS represents a function such that clusters of points could, potentially, indicate guilds. No distinct features emerge along either axis. The majority of data points group into a dense cloud of points with no clear separation along an axis of variance. While approaches for analyzing variance in reduced dimensions, such as NMDS, can be powerful for identifying clusters of similarly acting samples, NMDS was unable to identify clusters that could be interpreted as metabolic guilds when applied to our dataset.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Results of the NMDS run on the composite dataset. Points plotted are the loadings of the functions in the dataset on MDS axes 1 and 2. Points are semi-transparent to emphasize points that overlap one another. The NMDS algorithm did not reach convergence with a minimum stress value of 0.211.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1197329-g0002.tif"/>
</fig>
<p>Next, we present results using a standard clustering approach, namely hierarchical clustering, as implemented by <italic>clustergram</italic>. Here, we clustered both the genomes and functions (rows and columns) using the Jaccard distance metric with complete linkage and two different cut heights, 0.9 and 1 (<xref ref-type="fig" rid="F3">Figure 3</xref>). We selected the Jaccard distance for <italic>clustergram</italic> because of the binary format of our data. However, unlike the AB method, Jaccard treats all presences/absences equally and thus does not provide differential weights for rare vs. highly abundant functions. We chose to use cut heights of 0.9 and 1 based on the resulting dendrograms as they produced clusters among both rare and high abundance functions. At lower cut heights, we found that a large bulk of the functions clustered out as singletons, and the clusters that formed were primarily the core, high abundance functions. Thus, we considered that 0.9 and 1 were good values for comparing the microbial metabolic functional guilds identified by <italic>clustergram</italic> and AB.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Resulting clustergram plot on the presence/absence pathway data for our composite dataset (red = present, black = absent) using a cut height of 0.9 with rows (genomes) and columns (functions) clustered based on Jaccard distance.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1197329-g0003.tif"/>
</fig>
<p>Applying <italic>clustergram</italic> to our data with a cut height of 0.9 yielded 30 distinct clusters of functions that we interpreted as potential metabolic guilds (<xref ref-type="fig" rid="F3">Figure 3</xref>). These clusters averaged 5.8 functions (ranging from 2 to 42 functions) and 38.8 mapback genomes (ranging from 3 to 354 genomes). Approximately 20% of the total functions (<italic>N</italic> = 42) were in a single guild of highly abundant core functions. We also tested <italic>clustergram</italic> with a cut height of 1, which produced 17 distinct clusters of functions. The average number of functions in a cluster increased to 11.1 (ranging from 2 to 66 functions), but the number of mapback genomes dropped sharply to an average of just 3.2 (ranging from 0 to 17 genomes) per guild. Seven of these guilds had no mapback genomes, and the two largest guilds alone accounted for 46.7% of the total data used for this clustering procedure.</p>
<p>We identified several disadvantages of the classic statistical methods. First, large numbers of core metabolisms found in many genomes (such as housekeeping genes, core carbon metabolism, etc.) formed huge guilds with few mapback genomes, which were therefore not informative as metabolic guilds (see <xref ref-type="fig" rid="F3">Figure 3</xref>). Second, these methods do not permit functions to be part of more than one guild, which is inconsistent with the high functional redundancy that has been demonstrated in microbial communities (Louca et al., <xref ref-type="bibr" rid="B39">2016</xref>, <xref ref-type="bibr" rid="B38">2017</xref>, <xref ref-type="bibr" rid="B40">2018</xref>; Tully et al., <xref ref-type="bibr" rid="B80">2018a</xref>). Finally, these methods do not provide an intrinsic ranking of the importance of each function for defining a guild, e.g., the functions that are strong indicators of membership in the guild. In the following section, we compare the guilds from <italic>clustergram</italic> to those of the AB model and demonstrate that both methods identify similar guilds but that <italic>clustergram</italic> both breaks the AB guilds up into smaller groups (fewer functions) and results in guilds with fewer mapback genomes. Thus, the AB method can better capture metabolic functional guilds that contain a meaningful number of functions (&#x0003E;3) with substantial numbers of mapback genomes.</p>
</sec>
<sec>
<title>3.3. AB model</title>
<p>In the following sections, we present an assessment of the robustness of the AB model for detecting guilds, a summary of the AB model guilds from the composite dataset, and then a comparison between the AB model and the classic methods.</p>
<sec>
<title>3.3.1. Choosing a value for K</title>
<p>The AB model requires the user to define <italic>K</italic> prior to running the algorithm. To test the impact of the choice of <italic>K</italic> on the ability to detect different-sized guilds (i.e., numbers of functions) and guilds with different abundances in the dataset (i.e., frequency), we ran the artificial datasets through the method with a wide range of <italic>K</italic> values (<italic>K</italic> &#x0003D; 5, &#x022EF;&#x02009;, 20). This analysis (described in <xref ref-type="supplementary-material" rid="SM1">Supplementary material S2</xref> and summarized below) identified a clear trade-off between using low <italic>K</italic> values, which inhibited the detection of low abundance guilds, and using high <italic>K</italic> values, which overfitted the dataset. The values that qualify as &#x0201C;low&#x0201D; vs. &#x0201C;high&#x0201D; <italic>K</italic> values will be specific to the dataset. The analysis described below allows the user to identify a range of reasonable <italic>K</italic> values for a given dataset and the type of guilds (e.g., abundance and size) that are being targeted in the analysis. For this study, we manually assessed guilds derived from <italic>K</italic> values within the identified range in order to select our final value of <italic>K</italic> (<italic>K</italic> &#x0003D; 10). We recommend that a similar analysis be performed prior to applying this method to a new dataset.</p>
<p>We quantified the ability of our method to identify artificial guilds in our artificial datasets (see Section 2) over a range of <italic>K</italic> values using two metrics: hit rate and extra hits. The hit rate describes the overall frequency with which we identified our artificial guilds. In the ideal case, we would observe all of an artificial guild&#x00027;s functions present at the top of the score-ordered function list (top 15) in exactly one aspect. Thus, for a simulation using three distinct artificial guilds, we would expect to see three hits per simulated dataset (i.e., each guild showing up at the top of only one aspect list), which would give us a 100% hit rate, or a hit rate frequency of 1. Extra hits catalog instances where we observed an artificial guild occurring at the top of more than one aspect list, i.e., an artificial guild being divided across two aspects.</p>
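<p>The two metrics can be sketched as follows, under one possible reading of the definitions above: a guild counts as a hit if all of its functions appear within the top <italic>n</italic> positions of at least one aspect&#x00027;s score-ordered list, and every additional aspect that also contains the complete guild counts as an extra hit. Partial splits of a guild across aspects are not handled here. The function names and the toy rankings are hypothetical.</p>
<preformat>
```python
def count_hits(ranked_functions, guild, top_n=15):
    """Number of aspects whose top_n functions include the whole guild.

    ranked_functions: one list per aspect of function indices in
                      decreasing score order
    guild:            iterable of artificial guild function indices
    """
    return sum(1 for order in ranked_functions
               if set(guild).issubset(order[:top_n]))


def hit_rate_and_extra_hits(simulations, top_n=15):
    """Summarize hits over many simulated datasets.

    simulations: list of (ranked_functions, guilds) pairs, one per dataset
    Returns (fraction of guilds recovered at least once,
             total number of extra hits).
    """
    total_guilds = 0
    hits = 0
    extra_hits = 0
    for ranked_functions, guilds in simulations:
        for guild in guilds:
            n = count_hits(ranked_functions, guild, top_n)
            total_guilds += 1
            if n >= 1:
                hits += 1
                extra_hits += n - 1  # same guild atop more than one aspect
    return hits / total_guilds, extra_hits


# Toy example (illustrative): one simulated dataset, three aspects,
# three artificial guilds, and a top-3 window instead of the top 15.
ranked = [[0, 1, 2, 9], [3, 4, 5, 9], [0, 1, 2, 8]]
guilds = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
rate, extra = hit_rate_and_extra_hits([(ranked, guilds)], top_n=3)
print(rate, extra)
```
</preformat>
<p>In the toy case, two of the three guilds are recovered (hit rate of about 0.67), and the first guild appears at the top of two aspects, giving one extra hit.</p>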
<p>The size of the guild and abundance of the guild in the dataset impacted the ability of the method to identify artificial guilds at different <italic>K</italic> values (<xref ref-type="fig" rid="F4">Figure 4</xref>). As guild size and abundance in the dataset increased, the hit rate at low <italic>K</italic> values increased to 1. In other words, it was easier to identify larger and more abundant guilds, as one might expect. When <italic>K</italic> was low, extra hits were zero. As we increased the value of <italic>K</italic>, the hit rate remained high, but we started to see extra hits. When guilds were large and/or abundant, extra hits increased more quickly and at lower values of <italic>K</italic> than for smaller and less abundant guilds. This analysis demonstrated that when the choice of <italic>K</italic> was too small, only the largest and most abundant guilds were identified (under-fitting system). On the other hand, if <italic>K</italic> was too large, guilds showed up in multiple aspects (over-fitting system). We concluded that a good range for <italic>K</italic> was around the point where the hit rate was maximized while extra hits remained zero. A full analysis of the impact of guild size, guild abundance, and <italic>K</italic> value on guild identification, as well as the impact of randomly inserting guilds and the number of artificial guilds inserted, is presented in <xref ref-type="supplementary-material" rid="SM1">Supplementary material S2</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Hit rate and the number of extra hits for 100 simulated datasets with three artificial guilds inserted in a non-overlapping manner across a range of K values. Results are colored by the guild parameters where &#x00023;fn denotes the number of functions in each artificial guild. The red (&#x00023;fn = 5/Abundance = 0.02) vs. the green (&#x00023;fn = 5/Abundance = 0.1) lines illustrate the impact of a change in guild abundance. The impact of guild size on hit rate and extra hits is shown in <xref ref-type="supplementary-material" rid="SM1">Supplementary Table 2</xref>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1197329-g0004.tif"/>
</fig>
<p>We also tested various numbers of iterations for the expectation-maximization (EM) algorithm implemented as detailed by Bingham et al. (<xref ref-type="bibr" rid="B4">2009</xref>) to determine how quickly the model converged to a local maximum. For each iteration value (ranging from 10 to 1,500 steps), we initialized and ran 10 random restarts. For our chosen value of <italic>K</italic> &#x0003D; 10, the likelihood appeared to plateau at its maximum value after &#x0007E;500 iterations (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 12</xref>). We also assessed the stability of the AB results and showed that the identification of guilds was consistent across runs initialized with different random seeds (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 13</xref>).</p>
</sec>
<sec>
<title>3.3.2. Guild identification in the composite dataset</title>
<p>The AB method successfully identified guilds within the composite dataset that were found in a substantial number of genomes in the dataset and contained functions that were specific to that guild (see Section 2). When defined using the top 5 scoring functions (approach 1), the resulting guilds averaged 116.2 mapback genomes (ranging from 11 to 468 genomes). When guilds were defined to include functions co-occurring within at least 100 genomes (approach 2), the average guild size was 5.7 functions per guild (ranging from 2 to 20 functions). <xref ref-type="fig" rid="F5">Figure 5</xref> shows the number of mapback genomes present in the dataset as the number of functions defining each guild is increased from 2 to 20.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>The number of genomes that possess all of the functions in a guild (mapback genomes) as guild size is expanded to include more functions in decreasing score order (starting at size 2).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1197329-g0005.tif"/>
</fig>
<p>Both approaches for defining guilds resulted in guilds composed of functions that were specific to that guild. When looking at the co-occurrence of each pair of functions from the guild set of functions (guild function pairs), low confidence values were observed in the outgroup genomes for each guild function pair as compared to the value of 1 for the guild function pairs in the mapback genomes (by definition). Guilds identified using approach 1 (top 5 scoring functions) had a 0.455 average confidence value in the outgroup genomes. However, many pairs of functions were substantially less conserved in the outgroup genomes (i.e., these pairs were strongly indicative of membership in the guild). To quantify this, we examined the minimum outgroup confidence value across all pairs of functions in each guild (i.e., the two functions that most strongly indicated membership in the guild). For approach 1, the average across all 10 guilds (<italic>K</italic> &#x0003D; 10) of the minimum confidence values was 0.09 (ranging from 0.029 to 0.132). In other words, functions <italic>A</italic> and <italic>B</italic> in guild <italic>k</italic> were found together only &#x0007E;10% of the time in the non-mapback genomes and 100% of the time in the mapback genomes. Guilds defined using approach 2 (&#x02265;100 mapback genomes) had a 0.338 average confidence value in the outgroup genomes and a 0.029 (ranging from 0 to 0.105) average minimum confidence value. <xref ref-type="fig" rid="F6">Figure 6</xref> shows an example heatmap of both the forward and reverse confidence values for a putative DMSP guild. Low confidence values for the outgroup genomes confirm that this method identified functional co-occurrences that are specific only to a subset of genomes.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Specificity of guild function pairs for a guild related to the degradation of DMSP. Values are shown for the confidence of the guild function pairs in the outgroup genomes such that low values indicate high specificity of the guild function pairs for the DMSP guild. Note that the colorbar is scaled from 0 to 0.8. The diagonal is omitted since it is 1 by definition. The axes are non-symmetric because DmdA &#x02192; ddd&#x0002A; is fundamentally different from ddd&#x0002A; &#x02192; DmdA (see Equation 6).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1197329-g0006.tif"/>
</fig>
</sec>
</sec>
<sec>
<title>3.4. Comparison between the AB model and clustergram guilds</title>
<p>We compared the guild sizes and mapback genome numbers of the <italic>clustergram</italic> guilds to guilds generated using the AB method approaches 1 and 2. <xref ref-type="fig" rid="F7">Figure 7</xref> shows the distribution of guild sizes vs. the number of mapback genomes for each of these three methods. Based on our simulated data analysis described in Section 3.3, we determined that <italic>K</italic> &#x0003D; 10 was an appropriate number of guilds for the AB method. Overall, we found that the <italic>clustergram</italic> method identified more guilds with fewer functions and fewer mapback genomes than the AB method. Specifically, with a cut height of 0.9, <italic>clustergram</italic> identified three times as many guilds (<italic>N</italic> = 30) as the AB method (<italic>N</italic> = 10). Of these 30 <italic>clustergram</italic> guilds, the majority (60% of the guilds) possessed three or fewer functions, with 33.3% of the guilds constituting just a pair of functions. When we used the conservative criterion of at least 100 mapback genomes per guild (approach 2), the AB method generated a comparable proportion of guilds with 3 or fewer functions (50% of the total guilds). However, the two methods differed substantially in the number of mapback genomes identified for each guild. <italic>Clustergram</italic> yielded guilds with an average of 38.8 mapback genomes per guild, substantially fewer than the two AB approaches, which averaged 116.2 and 142.9 mapback genomes for approaches 1 and 2, respectively. When we reduced the threshold for AB approach 2 to the <italic>clustergram</italic> average of 39 mapback genomes per guild, we found just one guild with three or fewer functions (10% of the total guilds). To make a more direct comparison to the <italic>clustergram</italic> guilds, we re-ran the AB pipeline with <italic>K</italic> &#x0003D; 30. 
Allowing for a higher number of guilds in the AB method resulted in a similar number of mapback genomes per guild as the runs with K = 10 with an average of 113 mapback genomes (ranging from 0 to 1436) for approach 1 and with only one guild having no mapbacks. However, when <italic>K</italic> &#x0003D; 30, the AB method resulted in a high frequency of duplicate guilds, either fully duplicated or partially duplicated (see <xref ref-type="fig" rid="F4">Figure 4</xref> and Section 3.3.1).</p>
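<p>The summary statistics used in this comparison can be derived from a binary genome-by-function table. The Python sketch below assumes that a mapback genome is a genome containing every function in a guild; this definition is not restated in this section and is treated here as an assumption. The toy genomes, function labels, and the helper names <monospace>mapback_genomes</monospace> and <monospace>summarize</monospace> are hypothetical and are not part of the published pipeline.</p>
<preformat><![CDATA[
```python
from statistics import mean


def mapback_genomes(guild, genome_functions):
    """Return the genomes that contain every function in the guild.

    Assumption: a 'mapback genome' carries all guild functions.
    """
    required = set(guild)
    return sorted(g for g, funcs in genome_functions.items() if required <= funcs)


def summarize(guilds, genome_functions, small=3):
    """Fraction of guilds with at most `small` functions, and mean mapbacks per guild."""
    counts = [len(mapback_genomes(g, genome_functions)) for g in guilds]
    frac_small = sum(len(g) <= small for g in guilds) / len(guilds)
    return frac_small, mean(counts)


# Hypothetical toy data: genome -> set of annotated functions.
genome_functions = {
    "g1": {"flagellum", "chemotaxis", "type_II_secretion"},
    "g2": {"flagellum", "chemotaxis"},
    "g3": {"DMSP_lyase", "DMSP_demethylation"},
    "g4": {"DMSP_lyase", "DMSP_demethylation", "flagellum"},
}
# Hypothetical guilds: two small guilds and one larger guild with no mapbacks.
guilds = [
    ["flagellum", "chemotaxis"],
    ["DMSP_lyase", "DMSP_demethylation"],
    ["flagellum", "chemotaxis", "type_II_secretion", "DMSP_lyase"],
]
frac_small, mean_mapbacks = summarize(guilds, genome_functions)
```
]]></preformat>
<p>In this toy example, two of the three guilds have three or fewer functions, and the largest guild has no mapback genomes, which mirrors the trade-off between guild size and representation discussed above.</p>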
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>The distribution of guild sizes (number of functions) and the number of mapback genomes for guilds generated with clustergram at cut heights of 0.9 (blue square) and 1 (purple plus sign) as well as for AB. AB approach 1 (red circle) defined guilds using a fixed size of 5 functions while AB approach 2 (green triangle) defined guilds using a minimum mapback genome cut-off of 100. Points were jittered using the built-in position_jitter function in the ggplot2 package v3.4.2 with h = 0.1, w = 0.35 using the random seed 123.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmicb-14-1197329-g0007.tif"/>
</fig>
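<p>The two ways of turning a ranked aspect into a guild can be illustrated with a short Python sketch. This is a sketch under stated assumptions, not the authors' implementation: approach 1 is taken to keep the five highest-scoring functions, and approach 2 is taken to add functions in rank order for as long as the number of mapback genomes (assumed to be genomes containing all selected functions) stays at or above the cut-off. All function labels, genomes, and helper names are invented for illustration.</p>
<preformat><![CDATA[
```python
def count_mapbacks(functions, genome_functions):
    """Number of genomes containing every function in `functions` (assumed definition)."""
    required = set(functions)
    return sum(required <= funcs for funcs in genome_functions.values())


def guild_fixed_size(ranked, size=5):
    """Approach 1 (as assumed here): keep the `size` highest-ranked functions."""
    return ranked[:size]


def guild_min_mapbacks(ranked, genome_functions, min_mapbacks=100):
    """Approach 2 (as assumed here): extend the guild in rank order while the
    mapback count stays at or above `min_mapbacks`."""
    guild = []
    for function in ranked:
        if count_mapbacks(guild + [function], genome_functions) < min_mapbacks:
            break
        guild.append(function)
    return guild


# Hypothetical aspect: functions already sorted by descending score.
ranked = ["f1", "f2", "f3", "f4", "f5", "f6"]
# Hypothetical genomes: three carry f1-f3, two carry f1-f2 only, one carries everything.
genome_functions = {
    "a": {"f1", "f2", "f3"},
    "b": {"f1", "f2", "f3"},
    "c": {"f1", "f2", "f3"},
    "d": {"f1", "f2"},
    "e": {"f1", "f2"},
    "f": {"f1", "f2", "f3", "f4", "f5", "f6"},
}
approach1 = guild_fixed_size(ranked)                         # fixed size of five
approach2 = guild_min_mapbacks(ranked, genome_functions, 4)  # toy cut-off of 4
```
]]></preformat>
<p>With these toy data, the fixed-size guild keeps five functions but is found in a single genome, whereas the cut-off-based guild stops at three functions that are found together in four genomes.</p>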
<p>To test the impact of the cut height on guild size, we increased the <italic>clustergram</italic> cut height to 1 (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 8</xref>). This resulted in a more similar number of total guilds between the methods (17 for <italic>clustergram</italic> compared to 10 for AB). A cut height of 1 reduced the proportion of small <italic>clustergram</italic> guilds (three or fewer functions) to 41.2%. However, it further decreased the number of mapback genomes for each guild (an average of 3.2 genomes per guild, with some guilds having no mapbacks). At both cut heights, <italic>clustergram</italic> identified one large guild, with 42 functions at a cut height of 0.9 and 66 functions at a cut height of 1, corresponding to 19.8% and 31.1% of all functions in the dataset, respectively. This large guild was composed entirely of highly abundant functions and was substantially larger than the largest guild produced by AB approach 2 (28 functions using the lower threshold of 39 or more mapback genomes). Furthermore, the large <italic>clustergram</italic> guild had just 4 and 0 mapback genomes for cut heights of 0.9 and 1, respectively, while the 28-function AB guild had 61 mapback genomes. Finally, we tested a dynamic cut height method for clustering functions, which improved the guild sizes and number of mapback genomes relative to the static cut height but still resulted in guilds with fewer mapback genomes than the AB guilds (see <xref ref-type="supplementary-material" rid="SM1">Supplementary material 1.3</xref>).</p>
<p>We next assessed the differences in the guild functions identified by the two methods using AB approach 1, where guilds were defined with a fixed number of functions. We observed several recurring patterns. When using a cut height of 0.9 for <italic>clustergram</italic>, the five AB guild functions were typically split between two distinct <italic>clustergram</italic> guilds (ranging from 1 to 3 guilds), with only two of the ten AB guilds being contained within a single <italic>clustergram</italic> cluster. When we examined the <italic>clustergram</italic> guilds that contained the AB guild functions, we found that they averaged 52.8 mapback genomes compared to 116.2 for the corresponding AB guilds. This suggests that the AB method can identify groups of functions that are more commonly found together in the dataset.</p>
<p>Increasing the cut height to 1 resulted in fewer <italic>clustergram</italic> clusters and marginally reduced the fragmentation of AB guilds between <italic>clustergram</italic> guilds, with AB guilds now being split across 1.7 <italic>clustergram</italic> guilds on average (ranging from 1 to 3 guilds). At this cut height, the <italic>clustergram</italic> guilds that contained the AB guild functions had on average 30 additional functions (ranging from 5.5 to 61) and only 0.33 mapback genomes (ranging from 0 to 1), compared to the corresponding AB guilds, which had 116.2 mapback genomes on average (ranging from 11 to 468). In several instances (4 of 10), the AB guild functions clustered fully or partially into the large 66-function <italic>clustergram</italic> guild, which contained the highly abundant functions in the dataset and had no mapback genomes.</p>
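<p>The fragmentation comparison described above amounts to counting how many distinct <italic>clustergram</italic> guilds hold the functions of each AB guild, and how many other functions those host guilds contain. A minimal Python sketch, using invented function and cluster labels (none of these names come from the study), could look like this:</p>
<preformat><![CDATA[
```python
from statistics import mean


def fragmentation(ab_guild, cluster_of):
    """Number of distinct clusters that contain at least one AB guild function."""
    return len({cluster_of[f] for f in ab_guild})


def extra_functions(ab_guild, cluster_of):
    """Functions in the host clusters that are not part of the AB guild."""
    hosts = {cluster_of[f] for f in ab_guild}
    members = {f for f, c in cluster_of.items() if c in hosts}
    return sorted(members - set(ab_guild))


# Hypothetical hard clustering: each function belongs to exactly one cluster.
cluster_of = {
    "f1": "c1", "f2": "c1", "f3": "c2", "f4": "c2", "f5": "c2",
    "f6": "c3", "f7": "c3", "f8": "c1",
}
# Hypothetical AB guilds.
ab_guilds = {
    "A": ["f1", "f2", "f3"],
    "B": ["f6", "f7"],
}
splits = {name: fragmentation(g, cluster_of) for name, g in ab_guilds.items()}
mean_split = mean(splits.values())
extras_a = extra_functions(ab_guilds["A"], cluster_of)
```
]]></preformat>
<p>Here guild A is split across two clusters that together hold three additional functions, while guild B sits inside a single cluster.</p>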
<p>This analysis demonstrated that both the AB and clustering methods can identify functional guilds from our dataset and that there was overlap in the functions that were grouped together into guilds by the two methods. We showed that the AB guilds both contained more functions and were more highly represented in the dataset (had substantially more mapback genomes) than the guilds defined using the clustering method. As with any method, the AB method has both advantages and disadvantages. One disadvantage is the need to choose a value for the free parameter <italic>K</italic>, which determines the number of guilds identified (see discussion above in Section 3.3.1). However, we demonstrated how a user can use our pipeline to make an informed choice of <italic>K</italic>. Another key distinction between the two methods is that clustering methods precisely define the functions belonging to each guild, whereas the AB method provides information both about which functions are strong indicators of the guild and about which genomes have a high probability of membership in the guild. The user must then decide which set of functions to define as a guild. We provide two approaches for making this decision and highlight how the additional information generated by the AB method can be used to generate hypotheses (see discussion below in Section 4.1). Additional advantages of the AB method are that it does not require every function to be a member of a guild or restrict a function to a single guild, and that it can distinguish between false and true absences/presences in the dataset. Finally, it is important to note that if a guild identified by the AB method has mapback genomes, then the guild is by definition meaningful (i.e., found in the dataset). However, the absence of a guild from the model output does not imply that the guild does not exist. The AB method might fail to identify a guild for several reasons, including other structures in the data matrix that make rare guilds difficult to find, or the absence of a key annotation that is crucial for distinguishing the guild from the rest of the dataset.</p>
</sec>
</sec>
<sec id="s4">
<title>4. Discussion</title>
<sec>
<title>4.1. Emergent microbial metabolic guilds</title>
<p>Our approach identified several biogeochemically relevant metabolic functional guilds with numerous genome representatives in the composite dataset. It is important to note that these guilds emerged from this analysis without any curation or <italic>a priori</italic> knowledge. As such, the identification of known guilds (e.g., photosynthesis) is a strong indication that the method can detect biologically meaningful phenomena even when these associations are in low abundance in the dataset. In this study, we highlight three emergent guilds and draw connections to previously identified co-occurring biochemical processes. The other seven guilds identified by the method are also of significance (11&#x02013;235 mapback genomes) and are listed in <xref ref-type="supplementary-material" rid="SM1">Supplementary Table 4</xref>. For example, we identified a guild associated with phosphorus acquisition (C-P lyase genes, see Section 4.2) and several associated with different types of carbon metabolisms (see Guilds 8 and 9 in <xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 4</xref>). However, for succinctness, we describe in detail just three guilds that illustrate the power of the AB method.</p>
<p>The photosynthetic functions served as a good test case of our method. Our composite dataset was curated in such a way that photosystems I and II were only present in 2.5% (<italic>N</italic> = 95) and 2.7% (<italic>N</italic> = 105) of the genomes, respectively. However, our method was able to identify a photosynthesis guild with 10 total functions including photosystems I and II, NAD(P)H quinone oxidoreductase, cytochrome <italic>b</italic><sub>6</sub><italic>f</italic> complex, and RuBisCO (<xref ref-type="supplementary-material" rid="SM1">Supplementary Table 4</xref>). This 10-function guild had 12 mapback genomes in the composite dataset. We were also able to identify this photosynthetic guild in the SAG dataset where photosystems I and II have abundances of 6.3% and 5.8%, respectively. The identification of this well-characterized system provided an excellent &#x0201C;ground truth&#x0201D; validation of our method.</p>
<p>The approach identified a guild related to the consumption of the organic sulfur compound dimethylsulfoniopropionate (DMSP). This guild consisted of DMSP demethylation, DMSP lyase, and sulfite dehydrogenase (quinone), and had 139 mapback genomes. These three functions were the highest-ranked functions within a single aspect (<xref ref-type="table" rid="T1">Table 1</xref>). For this analysis, we assessed the presence of at least one of seven different DMSP lyases (DddL, DddQ, DddP, DddD, DddK, DddY, and DddW). DMSP lyase has been shown experimentally to co-occur with the enzyme DMSP demethylase (DmdA), which performs the demethylation reaction for DMSP (Reisch et al., <xref ref-type="bibr" rid="B62">2008</xref>, <xref ref-type="bibr" rid="B63">2011</xref>), though this association is not obligatory. These pathways have been characterized in abundant marine clades, such as Roseobacters (Moran et al., <xref ref-type="bibr" rid="B46">2007</xref>) and SAR11 (Tripp et al., <xref ref-type="bibr" rid="B79">2008</xref>). Sulfite dehydrogenase has also been implicated as a potential pathway through which DMSP-derived sulfur is oxidized from sulfite to sulfate (Reisch et al., <xref ref-type="bibr" rid="B63">2011</xref>).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Top 15 functions based on score (see Section 2) for two aspects related to DMSP degradation and motility.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>DMSP aspect</bold></th>
<th valign="top" align="left"><bold>Scores</bold></th>
<th valign="top" align="left"><bold>Motility aspect</bold></th>
<th valign="top" align="left"><bold>Scores</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>DMSP demethylation</bold></td>
<td valign="top" align="left">30.908</td>
<td valign="top" align="left"><bold>Type II Secretion</bold></td>
<td valign="top" align="left">20.603</td>
</tr> <tr>
<td valign="top" align="left"><bold>DMSP lyase (dddLQPDKW)</bold></td>
<td valign="top" align="left">29.901</td>
<td valign="top" align="left"><bold>Ubiquinol Cytochrome c reductase</bold></td>
<td valign="top" align="left">18.733</td>
</tr> <tr>
<td valign="top" align="left"><bold>Sulfite dehydrogenase (quinone)</bold></td>
<td valign="top" align="left">27.231</td>
<td valign="top" align="left"><bold>Cytochrome-c oxidase cbb3-type</bold></td>
<td valign="top" align="left">17.174</td>
</tr> <tr>
<td valign="top" align="left">Trimethylamine methyltransferase</td>
<td valign="top" align="left">22.441</td>
<td valign="top" align="left"><bold>Flagellum</bold></td>
<td valign="top" align="left">12.752</td>
</tr> <tr>
<td valign="top" align="left">Dimethylamine/trimethylamine dehydrogenase</td>
<td valign="top" align="left">17.902</td>
<td valign="top" align="left"><bold>Phospholipid SBP</bold></td>
<td valign="top" align="left">12.180</td>
</tr> <tr>
<td valign="top" align="left">Putative simple sugar SBP</td>
<td valign="top" align="left">16.735</td>
<td valign="top" align="left"><bold>Chemotaxis</bold></td>
<td valign="top" align="left">11.285</td>
</tr> <tr>
<td valign="top" align="left">Microcin C SBP</td>
<td valign="top" align="left">13.544</td>
<td valign="top" align="left"><bold>Glyoxylate shunt</bold></td>
<td valign="top" align="left">7.971</td>
</tr> <tr>
<td valign="top" align="left">Ubiquinol cytochrome c reductase</td>
<td valign="top" align="left">13.391</td>
<td valign="top" align="left">Thiamin biosynthesis</td>
<td valign="top" align="left">7.577</td>
</tr> <tr>
<td valign="top" align="left">Taurine SBP</td>
<td valign="top" align="left">13.029</td>
<td valign="top" align="left">Phosphate transporter</td>
<td valign="top" align="left">7.430</td>
</tr> <tr>
<td valign="top" align="left">Glycine betaine/proline SBP</td>
<td valign="top" align="left">12.989</td>
<td valign="top" align="left">Cytochrome bd complex</td>
<td valign="top" align="left">7.406</td>
</tr> <tr>
<td valign="top" align="left">General l-amino acid SBP</td>
<td valign="top" align="left">12.160</td>
<td valign="top" align="left">Type I Secretion</td>
<td valign="top" align="left">7.304</td>
</tr> <tr>
<td valign="top" align="left">Spermidine/putrescine SBP</td>
<td valign="top" align="left">11.625</td>
<td valign="top" align="left">Cationic peptide SBP</td>
<td valign="top" align="left">7.006</td>
</tr> <tr>
<td valign="top" align="left">Putative spermidine/putrescine SBP</td>
<td valign="top" align="left">11.493</td>
<td valign="top" align="left">Ammonia transporter</td>
<td valign="top" align="left">6.610</td>
</tr> <tr>
<td valign="top" align="left">Tungstate SBP</td>
<td valign="top" align="left">10.723</td>
<td valign="top" align="left">Sec/SRP</td>
<td valign="top" align="left">6.484</td>
</tr>
<tr>
<td valign="top" align="left"><italic><bold>Thiosulfate oxidation</bold></italic></td>
<td valign="top" align="left">10.663</td>
<td valign="top" align="left">TCA cycle</td>
<td valign="top" align="left">6.458</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Functions that constitute the resulting DMSP and motility guilds are highlighted in bold. Thiosulfate oxidation, which is discussed in the text as a possible additional DMSP guild function, is shown in bold and italics. SBP is the substrate-binding protein associated with the respective ABC transporter.</p>
</table-wrap-foot>
</table-wrap>
<p>The AB method suggests that there are several additional functions that might commonly co-occur with these three DMSP-related functions (<xref ref-type="table" rid="T1">Table 1</xref>). For example, taurine and glycine betaine transport, either into the cell to meet metabolic demands or out of the cell to excrete waste products, could be features of this guild. In fact, previous work suggests that many Roseobacters utilize a diverse suite of labile dissolved organic sulfur (DOS) metabolites to meet their sulfur requirements (Landa et al., <xref ref-type="bibr" rid="B31">2019</xref>). In a co-culture experiment with <italic>R. pomeroyi</italic> strain DSS-3 and two phytoplankton species, Landa et al. (<xref ref-type="bibr" rid="B31">2019</xref>) demonstrated enriched expression patterns of transport and catabolism genes for seven sulfur-rich phytoplankton exometabolites, including DMSP and taurine. These findings are consistent with the fact that both DMSP and taurine are produced in high concentrations by certain phytoplankton groups (Saltzman and Cooper, <xref ref-type="bibr" rid="B65">1989</xref>; Jackson et al., <xref ref-type="bibr" rid="B24">1992</xref>). The nitrogen-rich compatible solute glycine betaine is also produced by certain phytoplankton groups (Keller et al., <xref ref-type="bibr" rid="B28">1999</xref>) and has been implicated as a nitrogen source for Roseobacters (Moran et al., <xref ref-type="bibr" rid="B46">2007</xref>). Therefore, the co-occurrence of the capacity to use these substrates within a single organism is consistent with known ecological interactions and might indicate that organisms in the DMSP guild are associated with the phycosphere. Including taurine as a fourth function in the guild resulted in 100 mapback genomes; including glycine betaine as a fourth function resulted in 134 mapback genomes; and including both (a five-function guild) resulted in 98 mapback genomes.</p>
<p>Thiosulfate oxidation also appears among the top 15 ranked functions (rank 15). A previous experimental study has shown that this pathway is involved in DMSP degradation (Reisch et al., <xref ref-type="bibr" rid="B63">2011</xref>). In fact, when we included thiosulfate oxidation within the DMSP guild, we obtained a guild of four DMSP functions with 89 mapback genomes in the composite dataset, all co-occurring with a high degree of specificity (<xref ref-type="fig" rid="F6">Figure 6</xref>).</p>
<p>The last example guild was a large guild related to motile microbial lifestyles. The key functions in the motility guild were type II secretion, <italic>cbb</italic><sub>3</sub>-type cytochrome <italic>c</italic> oxidase, flagellum, chemotaxis, ubiquinol cytochrome <italic>c</italic> reductase, a phospholipid SBP, and the glyoxylate shunt, totaling seven guild functions with 385 mapback genomes (<xref ref-type="table" rid="T1">Table 1</xref>). These functions are all consistent with copiotrophic lifestyles in which organisms are motile and capable of responding to signals in the environment through chemotaxis. As with the DMSP guild, a key advantage of our approach is that it provides a list of functions that co-occur with classic &#x0201C;copiotrophic&#x0201D; functions (e.g., chemotaxis and flagellum) with high specificity to the guild mapback genomes. This allows us to develop hypotheses related to the ecological and biogeochemical roles played by this group. For this motility guild, type II secretion and the glyoxylate shunt co-occur with both chemotaxis and flagellum with a high degree of specificity (average outgroup confidence of 0.35).</p>
</sec>
<sec>
<title>4.2. MAG vs. SAG guild comparison</title>
<p>We ran both our MAG and SAG datasets through our method to investigate the differences in the guilds generated from these two datasets. These datasets not only used different methodologies but also sampled different oceanographic regions. The MAG dataset was composed of globally distributed samples, most notably 68 sampling sites from <italic>Tara</italic> Oceans (Sunagawa et al., <xref ref-type="bibr" rid="B75">2015</xref>) spanning all major oceanographic regions (except the Arctic Ocean) and three depths from the surface (5 m) to the mesopelagic zone (600 m). The SAG dataset, on the other hand, was obtained from samples primarily located in the North Atlantic and Pacific Oceans at a mean depth of 70.7 m and was prefiltered (Pachiadaki et al., <xref ref-type="bibr" rid="B50">2019</xref>). Thus, the expectation is that these datasets will yield different guilds because they sampled fundamentally different communities. Indeed, while guilds related to DMSP, the C-P lyase pathway, motility, and rhodopsins (<xref ref-type="supplementary-material" rid="SM1">Supplementary Table 5</xref>) were identified in the MAG dataset, the SAG dataset generated guilds primarily related to the uptake of substrates (<xref ref-type="supplementary-material" rid="SM1">Supplementary Table 6</xref>).</p>
<p>A guild associated with the acquisition of phosphorus was identified in both datasets. In the SAG dataset, this guild comprised four functions with 163 mapback genomes: the C-P lyase complex (PhnGHIJ), C-P lyase operon (PhnFKLMNOP), C-P lyase cleavage (PhnJ), and a phosphonate transporter (PhnCED). The C-P lyase pathway has been shown to break down a variety of phosphonate bonds, including phosphonates associated with semi-labile high molecular weight dissolved organic matter (Metcalf and Wanner, <xref ref-type="bibr" rid="B45">1993</xref>; White and Metcalf, <xref ref-type="bibr" rid="B85">2004</xref>; Sosa et al., <xref ref-type="bibr" rid="B70">2017</xref>). It is unsurprising to see the C-P lyase functions grouped together since they are co-located in a single operon. However, because our method does not take into account the co-location of genes within the genome, this guild served as another example that it can extract well-known functional co-occurrences. These four functions associated with the SAG phosphorus guild were also found together in one of the MAG guilds with 62 mapback genomes.</p>
<p>The guilds identified by our method were an emergent property of the dataset itself. This means that the absence of a known or potential guild in the model output does not necessarily mean that guild was not present in the dataset. Using a different collection of annotated genomes could potentially change the abundances of the functions within the dataset, which could greatly impact whether the method identified a specific group of functions as a guild or not. For example, we demonstrated that guilds with abundances of 2% or lower were difficult to consistently observe. Furthermore, as discussed above, <italic>K</italic> is a crucial free parameter that needs to be selected for each novel dataset to which this method is applied. We recommend constraining <italic>K</italic> using a similar heuristic approach to the one we described above or using other previously suggested methods such as the Akaike information criterion (deLeeuw, <xref ref-type="bibr" rid="B10">1992</xref>; Bingham et al., <xref ref-type="bibr" rid="B4">2009</xref>).</p>
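<p>As an illustration of the second option, the Akaike information criterion penalizes the maximized log-likelihood by the number of free parameters, AIC = 2<italic>p</italic> - 2 ln <italic>L</italic>, and the candidate <italic>K</italic> with the lowest value is preferred. The Python sketch below is generic: the log-likelihoods and parameter counts are made-up numbers, the helper names are hypothetical, and how these quantities would be obtained from a fitted AB model is not specified here.</p>
<preformat><![CDATA[
```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2 * p - 2 * ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def select_k(fits):
    """Return the K with the lowest AIC, together with all AIC scores.

    `fits` maps K -> (maximized log-likelihood, number of free parameters).
    """
    scores = {k: aic(ll, p) for k, (ll, p) in fits.items()}
    return min(scores, key=scores.get), scores


# Hypothetical fits for three candidate values of K (illustrative numbers only).
fits = {
    5: (-1200.0, 50),
    10: (-1100.0, 100),
    30: (-1080.0, 300),
}
best_k, scores = select_k(fits)
```
]]></preformat>
<p>In this invented example, the small gain in log-likelihood from the largest <italic>K</italic> does not offset the penalty for its additional parameters, so an intermediate value is selected.</p>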
</sec>
</sec>
<sec id="s5">
<title>5. Conclusion</title>
<p>The co-occurrence of metabolic functions has long been studied in the field of biochemistry where metabolic pathways are elucidated. However, these studies are typically very labor-intensive and require cultured representatives. This can present an issue since only a small fraction of marine microbes have been cultured (Rapp&#x000E9; and Giovannoni, <xref ref-type="bibr" rid="B60">2003</xref>; Steen et al., <xref ref-type="bibr" rid="B72">2019</xref>). Our method described in this study presents a way to generate hypotheses about co-occurring functions across large collections of genomes without relying on cultured representatives. These hypotheses might aid in future biochemical studies by providing targeted functions to test.</p>
<p>In addition to generating testable hypotheses, this method presents several potential future applications. One possibility is in assisting with genome annotation through the incorporation of hypothetical gene products that have not yet been functionally characterized. One recent study (Faure et al., <xref ref-type="bibr" rid="B16">2021</xref>) developed a large-scale sequence similarity network to identify protein functional clusters (PFCs) and demonstrated the potential for characterizing PFCs of previously unannotated proteins and correlating them with multiple environmental variables. Rather than focusing on whole community functional composition, our method identifies collections of ecologically relevant functions that are found to co-occur within assembled and isolate genomes. Using our method, one could construct a dataset composed of a mix of annotated and unannotated genes/proteins. Any mapback genomes identified for those hypothetical functions would be excellent culture candidates for characterizing that hypothetical gene. This method offers the potential to significantly refine the targeting of these culturing efforts to make them more nimble and more cost-effective.</p>
<p>Understanding microbial metabolic functional guilds is an essential step in describing microbial communities based on their metabolic activity, particularly for key heterotrophic communities. Rather than focusing on the functional composition of the entire community, our method identifies collections of co-occurring functions that form the building blocks of a community&#x00027;s functional structure. Defining the community in this way will allow us to develop improved numerical ecosystem models that capture these metabolic capabilities. In addition, it will help us to better build and validate models, such as the trait-based ecosystem model GENOME described by Coles et al. (<xref ref-type="bibr" rid="B9">2017</xref>), which directly simulated the metagenomes and metatranscriptomes of communities. Furthermore, because our approach is phylogenetically independent, it also provides the ability to disentangle analyses of function and phylogeny when assessing the structure of a given community. This provides a window into the level of functional redundancy present both within a single guild and across the community as a whole. Additionally, our approach generates hypotheses about potential co-occurring metabolic functions that can be tested experimentally. Finally, since we demonstrate that this approach works for both MAGs and SAGs, this method offers the ability to characterize the genomic potential of uncultured organisms from a wide range of studies.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. These data can be found here: <ext-link ext-link-type="uri" xlink:href="https://github.com/LevineLab/AB-guilds_model">https://github.com/LevineLab/AB-guilds_model</ext-link>; <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA391943">https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA391943</ext-link> (BioProject ID PRJNA391943); <ext-link ext-link-type="uri" xlink:href="https://mmp.sfb.uit.no/databases/">https://mmp.sfb.uit.no/databases/</ext-link>; <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/bioproject/572885">https://www.ncbi.nlm.nih.gov/bioproject/572885</ext-link> (BioProject ID PRJEB33281).</p>
</sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>RR and NL designed the project. RR, SH, NL, and JB developed and tested the AB model. RR and BT compiled the datasets and conducted the taxonomic classification, quality assessment, and phylogenetic analysis of the genomes. RR conducted the model simulations and guild analyses. All authors contributed to the writing of the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This study was supported by a grant from the Simons Collaboration on Principles of Microbial Ecosystems/PriME (Grant ID: 542387 to NL) and the Simons Collaboration on Computational Biogeochemical Modeling of Marine Ecosystems/CBIOMES (Grant ID: 549939 to JB).</p>
</sec>
<ack><p>The authors acknowledge the Center for Advanced Research Computing (CARC) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec sec-type="supplementary-material" id="s10">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fmicb.2023.1197329/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fmicb.2023.1197329/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Agrawal</surname> <given-names>R.</given-names></name> <name><surname>Imieli&#x00144;ski</surname> <given-names>T.</given-names></name> <name><surname>Swami</surname> <given-names>A.</given-names></name></person-group> (<year>1993</year>). <article-title>Mining association rules between sets of items in large databases,</article-title> in <source>Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data - SIGMOD &#x00027;93. Presented at the 1993 ACM SIGMOD International Conference</source>, <publisher-loc>ACM Press, Washington, D.C., United States,</publisher-loc> pp. 207&#x02013;216. <pub-id pub-id-type="doi">10.1145/170035.170072</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aumont</surname> <given-names>O.</given-names></name> <name><surname>Bopp</surname> <given-names>L.</given-names></name></person-group> (<year>2006</year>). <article-title>Globalizing results from ocean <italic>in situ</italic> iron fertilization studies: globalizing iron fertilization</article-title>. <source>Glob. Biogeochem. Cycles</source> <volume>20</volume>, <fpage>2591</fpage>. <pub-id pub-id-type="doi">10.1029/2005GB002591</pub-id><pub-id pub-id-type="pmid">16370118</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baker</surname> <given-names>B. J.</given-names></name> <name><surname>Lazar</surname> <given-names>C. S.</given-names></name> <name><surname>Teske</surname> <given-names>A. P.</given-names></name> <name><surname>Dick</surname> <given-names>G. J.</given-names></name></person-group> (<year>2015</year>). <article-title>Genomic resolution of linkages in carbon, nitrogen, and sulfur cycling among widespread estuary sediment bacteria</article-title>. <source>Microbiome</source> <volume>3</volume>, <fpage>14</fpage>. <pub-id pub-id-type="doi">10.1186/s40168-015-0077-6</pub-id><pub-id pub-id-type="pmid">25922666</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bingham</surname> <given-names>E.</given-names></name> <name><surname>Kab&#x000E1;n</surname> <given-names>A.</given-names></name> <name><surname>Fortelius</surname> <given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>The aspect Bernoulli model: multiple causes of presences and absences</article-title>. <source>Pattern Anal. Appl</source>. <volume>12</volume>, <fpage>55</fpage>&#x02013;<lpage>78</lpage>. <pub-id pub-id-type="doi">10.1007/s10044-007-0096-4</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blei</surname> <given-names>D. M.</given-names></name></person-group> (<year>2003</year>). <article-title>Latent Dirichlet allocation</article-title>. <source>J. Mach. Learn. Res</source>. <volume>30</volume>, <fpage>25</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="pmid">23520254</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bray</surname> <given-names>J. R.</given-names></name> <name><surname>Curtis</surname> <given-names>J. T.</given-names></name></person-group> (<year>1957</year>). <article-title>An ordination of the upland forest communities of Southern Wisconsin</article-title>. <source>Ecol. Monogr.</source> <volume>27</volume>, <fpage>325</fpage>&#x02013;<lpage>349</lpage>. <pub-id pub-id-type="doi">10.2307/1942268</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Capella-Gutierrez</surname> <given-names>S.</given-names></name> <name><surname>Silla-Martinez</surname> <given-names>J. M.</given-names></name> <name><surname>Gabaldon</surname> <given-names>T.</given-names></name></person-group> (<year>2009</year>). <article-title>trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses</article-title>. <source>Bioinformatics</source> <volume>25</volume>, <fpage>1972</fpage>&#x02013;<lpage>1973</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btp348</pub-id><pub-id pub-id-type="pmid">19505945</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chaumeil</surname> <given-names>P. A.</given-names></name> <name><surname>Mussig</surname> <given-names>A. J.</given-names></name> <name><surname>Hugenholtz</surname> <given-names>P.</given-names></name> <name><surname>Parks</surname> <given-names>D. H.</given-names></name></person-group> (<year>2022</year>). <article-title>GTDB-Tk v2: memory friendly classification with the genome taxonomy database</article-title>. <source>Bioinformatics</source> <volume>38</volume>, <fpage>5315</fpage>&#x02013;<lpage>5316</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btac672</pub-id><pub-id pub-id-type="pmid">36218463</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Coles</surname> <given-names>V. J.</given-names></name> <name><surname>Stukel</surname> <given-names>M. R.</given-names></name> <name><surname>Brooks</surname> <given-names>M. T.</given-names></name> <name><surname>Burd</surname> <given-names>A.</given-names></name> <name><surname>Crump</surname> <given-names>B. C.</given-names></name> <name><surname>Moran</surname> <given-names>M. A.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Ocean biogeochemistry modeled with emergent trait-based genomics</article-title>. <source>Science</source> <volume>358</volume>, <fpage>1149</fpage>&#x02013;<lpage>1154</lpage>. <pub-id pub-id-type="doi">10.1126/science.aan5712</pub-id><pub-id pub-id-type="pmid">29191900</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>deLeeuw</surname> <given-names>J.</given-names></name></person-group> (<year>1992</year>). <article-title>&#x0201C;Introduction to Akaike (1973) information theory and an extension of the maximum likelihood principle,&#x0201D;</article-title> in Breakthroughs in Statistics, Springer Series in Statistics, eds Kotz, S., Johnson, N.L. (<publisher-loc>Springer New York, New York, NY</publisher-loc>), pp. 599&#x02013;609. <pub-id pub-id-type="doi">10.1007/978-1-4612-0919-5_37</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Delmont</surname> <given-names>T. O.</given-names></name> <name><surname>Eren</surname> <given-names>A. M.</given-names></name></person-group> (<year>2018</year>). <article-title>Linking pangenomes and metagenomes: the <italic>Prochlorococcus</italic> metapangenome</article-title>. <source>PeerJ</source> <volume>6</volume>, <fpage>e4320</fpage>. <pub-id pub-id-type="doi">10.7717/peerj.4320</pub-id><pub-id pub-id-type="pmid">29423345</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dittmar</surname> <given-names>T.</given-names></name> <name><surname>Lennartz</surname> <given-names>S. T.</given-names></name> <name><surname>Buck-Wiese</surname> <given-names>H.</given-names></name> <name><surname>Hansell</surname> <given-names>D. A.</given-names></name> <name><surname>Santinelli</surname> <given-names>C.</given-names></name> <name><surname>Vanni</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Enigmatic persistence of dissolved organic matter in the ocean</article-title>. <source>Nat. Rev. Earth Environ</source>. <volume>2</volume>, <fpage>570</fpage>&#x02013;<lpage>583</lpage>. <pub-id pub-id-type="doi">10.1038/s43017-021-00183-7</pub-id><pub-id pub-id-type="pmid">36347804</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eddy</surname> <given-names>S. R.</given-names></name></person-group> (<year>2011</year>). <article-title>Accelerated profile HMM searches</article-title>. <source>PLoS Comput. Biol</source>. <volume>7</volume>, <fpage>e1002195</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1002195</pub-id><pub-id pub-id-type="pmid">22039361</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Edgar</surname> <given-names>R. C.</given-names></name></person-group> (<year>2021</year>). <article-title>High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny (preprint)</article-title>. <source>Bioinformatics</source> 3, 9169. <pub-id pub-id-type="doi">10.1101/2021.06.20.449169</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Falkowski</surname> <given-names>P. G.</given-names></name> <name><surname>Fenchel</surname> <given-names>T.</given-names></name> <name><surname>Delong</surname> <given-names>E. F.</given-names></name></person-group> (<year>2008</year>). <article-title>The microbial engines that drive Earth&#x00027;s biogeochemical cycles</article-title>. <source>Science</source> <volume>320</volume>, <fpage>1034</fpage>&#x02013;<lpage>1039</lpage>. <pub-id pub-id-type="doi">10.1126/science.1153213</pub-id><pub-id pub-id-type="pmid">18497287</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Faure</surname> <given-names>E.</given-names></name> <name><surname>Ayata</surname> <given-names>S.-D.</given-names></name> <name><surname>Bittner</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). <article-title>Towards omics-based predictions of planktonic functional composition from environmental data</article-title>. <source>Nat. Commun</source>. <volume>12</volume>, <fpage>4361</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-021-24547-1</pub-id><pub-id pub-id-type="pmid">34272373</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fuhrman</surname> <given-names>J. A.</given-names></name> <name><surname>Azam</surname> <given-names>F.</given-names></name></person-group> (<year>1980</year>). <article-title>Bacterioplankton secondary production estimates for coastal waters of British Columbia, Antarctica, and California</article-title>. <source>Appl. Environ. Microbiol</source>. <volume>39</volume>, <fpage>1085</fpage>&#x02013;<lpage>1095</lpage>. <pub-id pub-id-type="doi">10.1128/aem.39.6.1085-1095.1980</pub-id><pub-id pub-id-type="pmid">16345577</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fuhrman</surname> <given-names>J. A.</given-names></name> <name><surname>Azam</surname> <given-names>F.</given-names></name></person-group> (<year>1982</year>). <article-title>Thymidine incorporation as a measure of heterotrophic bacterioplankton production in marine surface waters: evaluation and field results</article-title>. <source>Mar. Biol.</source> <volume>66</volume>, <fpage>109</fpage>&#x02013;<lpage>120</lpage>. <pub-id pub-id-type="doi">10.1007/BF00397184</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Graham</surname> <given-names>E. D.</given-names></name> <name><surname>Heidelberg</surname> <given-names>J. F.</given-names></name> <name><surname>Tully</surname> <given-names>B. J.</given-names></name></person-group> (<year>2017</year>). <article-title>BinSanity: unsupervised clustering of environmental microbial assemblies using coverage and affinity propagation</article-title>. <source>PeerJ</source> <volume>5</volume>, <fpage>e3035</fpage>. <pub-id pub-id-type="doi">10.7717/peerj.3035</pub-id><pub-id pub-id-type="pmid">28289564</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Graham</surname> <given-names>E. D.</given-names></name> <name><surname>Heidelberg</surname> <given-names>J. F.</given-names></name> <name><surname>Tully</surname> <given-names>B. J.</given-names></name></person-group> (<year>2018</year>). <article-title>Potential for primary productivity in a globally-distributed bacterial phototroph</article-title>. <source>ISME J</source>. <volume>12</volume>, <fpage>1861</fpage>&#x02013;<lpage>1866</lpage>. <pub-id pub-id-type="doi">10.1038/s41396-018-0091-3</pub-id><pub-id pub-id-type="pmid">29523891</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hornick</surname> <given-names>K. M.</given-names></name> <name><surname>Buschmann</surname> <given-names>A. H.</given-names></name></person-group> (<year>2018</year>). <article-title>Insights into the diversity and metabolic function of bacterial communities in sediments from Chilean salmon aquaculture sites</article-title>. <source>Ann. Microbiol</source>. <volume>68</volume>, <fpage>63</fpage>&#x02013;<lpage>77</lpage>. <pub-id pub-id-type="doi">10.1007/s13213-017-1317-8</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hyatt</surname> <given-names>D.</given-names></name> <name><surname>Chen</surname> <given-names>G. L.</given-names></name> <name><surname>LoCascio</surname> <given-names>P. F.</given-names></name> <name><surname>Land</surname> <given-names>M. L.</given-names></name> <name><surname>Larimer</surname> <given-names>F. W.</given-names></name> <name><surname>Hauser</surname> <given-names>L. J.</given-names></name></person-group> (<year>2010</year>). <article-title>Prodigal: prokaryotic gene recognition and translation initiation site identification</article-title>. <source>BMC Bioinform.</source> <volume>11</volume>, <fpage>119</fpage>. <pub-id pub-id-type="doi">10.1186/1471-2105-11-119</pub-id><pub-id pub-id-type="pmid">20211023</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Imelfort</surname> <given-names>M.</given-names></name> <name><surname>Parks</surname> <given-names>D.</given-names></name> <name><surname>Woodcroft</surname> <given-names>B. J.</given-names></name> <name><surname>Dennis</surname> <given-names>P.</given-names></name> <name><surname>Hugenholtz</surname> <given-names>P.</given-names></name> <name><surname>Tyson</surname> <given-names>G. W.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>GroopM: an automated tool for the recovery of population genomes from related metagenomes</article-title>. <source>PeerJ</source> <volume>2</volume>, <fpage>e603</fpage>. <pub-id pub-id-type="doi">10.7717/peerj.603</pub-id><pub-id pub-id-type="pmid">25289188</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jackson</surname> <given-names>A. E.</given-names></name> <name><surname>Ayer</surname> <given-names>S. W.</given-names></name> <name><surname>Laycock</surname> <given-names>M. V.</given-names></name></person-group> (<year>1992</year>). <article-title>The effect of salinity on growth and amino acid composition in the marine diatom <italic>Nitzschia pungens</italic></article-title>. <source>Can. J. Bot</source>. <volume>70</volume>, <fpage>2198</fpage>&#x02013;<lpage>2201</lpage>. <pub-id pub-id-type="doi">10.1139/b92-272</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jain</surname> <given-names>C.</given-names></name> <name><surname>Rodriguez-R</surname> <given-names>L. M.</given-names></name> <name><surname>Phillippy</surname> <given-names>A. M.</given-names></name> <name><surname>Konstantinidis</surname> <given-names>K. T.</given-names></name> <name><surname>Aluru</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries</article-title>. <source>Nat. Commun</source>. <volume>9</volume>, <fpage>5114</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-018-07641-9</pub-id><pub-id pub-id-type="pmid">30504855</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kang</surname> <given-names>D. D.</given-names></name> <name><surname>Froula</surname> <given-names>J.</given-names></name> <name><surname>Egan</surname> <given-names>R.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name></person-group> (<year>2015</year>). <article-title>MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities</article-title>. <source>PeerJ</source> <volume>3</volume>, <fpage>e1165</fpage>. <pub-id pub-id-type="doi">10.7717/peerj.1165</pub-id><pub-id pub-id-type="pmid">26336640</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kang</surname> <given-names>D. D.</given-names></name> <name><surname>Li</surname> <given-names>F.</given-names></name> <name><surname>Kirton</surname> <given-names>E.</given-names></name> <name><surname>Thomas</surname> <given-names>A.</given-names></name> <name><surname>Egan</surname> <given-names>R.</given-names></name> <name><surname>An</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies</article-title>. <source>PeerJ</source> <volume>7</volume>, <fpage>e7359</fpage>. <pub-id pub-id-type="doi">10.7717/peerj.7359</pub-id><pub-id pub-id-type="pmid">31388474</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keller</surname> <given-names>M. D.</given-names></name> <name><surname>Kiene</surname> <given-names>R. P.</given-names></name> <name><surname>Matrai</surname> <given-names>P. A.</given-names></name> <name><surname>Bellows</surname> <given-names>W. K.</given-names></name></person-group> (<year>1999</year>). <article-title>Production of glycine betaine and dimethylsulfoniopropionate in marine phytoplankton. I. Batch cultures</article-title>. <source>Mar. Biol</source>. <volume>135</volume>, <fpage>237</fpage>&#x02013;<lpage>248</lpage>. <pub-id pub-id-type="doi">10.1007/s002270050621</pub-id><pub-id pub-id-type="pmid">22130520</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klemetsen</surname> <given-names>T.</given-names></name> <name><surname>Raknes</surname> <given-names>I. A.</given-names></name> <name><surname>Fu</surname> <given-names>J.</given-names></name> <name><surname>Agafonov</surname> <given-names>A.</given-names></name> <name><surname>Balasundaram</surname> <given-names>S. V.</given-names></name> <name><surname>Tartari</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>The MAR databases: development and implementation of databases specific for marine metagenomics</article-title>. <source>Nucleic Acids Res</source>. <volume>46</volume>, <fpage>D692</fpage>&#x02013;<lpage>D699</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkx1036</pub-id><pub-id pub-id-type="pmid">29106641</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kruskal</surname> <given-names>J. B.</given-names></name></person-group> (<year>1964</year>). <article-title>Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis</article-title>. <source>Psychometrika</source> <volume>29</volume>, <fpage>1</fpage>&#x02013;<lpage>27</lpage>. <pub-id pub-id-type="doi">10.1007/BF02289565</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Landa</surname> <given-names>M.</given-names></name> <name><surname>Burns</surname> <given-names>A. S.</given-names></name> <name><surname>Durham</surname> <given-names>B. P.</given-names></name> <name><surname>Esson</surname> <given-names>K.</given-names></name> <name><surname>Nowinski</surname> <given-names>B.</given-names></name> <name><surname>Sharma</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Sulfur metabolites that facilitate oceanic phytoplankton&#x02013;bacteria carbon flux</article-title>. <source>ISME J</source>. <volume>13</volume>, <fpage>2536</fpage>&#x02013;<lpage>2550</lpage>. <pub-id pub-id-type="doi">10.1038/s41396-019-0455-3</pub-id><pub-id pub-id-type="pmid">31227817</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Langille</surname> <given-names>M. G. I.</given-names></name> <name><surname>Zaneveld</surname> <given-names>J.</given-names></name> <name><surname>Caporaso</surname> <given-names>J. G.</given-names></name> <name><surname>McDonald</surname> <given-names>D.</given-names></name> <name><surname>Knights</surname> <given-names>D.</given-names></name> <name><surname>Reyes</surname> <given-names>J. A.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences</article-title>. <source>Nat. Biotechnol.</source> <volume>31</volume>, <fpage>814</fpage>&#x02013;<lpage>821</lpage>. <pub-id pub-id-type="doi">10.1038/nbt.2676</pub-id><pub-id pub-id-type="pmid">23975157</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Larkin</surname> <given-names>A. A.</given-names></name> <name><surname>Garcia</surname> <given-names>C. A.</given-names></name> <name><surname>Garcia</surname> <given-names>N.</given-names></name> <name><surname>Brock</surname> <given-names>M. L.</given-names></name> <name><surname>Lee</surname> <given-names>J. A.</given-names></name> <name><surname>Ustick</surname> <given-names>L. J.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>High spatial resolution global ocean metagenomes from Bio-GO-SHIP repeat hydrography transects</article-title>. <source>Sci. Data</source> <volume>8</volume>, <fpage>107</fpage>. <pub-id pub-id-type="doi">10.1038/s41597-021-00889-9</pub-id><pub-id pub-id-type="pmid">33863919</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Larralde</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes</article-title>. <source>J. Open Source Softw</source>. <volume>7</volume>, <fpage>4296</fpage>. <pub-id pub-id-type="doi">10.21105/joss.04296</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>M. D.</given-names></name></person-group> (<year>2019</year>). <article-title>GToTree: a user-friendly workflow for phylogenomics</article-title>. <source>Bioinformatics</source> <volume>35</volume>, <fpage>4162</fpage>&#x02013;<lpage>4164</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btz188</pub-id><pub-id pub-id-type="pmid">30865266</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Letunic</surname> <given-names>I.</given-names></name> <name><surname>Bork</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation</article-title>. <source>Nucleic Acids Res</source>. <volume>49</volume>, <fpage>W293</fpage>&#x02013;<lpage>W296</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkab301</pub-id><pub-id pub-id-type="pmid">33885785</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lombard</surname> <given-names>V.</given-names></name> <name><surname>Golaconda Ramulu</surname> <given-names>H.</given-names></name> <name><surname>Drula</surname> <given-names>E.</given-names></name> <name><surname>Coutinho</surname> <given-names>P. M.</given-names></name> <name><surname>Henrissat</surname> <given-names>B.</given-names></name></person-group> (<year>2014</year>). <article-title>The carbohydrate-active enzymes database (CAZy) in 2013</article-title>. <source>Nucleic Acids Res</source>. <volume>42</volume>, <fpage>D490</fpage>&#x02013;<lpage>D495</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkt1178</pub-id><pub-id pub-id-type="pmid">24270786</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Louca</surname> <given-names>S.</given-names></name> <name><surname>Jacques</surname> <given-names>S. M. S.</given-names></name> <name><surname>Pires</surname> <given-names>A. P. F.</given-names></name> <name><surname>Leal</surname> <given-names>J. S.</given-names></name> <name><surname>Srivastava</surname> <given-names>D. S.</given-names></name> <name><surname>Parfrey</surname> <given-names>L. W.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>High taxonomic variability despite stable functional structure across microbial communities</article-title>. <source>Nat. Ecol. Evol</source>. <volume>1</volume>, <fpage>0015</fpage>. <pub-id pub-id-type="doi">10.1038/s41559-016-0015</pub-id><pub-id pub-id-type="pmid">28812567</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Louca</surname> <given-names>S.</given-names></name> <name><surname>Parfrey</surname> <given-names>L. W.</given-names></name> <name><surname>Doebeli</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Decoupling function and taxonomy in the global ocean microbiome</article-title>. <source>Science</source> <volume>353</volume>, <fpage>1272</fpage>&#x02013;<lpage>1277</lpage>. <pub-id pub-id-type="doi">10.1126/science.aaf4507</pub-id><pub-id pub-id-type="pmid">27634532</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Louca</surname> <given-names>S.</given-names></name> <name><surname>Polz</surname> <given-names>M. F.</given-names></name> <name><surname>Mazel</surname> <given-names>F.</given-names></name> <name><surname>Albright</surname> <given-names>M. B. N.</given-names></name> <name><surname>Huber</surname> <given-names>J. A.</given-names></name> <name><surname>O&#x00027;Connor</surname> <given-names>M. I.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Function and functional redundancy in microbial systems</article-title>. <source>Nat. Ecol. Evol</source>. <volume>2</volume>, <fpage>936</fpage>&#x02013;<lpage>943</lpage>. <pub-id pub-id-type="doi">10.1038/s41559-018-0519-1</pub-id><pub-id pub-id-type="pmid">29662222</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>Y. Y.</given-names></name> <name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Fuhrman</surname> <given-names>J. A.</given-names></name> <name><surname>Sun</surname> <given-names>F.</given-names></name></person-group> (<year>2016</year>). <article-title>COCACOLA: binning metagenomic contigs using sequence composition, read coverage, co-alignment and paired-end read LinkAge</article-title>. <source>Bioinformatics</source> 3, btw290. <pub-id pub-id-type="doi">10.1093/bioinformatics/btw290</pub-id><pub-id pub-id-type="pmid">27256312</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Martinez-Garcia</surname> <given-names>M.</given-names></name> <name><surname>Brazel</surname> <given-names>D. M.</given-names></name> <name><surname>Swan</surname> <given-names>B. K.</given-names></name> <name><surname>Arnosti</surname> <given-names>C.</given-names></name> <name><surname>Chain</surname> <given-names>P. S. G.</given-names></name> <name><surname>Reitenga</surname> <given-names>K. G.</given-names></name> <etal/></person-group>. (<year>2012</year>). <article-title>Capturing single cell genomes of active polysaccharide degraders: an unexpected contribution of Verrucomicrobia</article-title>. <source>PLoS ONE</source> <volume>7</volume>, <fpage>e35314</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0035314</pub-id><pub-id pub-id-type="pmid">22536372</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McDaniel</surname> <given-names>L. D.</given-names></name> <name><surname>Young</surname> <given-names>E.</given-names></name> <name><surname>Delaney</surname> <given-names>J.</given-names></name> <name><surname>Ruhnau</surname> <given-names>F.</given-names></name> <name><surname>Ritchie</surname> <given-names>K. B.</given-names></name> <name><surname>Paul</surname> <given-names>J. H.</given-names></name> <etal/></person-group>. (<year>2010</year>). <article-title>High frequency of horizontal gene transfer in the oceans</article-title>. <source>Science</source> <volume>330</volume>, <fpage>50</fpage>. <pub-id pub-id-type="doi">10.1126/science.1192243</pub-id><pub-id pub-id-type="pmid">20929803</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nielsen</surname> <given-names>H. B.</given-names></name> <name><surname>Almeida</surname> <given-names>M.</given-names></name> <name><surname>Juncker</surname> <given-names>A. S.</given-names></name> <name><surname>Rasmussen</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Sunagawa</surname> <given-names>S.</given-names></name> <collab>MetaHIT Consortium</collab> <etal/></person-group>. (<year>2014</year>). <article-title>Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes</article-title>. <source>Nat. Biotechnol</source>. <volume>32</volume>, <fpage>822</fpage>&#x02013;<lpage>828</lpage>. <pub-id pub-id-type="doi">10.1038/nbt.2939</pub-id><pub-id pub-id-type="pmid">24997787</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Metcalf</surname> <given-names>W. W.</given-names></name> <name><surname>Wanner</surname> <given-names>B. L.</given-names></name></person-group> (<year>1993</year>). <article-title>Evidence for a fourteen-gene, phnC to phnP locus for phosphonate metabolism in <italic>Escherichia coli</italic></article-title>. <source>Gene</source> <volume>129</volume>, <fpage>27</fpage>&#x02013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1016/0378-1119(93)90692-V</pub-id><pub-id pub-id-type="pmid">8335257</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moran</surname> <given-names>M. A.</given-names></name> <name><surname>Belas</surname> <given-names>R.</given-names></name> <name><surname>Schell</surname> <given-names>M. A.</given-names></name> <name><surname>Gonz&#x000E1;lez</surname> <given-names>J. M.</given-names></name> <name><surname>Sun</surname> <given-names>F.</given-names></name> <name><surname>Sun</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2007</year>). <article-title>Ecological genomics of marine roseobacters</article-title>. <source>Appl. Environ. Microbiol</source>. <volume>73</volume>, <fpage>4559</fpage>&#x02013;<lpage>4569</lpage>. <pub-id pub-id-type="doi">10.1128/AEM.02580-06</pub-id><pub-id pub-id-type="pmid">17526795</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ogata</surname> <given-names>H.</given-names></name> <name><surname>Goto</surname> <given-names>S.</given-names></name> <name><surname>Sato</surname> <given-names>K.</given-names></name> <name><surname>Fujibuchi</surname> <given-names>W.</given-names></name> <name><surname>Bono</surname> <given-names>H.</given-names></name> <name><surname>Kanehisa</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>1999</year>). <article-title>KEGG: Kyoto encyclopedia of genes and genomes</article-title>. <source>Nucleic Acids Res</source>. <volume>27</volume>, <fpage>29</fpage>&#x02013;<lpage>34</lpage>. <pub-id pub-id-type="doi">10.1093/nar/27.1.29</pub-id><pub-id pub-id-type="pmid">9847135</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Oksanen</surname> <given-names>J.</given-names></name> <name><surname>Blanchet</surname> <given-names>F. G.</given-names></name> <name><surname>Friendly</surname> <given-names>M.</given-names></name> <name><surname>Kindt</surname> <given-names>R.</given-names></name> <name><surname>Legendre</surname> <given-names>P.</given-names></name> <name><surname>McGlinn</surname> <given-names>D.</given-names></name> <etal/></person-group> (<year>2019</year>). <article-title>Vegan: community ecology package</article-title>. R package version 2.5-6. <ext-link ext-link-type="uri" xlink:href="https://CRAN.R-project.org/package=vegan">https://CRAN.R-project.org/package=vegan</ext-link> (accessed January 9, 2023).</citation>
</ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ondov</surname> <given-names>B. D.</given-names></name> <name><surname>Treangen</surname> <given-names>T. J.</given-names></name> <name><surname>Melsted</surname> <given-names>P.</given-names></name> <name><surname>Mallonee</surname> <given-names>A. B.</given-names></name> <name><surname>Bergman</surname> <given-names>N. H.</given-names></name> <name><surname>Koren</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Mash: fast genome and metagenome distance estimation using MinHash</article-title>. <source>Genome Biol</source>. 17, 132. <pub-id pub-id-type="doi">10.1186/s13059-016-0997-x</pub-id><pub-id pub-id-type="pmid">27323842</pub-id></citation></ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pachiadaki</surname> <given-names>M. G.</given-names></name> <name><surname>Brown</surname> <given-names>J. M.</given-names></name> <name><surname>Brown</surname> <given-names>J.</given-names></name> <name><surname>Bezuidt</surname> <given-names>O.</given-names></name> <name><surname>Berube</surname> <given-names>P. M.</given-names></name> <name><surname>Biller</surname> <given-names>S. J.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Charting the complexity of the marine microbiome through single-cell genomics</article-title>. <source>Cell</source> <volume>179</volume>, <fpage>1623</fpage>&#x02013;<lpage>1635.e11</lpage>. <pub-id pub-id-type="doi">10.1016/j.cell.2019.11.017</pub-id><pub-id pub-id-type="pmid">31835036</pub-id></citation></ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paoli</surname> <given-names>L.</given-names></name> <name><surname>Ruscheweyh</surname> <given-names>H.-J.</given-names></name> <name><surname>Forneris</surname> <given-names>C. C.</given-names></name> <name><surname>Kautsar</surname> <given-names>S.</given-names></name> <name><surname>Clayssen</surname> <given-names>Q.</given-names></name> <name><surname>Salazar</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Uncharted biosynthetic potential of the ocean microbiome</article-title>. <source>bioRxiv</source> [Preprint]. <pub-id pub-id-type="doi">10.1101/2021.03.24.436479</pub-id></citation>
</ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Parks</surname> <given-names>D. H.</given-names></name> <name><surname>Chuvochina</surname> <given-names>M.</given-names></name> <name><surname>Waite</surname> <given-names>D. W.</given-names></name> <name><surname>Rinke</surname> <given-names>C.</given-names></name> <name><surname>Skarshewski</surname> <given-names>A.</given-names></name> <name><surname>Chaumeil</surname> <given-names>P.-A.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life</article-title>. <source>Nat. Biotechnol</source>. <volume>36</volume>, <fpage>996</fpage>&#x02013;<lpage>1004</lpage>. <pub-id pub-id-type="doi">10.1038/nbt.4229</pub-id><pub-id pub-id-type="pmid">30148503</pub-id></citation></ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Parks</surname> <given-names>D. H.</given-names></name> <name><surname>Imelfort</surname> <given-names>M.</given-names></name> <name><surname>Skennerton</surname> <given-names>C. T.</given-names></name> <name><surname>Hugenholtz</surname> <given-names>P.</given-names></name> <name><surname>Tyson</surname> <given-names>G. W.</given-names></name></person-group> (<year>2015</year>). <article-title>CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes</article-title>. <source>Genome Res.</source> <volume>25</volume>, <fpage>1043</fpage>&#x02013;<lpage>1055</lpage>. <pub-id pub-id-type="doi">10.1101/gr.186072.114</pub-id><pub-id pub-id-type="pmid">25977477</pub-id></citation></ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Parks</surname> <given-names>D. H.</given-names></name> <name><surname>Rinke</surname> <given-names>C.</given-names></name> <name><surname>Chuvochina</surname> <given-names>M.</given-names></name> <name><surname>Chaumeil</surname> <given-names>P.-A.</given-names></name> <name><surname>Woodcroft</surname> <given-names>B. J.</given-names></name> <name><surname>Evans</surname> <given-names>P. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life</article-title>. <source>Nat. Microbiol.</source> <volume>2</volume>, <fpage>1533</fpage>&#x02013;<lpage>1542</lpage>. <pub-id pub-id-type="doi">10.1038/s41564-017-0012-7</pub-id><pub-id pub-id-type="pmid">29234139</pub-id></citation></ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pomeroy</surname> <given-names>L. R.</given-names></name></person-group> (<year>1974</year>). <article-title>The ocean&#x00027;s food web, a changing paradigm</article-title>. <source>BioScience</source> <volume>24</volume>, <fpage>499</fpage>&#x02013;<lpage>504</lpage>. <pub-id pub-id-type="doi">10.2307/1296885</pub-id></citation>
</ref>
<ref id="B56">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Price</surname> <given-names>M. N.</given-names></name> <name><surname>Dehal</surname> <given-names>P. S.</given-names></name> <name><surname>Arkin</surname> <given-names>A. P.</given-names></name></person-group> (<year>2010</year>). <article-title>FastTree 2 &#x02013; approximately maximum-likelihood trees for large alignments</article-title>. <source>PLoS ONE</source> <volume>5</volume>, <fpage>e9490</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0009490</pub-id><pub-id pub-id-type="pmid">20224823</pub-id></citation></ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pritchard</surname> <given-names>J. K.</given-names></name> <name><surname>Stephens</surname> <given-names>M.</given-names></name> <name><surname>Donnelly</surname> <given-names>P.</given-names></name></person-group> (<year>2000</year>). <article-title>Inference of population structure using multilocus genotype data</article-title>. <source>Genetics</source> <volume>155</volume>, <fpage>945</fpage>&#x02013;<lpage>959</lpage>. <pub-id pub-id-type="doi">10.1093/genetics/155.2.945</pub-id><pub-id pub-id-type="pmid">18784791</pub-id></citation></ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Le Qu&#x000E9;r&#x000E9;</surname> <given-names>C.</given-names></name> <name><surname>Harrison</surname> <given-names>S. P.</given-names></name> <name><surname>Prentice</surname> <given-names>I. C.</given-names></name> <name><surname>Buitenhuis</surname> <given-names>E. T.</given-names></name> <name><surname>Aumont</surname> <given-names>O.</given-names></name> <name><surname>Bopp</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2005</year>). <article-title>Ecosystem dynamics based on plankton functional types for global ocean biogeochemistry models</article-title>. <source>Glob. Change Biol</source>. <volume>11</volume>, <fpage>2016</fpage>&#x02013;<lpage>2040</lpage>. <pub-id pub-id-type="doi">10.1111/j.1365-2486.2005.1004.x</pub-id></citation>
</ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Raitsos</surname> <given-names>D. E.</given-names></name> <name><surname>Lavender</surname> <given-names>S. J.</given-names></name> <name><surname>Maravelias</surname> <given-names>C. D.</given-names></name> <name><surname>Haralabous</surname> <given-names>J.</given-names></name> <name><surname>Richardson</surname> <given-names>A. J.</given-names></name> <name><surname>Reid</surname> <given-names>P. C.</given-names></name> <etal/></person-group>. (<year>2008</year>). <article-title>Identifying four phytoplankton functional types from space: an ecological approach</article-title>. <source>Limnol. Oceanogr</source>. <volume>53</volume>, <fpage>605</fpage>&#x02013;<lpage>613</lpage>. <pub-id pub-id-type="doi">10.4319/lo.2008.53.2.0605</pub-id></citation>
</ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rapp&#x000E9;</surname> <given-names>M. S.</given-names></name> <name><surname>Giovannoni</surname> <given-names>S. J.</given-names></name></person-group> (<year>2003</year>). <article-title>The uncultured microbial majority</article-title>. <source>Annu. Rev. Microbiol.</source> <volume>57</volume>, <fpage>369</fpage>&#x02013;<lpage>394</lpage>. <pub-id pub-id-type="doi">10.1146/annurev.micro.57.030502.090759</pub-id><pub-id pub-id-type="pmid">14527284</pub-id></citation></ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rawlings</surname> <given-names>N. D.</given-names></name> <name><surname>Barrett</surname> <given-names>A. J.</given-names></name> <name><surname>Thomas</surname> <given-names>P. D.</given-names></name> <name><surname>Huang</surname> <given-names>X.</given-names></name> <name><surname>Bateman</surname> <given-names>A.</given-names></name> <name><surname>Finn</surname> <given-names>R. D.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>The MEROPS database of proteolytic enzymes, their substrates and inhibitors in 2017 and a comparison with peptidases in the PANTHER database</article-title>. <source>Nucleic Acids Res</source>. <volume>46</volume>, <fpage>D624</fpage>&#x02013;<lpage>D632</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkx1134</pub-id><pub-id pub-id-type="pmid">29145643</pub-id></citation></ref>
<ref id="B62">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reisch</surname> <given-names>C. R.</given-names></name> <name><surname>Moran</surname> <given-names>M. A.</given-names></name> <name><surname>Whitman</surname> <given-names>W. B.</given-names></name></person-group> (<year>2008</year>). <article-title>Dimethylsulfoniopropionate-dependent demethylase (DmdA) from <italic>Pelagibacter ubique</italic> and <italic>Silicibacter pomeroyi</italic></article-title>. <source>J. Bacteriol</source>. <volume>190</volume>, <fpage>8018</fpage>&#x02013;<lpage>8024</lpage>. <pub-id pub-id-type="doi">10.1128/JB.00770-08</pub-id><pub-id pub-id-type="pmid">18849431</pub-id></citation></ref>
<ref id="B63">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reisch</surname> <given-names>C. R.</given-names></name> <name><surname>Moran</surname> <given-names>M. A.</given-names></name> <name><surname>Whitman</surname> <given-names>W. B.</given-names></name></person-group> (<year>2011</year>). <article-title>Bacterial catabolism of dimethylsulfoniopropionate (DMSP)</article-title>. <source>Front. Microbiol</source>. 2, 172. <pub-id pub-id-type="doi">10.3389/fmicb.2011.00172</pub-id></citation>
</ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roth Rosenberg</surname> <given-names>D.</given-names></name> <name><surname>Haber</surname> <given-names>M.</given-names></name> <name><surname>Goldford</surname> <given-names>J.</given-names></name> <name><surname>Lalzar</surname> <given-names>M.</given-names></name> <name><surname>Aharonovich</surname> <given-names>D.</given-names></name> <name><surname>Al-Ashhab</surname> <given-names>A.</given-names></name> <name><surname>Lehahn</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Particle-associated and free-living bacterial communities in an oligotrophic sea are affected by different environmental factors</article-title>. <source>Environ. Microbiol</source>. <volume>23</volume>, <fpage>4295</fpage>&#x02013;<lpage>4308</lpage>. <pub-id pub-id-type="doi">10.1111/1462-2920.15611</pub-id><pub-id pub-id-type="pmid">34036706</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Saltzman</surname> <given-names>E. S.</given-names></name> <name><surname>Cooper</surname> <given-names>W. J.</given-names></name></person-group> (Eds.). (<year>1989</year>). <source>Biogenic Sulfur in the Environment, ACS Symposium Series</source>. <publisher-loc>Washington, DC</publisher-loc>: <publisher-name>American Chemical Society</publisher-name>. <pub-id pub-id-type="doi">10.1021/bk-1989-0393</pub-id></citation>
</ref>
<ref id="B66">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sarkar</surname> <given-names>D.</given-names></name></person-group> (<year>2008</year>). <source>Lattice: Multivariate Data Visualization with R</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>. <pub-id pub-id-type="doi">10.1007/978-0-387-75969-2</pub-id></citation>
</ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>S&#x000E9;f&#x000E9;rian</surname> <given-names>R.</given-names></name> <name><surname>Bopp</surname> <given-names>L.</given-names></name> <name><surname>Gehlen</surname> <given-names>M.</given-names></name> <name><surname>Orr</surname> <given-names>J. C.</given-names></name> <name><surname>Eth&#x000E9;</surname> <given-names>C.</given-names></name> <name><surname>Cadule</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Skill assessment of three earth system models with common marine biogeochemistry</article-title>. <source>Clim. Dyn</source>. <volume>40</volume>, <fpage>2549</fpage>&#x02013;<lpage>2573</lpage>. <pub-id pub-id-type="doi">10.1007/s00382-012-1362-8</pub-id></citation>
</ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sieracki</surname> <given-names>M. E.</given-names></name> <name><surname>Poulton</surname> <given-names>N. J.</given-names></name> <name><surname>Jaillon</surname> <given-names>O.</given-names></name> <name><surname>Wincker</surname> <given-names>P.</given-names></name> <name><surname>de Vargas</surname> <given-names>C.</given-names></name> <name><surname>Rubinat-Ripoll</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Single cell genomics yields a wide diversity of small planktonic protists across major ocean ecosystems</article-title>. <source>Sci. Rep</source>. 9, 6025. <pub-id pub-id-type="doi">10.1038/s41598-019-42487-1</pub-id><pub-id pub-id-type="pmid">30988337</pub-id></citation></ref>
<ref id="B69">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sogin</surname> <given-names>M. L.</given-names></name> <name><surname>Morrison</surname> <given-names>H. G.</given-names></name> <name><surname>Huber</surname> <given-names>J. A.</given-names></name> <name><surname>Welch</surname> <given-names>D. M.</given-names></name> <name><surname>Huse</surname> <given-names>S. M.</given-names></name> <name><surname>Neal</surname> <given-names>P. R.</given-names></name> <etal/></person-group>. (<year>2006</year>). <article-title>Microbial diversity in the deep sea and the underexplored &#x0201C;rare biosphere&#x0201D;</article-title>. <source>Proc. Natl. Acad. Sci</source>. <volume>103</volume>, <fpage>12115</fpage>&#x02013;<lpage>12120</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0605127103</pub-id><pub-id pub-id-type="pmid">16880384</pub-id></citation></ref>
<ref id="B70">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sosa</surname> <given-names>O. A.</given-names></name> <name><surname>Repeta</surname> <given-names>D. J.</given-names></name> <name><surname>Ferr&#x000F3;n</surname> <given-names>S.</given-names></name> <name><surname>Bryant</surname> <given-names>J. A.</given-names></name> <name><surname>Mende</surname> <given-names>D. R.</given-names></name> <name><surname>Karl</surname> <given-names>D. M.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Isolation and characterization of bacteria that degrade phosphonates in marine dissolved organic matter</article-title>. <source>Front. Microbiol</source>. 8, 1786. <pub-id pub-id-type="doi">10.3389/fmicb.2017.01786</pub-id><pub-id pub-id-type="pmid">29085339</pub-id></citation></ref>
<ref id="B71">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Staley</surname> <given-names>C.</given-names></name> <name><surname>Gould</surname> <given-names>T. J.</given-names></name> <name><surname>Wang</surname> <given-names>P.</given-names></name> <name><surname>Phillips</surname> <given-names>J.</given-names></name> <name><surname>Cotner</surname> <given-names>J. B.</given-names></name> <name><surname>Sadowsky</surname> <given-names>M. J.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Core functional traits of bacterial communities in the Upper Mississippi River show limited variation in response to land cover</article-title>. <source>Front. Microbiol</source>. 5, 414. <pub-id pub-id-type="doi">10.3389/fmicb.2014.00414</pub-id><pub-id pub-id-type="pmid">25152748</pub-id></citation></ref>
<ref id="B72">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Steen</surname> <given-names>A. D.</given-names></name> <name><surname>Crits-Christoph</surname> <given-names>A.</given-names></name> <name><surname>Carini</surname> <given-names>P.</given-names></name> <name><surname>DeAngelis</surname> <given-names>K. M.</given-names></name> <name><surname>Fierer</surname> <given-names>N.</given-names></name> <name><surname>Lloyd</surname> <given-names>K. G.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>High proportions of bacteria and archaea across most biomes remain uncultured</article-title>. <source>ISME J</source>. <volume>13</volume>, <fpage>3126</fpage>&#x02013;<lpage>3130</lpage>. <pub-id pub-id-type="doi">10.1038/s41396-019-0484-y</pub-id><pub-id pub-id-type="pmid">31388130</pub-id></citation></ref>
<ref id="B73">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stepanauskas</surname> <given-names>R.</given-names></name> <name><surname>Sieracki</surname> <given-names>M. E.</given-names></name></person-group> (<year>2007</year>). <article-title>Matching phylogeny and metabolism in the uncultured marine bacteria, one cell at a time</article-title>. <source>Proc. Natl. Acad. Sci</source>. <volume>104</volume>, <fpage>9052</fpage>&#x02013;<lpage>9057</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0700496104</pub-id><pub-id pub-id-type="pmid">17502618</pub-id></citation></ref>
<ref id="B74">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Strous</surname> <given-names>M.</given-names></name> <name><surname>Kraft</surname> <given-names>B.</given-names></name> <name><surname>Bisdorf</surname> <given-names>R.</given-names></name> <name><surname>Tegetmeyer</surname> <given-names>H. E.</given-names></name></person-group> (<year>2012</year>). <article-title>The binning of metagenomic contigs for microbial physiology of mixed cultures</article-title>. <source>Front. Microbiol</source>. 3, 410. <pub-id pub-id-type="doi">10.3389/fmicb.2012.00410</pub-id><pub-id pub-id-type="pmid">23227024</pub-id></citation></ref>
<ref id="B75">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sunagawa</surname> <given-names>S.</given-names></name> <name><surname>Coelho</surname> <given-names>L. P.</given-names></name> <name><surname>Chaffron</surname> <given-names>S.</given-names></name> <name><surname>Kultima</surname> <given-names>J. R.</given-names></name> <name><surname>Labadie</surname> <given-names>K.</given-names></name> <name><surname>Salazar</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Structure and function of the global ocean microbiome</article-title>. <source>Science</source> <volume>348</volume>, <fpage>1261359</fpage>. <pub-id pub-id-type="doi">10.1126/science.1261359</pub-id><pub-id pub-id-type="pmid">25999513</pub-id></citation></ref>
<ref id="B76">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Swan</surname> <given-names>B. K.</given-names></name> <name><surname>Tupper</surname> <given-names>B.</given-names></name> <name><surname>Sczyrba</surname> <given-names>A.</given-names></name> <name><surname>Lauro</surname> <given-names>F. M.</given-names></name> <name><surname>Martinez-Garcia</surname> <given-names>M.</given-names></name> <name><surname>Gonzalez</surname> <given-names>J. M.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Prevalent genome streamlining and latitudinal divergence of planktonic bacteria in the surface ocean</article-title>. <source>Proc. Natl. Acad. Sci</source>. <volume>110</volume>, <fpage>11463</fpage>&#x02013;<lpage>11468</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1304246110</pub-id><pub-id pub-id-type="pmid">23801761</pub-id></citation></ref>
<ref id="B77">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Swan</surname> <given-names>B. K.</given-names></name> <name><surname>Martinez-Garcia</surname> <given-names>M.</given-names></name> <name><surname>Preston</surname> <given-names>C. M.</given-names></name> <name><surname>Sczyrba</surname> <given-names>A.</given-names></name> <name><surname>Woyke</surname> <given-names>T.</given-names></name> <name><surname>Lamy</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>Potential for chemolithoautotrophy among ubiquitous bacteria lineages in the dark ocean</article-title>. <source>Science</source> <volume>333</volume>, <fpage>1296</fpage>&#x02013;<lpage>1300</lpage>. <pub-id pub-id-type="doi">10.1126/science.1203690</pub-id><pub-id pub-id-type="pmid">21885783</pub-id></citation></ref>
<ref id="B78">
<citation citation-type="book"><person-group person-group-type="author"><collab>The MathWorks Inc.</collab></person-group> (<year>2021</year>). <source>MATLAB, Version 2021a</source>. <publisher-loc>Natick, MA</publisher-loc>: <publisher-name>The MathWorks Inc</publisher-name>.</citation>
</ref>
<ref id="B79">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tripp</surname> <given-names>H. J.</given-names></name> <name><surname>Kitner</surname> <given-names>J. B.</given-names></name> <name><surname>Schwalbach</surname> <given-names>M. S.</given-names></name> <name><surname>Dacey</surname> <given-names>J. W. H.</given-names></name> <name><surname>Wilhelm</surname> <given-names>L. J.</given-names></name> <name><surname>Giovannoni</surname> <given-names>S. J.</given-names></name> <etal/></person-group>. (<year>2008</year>). <article-title>SAR11 marine bacteria require exogenous reduced sulphur for growth</article-title>. <source>Nature</source> <volume>452</volume>, <fpage>741</fpage>&#x02013;<lpage>744</lpage>. <pub-id pub-id-type="doi">10.1038/nature06776</pub-id><pub-id pub-id-type="pmid">18337719</pub-id></citation></ref>
<ref id="B80">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tully</surname> <given-names>B. J.</given-names></name> <name><surname>Graham</surname> <given-names>E. D.</given-names></name> <name><surname>Heidelberg</surname> <given-names>J. F.</given-names></name></person-group> (<year>2018a</year>). <article-title>The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans</article-title>. <source>Sci. Data</source> <volume>5</volume>, <fpage>170203</fpage>. <pub-id pub-id-type="doi">10.1038/sdata.2017.203</pub-id><pub-id pub-id-type="pmid">29337314</pub-id></citation></ref>
<ref id="B81">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tully</surname> <given-names>B. J.</given-names></name> <name><surname>Wheat</surname> <given-names>C. G.</given-names></name> <name><surname>Glazer</surname> <given-names>B. T.</given-names></name> <name><surname>Huber</surname> <given-names>J. A.</given-names></name></person-group> (<year>2018b</year>). <article-title>A dynamic microbial community with high functional redundancy inhabits the cold, oxic subseafloor aquifer</article-title>. <source>ISME J</source>. <volume>12</volume>, <fpage>1</fpage>&#x02013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1038/ismej.2017.187</pub-id><pub-id pub-id-type="pmid">29099490</pub-id></citation></ref>
<ref id="B82">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ustick</surname> <given-names>L. J.</given-names></name> <name><surname>Larkin</surname> <given-names>A. A.</given-names></name> <name><surname>Garcia</surname> <given-names>C. A.</given-names></name> <name><surname>Garcia</surname> <given-names>N. S.</given-names></name> <name><surname>Brock</surname> <given-names>M. L.</given-names></name> <name><surname>Lee</surname> <given-names>J. A.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Metagenomic analysis reveals global-scale patterns of ocean nutrient limitation</article-title>. <source>Science</source> <volume>372</volume>, <fpage>287</fpage>&#x02013;<lpage>291</lpage>. <pub-id pub-id-type="doi">10.1126/science.abe6301</pub-id><pub-id pub-id-type="pmid">33859034</pub-id></citation></ref>
<ref id="B83">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Venter</surname> <given-names>J. C.</given-names></name></person-group> (<year>2004</year>). <article-title>Environmental genome shotgun sequencing of the Sargasso Sea</article-title>. <source>Science</source> <volume>304</volume>, <fpage>66</fpage>&#x02013;<lpage>74</lpage>. <pub-id pub-id-type="doi">10.1126/science.1093857</pub-id><pub-id pub-id-type="pmid">15001713</pub-id></citation></ref>
<ref id="B84">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wemheuer</surname> <given-names>F.</given-names></name> <name><surname>Taylor</surname> <given-names>J. A.</given-names></name> <name><surname>Daniel</surname> <given-names>R.</given-names></name> <name><surname>Johnston</surname> <given-names>E.</given-names></name> <name><surname>Meinicke</surname> <given-names>P.</given-names></name> <name><surname>Thomas</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Tax4Fun2: prediction of habitat-specific functional profiles and functional redundancy based on 16S rRNA gene sequences</article-title>. <source>Environ. Microbiome</source> <volume>15</volume>, <fpage>11</fpage>. <pub-id pub-id-type="doi">10.1186/s40793-020-00358-7</pub-id><pub-id pub-id-type="pmid">33902725</pub-id></citation></ref>
<ref id="B85">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>White</surname> <given-names>A. K.</given-names></name> <name><surname>Metcalf</surname> <given-names>W. W.</given-names></name></person-group> (<year>2004</year>). <article-title>Two C&#x02013;P lyase operons in <italic>Pseudomonas stutzeri</italic> and their roles in the oxidation of phosphonates, phosphite, and hypophosphite</article-title>. <source>J. Bacteriol.</source> <volume>186</volume>, <fpage>4730</fpage>&#x02013;<lpage>4739</lpage>. <pub-id pub-id-type="doi">10.1128/JB.186.14.4730-4739.2004</pub-id><pub-id pub-id-type="pmid">15231805</pub-id></citation></ref>
<ref id="B86">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wickham</surname> <given-names>H.</given-names></name></person-group> (<year>2009</year>). <source>ggplot2: Elegant Graphics for Data Analysis</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>. <pub-id pub-id-type="doi">10.1007/978-0-387-98141-3</pub-id></citation>
</ref>
<ref id="B87">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>Y.-W.</given-names></name> <name><surname>Simmons</surname> <given-names>B. A.</given-names></name> <name><surname>Singer</surname> <given-names>S. W.</given-names></name></person-group> (<year>2016</year>). <article-title>MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets</article-title>. <source>Bioinformatics</source> <volume>32</volume>, <fpage>605</fpage>&#x02013;<lpage>607</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btv638</pub-id><pub-id pub-id-type="pmid">26515820</pub-id></citation></ref>
<ref id="B88">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>M.</given-names></name> <name><surname>Feng</surname> <given-names>T.</given-names></name> <name><surname>Zhan</surname> <given-names>L.</given-names></name> <name><surname>Zhou</surname> <given-names>L.</given-names></name> <name><surname>Yu</surname> <given-names>G.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Use ggbreak to effectively utilize plotting space to deal with large datasets and outliers</article-title>. <source>Front. Genet</source>. 12, 774846. <pub-id pub-id-type="doi">10.3389/fgene.2021.774846</pub-id><pub-id pub-id-type="pmid">34795698</pub-id></citation></ref>
<ref id="B89">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yarza</surname> <given-names>P.</given-names></name> <name><surname>Yilmaz</surname> <given-names>P.</given-names></name> <name><surname>Pruesse</surname> <given-names>E.</given-names></name> <name><surname>Gl&#x000F6;ckner</surname> <given-names>F. O.</given-names></name> <name><surname>Ludwig</surname> <given-names>W.</given-names></name> <name><surname>Schleifer</surname> <given-names>K.-H.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences</article-title>. <source>Nat. Rev. Microbiol</source>. <volume>12</volume>, <fpage>635</fpage>&#x02013;<lpage>645</lpage>. <pub-id pub-id-type="doi">10.1038/nrmicro3330</pub-id><pub-id pub-id-type="pmid">25118885</pub-id></citation></ref>
<ref id="B90">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yooseph</surname> <given-names>S.</given-names></name> <name><surname>Sutton</surname> <given-names>G.</given-names></name> <name><surname>Rusch</surname> <given-names>D. B.</given-names></name> <name><surname>Halpern</surname> <given-names>A. L.</given-names></name> <name><surname>Williamson</surname> <given-names>S. J.</given-names></name> <name><surname>Remington</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2007</year>). <article-title>The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families</article-title>. <source>PLoS Biol</source>. 5, e16. <pub-id pub-id-type="doi">10.1371/journal.pbio.0050016</pub-id><pub-id pub-id-type="pmid">17355171</pub-id></citation></ref>
<ref id="B91">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zakem</surname> <given-names>E. J.</given-names></name> <name><surname>Cael</surname> <given-names>B. B.</given-names></name> <name><surname>Levine</surname> <given-names>N. M.</given-names></name></person-group> (<year>2021</year>). <article-title>A unified theory for organic matter accumulation</article-title>. <source>Proc. Natl. Acad. Sci. U. S. A</source>. 118, e2016896118. <pub-id pub-id-type="doi">10.1073/pnas.2016896118</pub-id><pub-id pub-id-type="pmid">33536337</pub-id></citation></ref>
<ref id="B92">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Yohe</surname> <given-names>T.</given-names></name> <name><surname>Huang</surname> <given-names>L.</given-names></name> <name><surname>Entwistle</surname> <given-names>S.</given-names></name> <name><surname>Wu</surname> <given-names>P.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>dbCAN2: a meta server for automated carbohydrate-active enzyme annotation</article-title>. <source>Nucleic Acids Res</source>. <volume>46</volume>, <fpage>W95</fpage>&#x02013;<lpage>W101</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gky418</pub-id><pub-id pub-id-type="pmid">29771380</pub-id></citation></ref>
<ref id="B93">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>Z.</given-names></name> <name><surname>Tran</surname> <given-names>P.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Kieft</surname> <given-names>K.</given-names></name> <name><surname>Anantharaman</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>METABOLIC: a scalable high-throughput metabolic and biogeochemical functional trait profiler based on microbial genomes</article-title>. <source>bioRxiv</source> [Preprint]. <pub-id pub-id-type="doi">10.1101/761643</pub-id></citation>
</ref>
</ref-list> 
</back>
</article> 