AUTHOR=Wang Yubo , Li Liguan , Xia Yu , Zhang Tong TITLE=Reliable and Scalable Identification and Prioritization of Putative Cellulolytic Anaerobes With Large Genome Data JOURNAL=Frontiers in Bioinformatics VOLUME=Volume 2 - 2022 YEAR=2022 URL=https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2022.813771 DOI=10.3389/fbinf.2022.813771 ISSN=2673-7647 ABSTRACT=In the era of high-throughput sequencing, genetic information that hints at microbes' functional niches is becoming easily accessible; however, a bottleneck remains in properly identifying and characterizing these genetic hints to infer those functional niches precisely. In genome-centric interpretation of the specific functional niche of cellulose hydrolysis in anaerobes, practitioners often lack confidence in predicting an anaerobe's real cellulolytic competency based solely on the abundances of the various annotated carbohydrate-active enzyme (CAZy) modules or on taxonomic affiliation. Recognizing the synergy machineries, which include but are not limited to the cellulosome gene clusters, is as important as annotating individual carbohydrate-active modules or genes. In interpreting the complete genomes of 2642 microbial strains with well-documented phenotypes, and by incorporating automatic recognition of the synergy among the annotated carbohydrate-active elements, we found that an explicit genotype-phenotype correlation is feasible for cellulolytic anaerobes, and we developed a bioinformatic pipeline accordingly. This genome-centric pipeline categorizes cellulolytic anaerobes into 5 genotype groups corresponding to differential cellulose-hydrolyzing capacity and varying synergy mechanisms. 
This genotype-phenotype correlation analysis suggested a finer categorization of the cellulosome gene clusters: although cellulosome complexes by their nature enable the assembly of a number of carbohydrate-active units, they do not guarantee cellulose-enzyme-microbe (CEM) complex formation or cellulose-hydrolyzing activity in the corresponding anaerobe strains; the well-known Clostridium acetobutylicum strains are one example. The analysis also revealed the genetic foundation of a previously unrecognized machinery that may mediate microbe-cellulose adhesion: specifically, enzymes encoded by genes harboring both the SLH module and the cellulose-binding CBM module. The applicability of this pipeline to scalable annotation of large genome datasets was further tested by annotating 7902 reference genomes downloaded from NCBI, from which 14 genomes of putative paradigm cellulose-hydrolyzing anaerobes were identified. We believe the pipeline developed in this study will be a useful addition as a bioinformatic tool for genome-centric interpretation of uncultivated anaerobes, specifically regarding their functional niche of cellulose hydrolysis.
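
The SLH-plus-CBM criterion mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the input format (a mapping from gene ID to annotated module names), the function name `genes_with_slh_and_cbm`, the gene IDs, and the set of CBM families treated as cellulose-binding are all assumptions made for illustration.

```python
# Illustrative placeholder only: NOT the paper's list of cellulose-binding CBM families.
CELLULOSE_BINDING_CBMS = {"CBM2", "CBM3"}


def genes_with_slh_and_cbm(gene_modules, cbm_families=CELLULOSE_BINDING_CBMS):
    """Return sorted IDs of genes annotated with both an SLH module
    and at least one module from the given CBM family set.

    gene_modules: dict mapping a gene ID to a list of module names
    (a hypothetical, simplified view of a per-gene annotation table).
    """
    wanted = set(cbm_families)
    hits = []
    for gene_id, modules in gene_modules.items():
        module_set = set(modules)
        # A gene qualifies only when both kinds of module sit on the same gene.
        if "SLH" in module_set and module_set & wanted:
            hits.append(gene_id)
    return sorted(hits)


# Hypothetical annotations for four genes of one genome.
example = {
    "gene_001": ["GH9", "CBM3", "SLH"],
    "gene_002": ["GH5", "CBM3"],
    "gene_003": ["SLH"],
    "gene_004": ["SLH", "CBM2", "GH48"],
}

print(genes_with_slh_and_cbm(example))  # prints ['gene_001', 'gene_004']
```

Requiring both modules on the same gene, rather than anywhere in the genome, reflects the abstract's wording ("genes harboring both"); a real analysis would substitute the actual annotation output and a curated family list.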