AUTHOR=de Oliveira Fábio Henrique Schuster, Gomes Felipe Acker, Feltes Bruno César
TITLE=Benchmarking multiple gene ontology enrichment tools reveals high biological significance, ranking, and stringency heterogeneity among datasets
JOURNAL=Frontiers in Bioinformatics
VOLUME=Volume 6 - 2026
YEAR=2026
URL=https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2026.1755664
DOI=10.3389/fbinf.2026.1755664
ISSN=2673-7647
ABSTRACT=Functional enrichment analysis (FEA) extracts biological meaning from lists of differentially expressed genes and proteins obtained through omics experiments. FEA tools can employ numerous statistical methods and rely on different pathway databases. Among these methods, overrepresentation analysis (ORA) is one of the most popular, and Gene Ontology (GO) is arguably the most widely used pathway knowledgebase in FEA. Hence, benchmarking the biological accuracy of ORA-based GO enrichment tools is crucial. Nevertheless, benchmark studies in FEA tend to focus excessively on performance-based metrics rather than on the biological information contained in enrichment results. To identify the differences between popular ORA-based GO enrichment tools and to provide data that offer insight into the tools’ biological accuracy and thus better suit the application of FEA, we tested 12 popular GO enrichment tools (i.e., DAVID, PANTHER, WebGestalt, Enrichr, ShinyGO, limma, topGO, GOstats, clusterProfiler, g:Profiler, ClueGO, and BiNGO) with randomized datasets as negative controls, a target-oriented dataset and a hallmark dataset as positive controls, and an experiment-derived dataset. Gene sets with 500, 200, 100, and 50 genes were built for each dataset to investigate the impact of input size.
Using the control datasets, we calculated the false positive rate (FPR) and accuracy of the tools based on the semantic similarity between the enriched terms and the target ontologies. We also assessed often-overlooked metrics that reflect the biological informativeness of the results, such as the specificity of enriched GO terms and the prioritization of target ontologies. Additionally, we clustered the FEA results based on term semantic similarity, enabling us to directly compare the biological profiles generated by each tool. Despite employing the same method and functional database, the tools’ results diverged significantly. Our findings reveal considerable variation among tools in the informativeness and interpretability of their results. Some tools demonstrated strong capabilities in prioritizing target pathways, while others struggled, especially as input size increased. We also observed that the degree to which the enriched ontologies are related to the expected targets varies across tools, with some being more conservative than others. Together, these results provide powerful insights into the performance characteristics of the analyzed GO enrichment tools and yield new, relevant data for benchmarking FEA tools.