AUTHOR=Schottlender Gustavo, Prieto Juan Manuel, Marti Marcelo A., Fernández Do Porto Dario

TITLE=Beyond Tanimoto: a learned bioactivity similarity index enhances ligand discovery

JOURNAL=Frontiers in Bioinformatics

VOLUME=Volume 5 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2025.1695353

DOI=10.3389/fbinf.2025.1695353

ISSN=2673-7647

ABSTRACT=Structural similarity metrics such as the Tanimoto coefficient (TC) miss many functionally related compounds—indeed, 60% of similarly bioactive ligand pairs in the ChEMBL database show TC < 0.30, revealing a major blind spot that constrains ligand-based discovery. Our motivation is to overcome this blind spot and enable the recovery of structurally different yet functionally equivalent chemotypes that structure-based similarity fails to detect. Here, we introduce the bioactivity similarity index (BSI), a machine learning model that estimates the probability that two molecules bind the same or related protein receptors. Trained under leave-one-protein-out (LOPO) across Pfam-defined protein groups on dissimilar pairs, BSI not only outperforms TC but also surpasses modern molecular embedding baselines (ChemBERTa and contrastive language-molecule pre-training (CLAMP), using cosine similarity) across protein families. We further develop a cross-family model (BSI-Large) that, while slightly below group-specific models, generalizes better and can be fine-tuned with less data, consistently improving over models trained from scratch. In retrospective validation on new ChEMBL v35 data, BSI achieves strong early-retrieval performance (top 2% enrichment factor, EF2%), with group-specific models delivering the best enrichment, and BSI-Large remaining competitive. In a realistic virtual screening-like scenario against the target gene ADRA2B, the mean rank of the next active, given a known active, improves from 45.2 (TC) to 3.9 (BSI), with 54.9 for ChemBERTa and 28.6 for CLAMP. Altogether, BSI complements, rather than replaces, structure-based similarity and embedding-based comparisons, extending hit finding to remote chemotypes that are structurally dissimilar yet functionally equivalent. The code is available at https://github.com/gschottlender/bioactivity-similarity-index.
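
NOTE=The two standard quantities named in the abstract, the Tanimoto coefficient (TC) and the top-2% enrichment factor (EF2%), can be illustrated with a minimal sketch. This is not the authors' code (see the repository linked above for that). It assumes fingerprints are given as sets of on-bit indices and that EF is defined as the hit rate in the top-ranked fraction divided by the overall hit rate. The function names `tanimoto` and `enrichment_factor` and the toy data are hypothetical, chosen only for illustration.

```python
from math import ceil


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def enrichment_factor(scores, labels, fraction=0.02):
    """Hit rate among the top-scoring `fraction` divided by the overall hit rate."""
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, ceil(fraction * n))
    # Rank compounds from highest to lowest similarity score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (total_actives / n)


# Toy fingerprints (assumed data): 2 shared bits out of 8 distinct bits.
fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6, 7, 8}
tc = tanimoto(fp_a, fp_b)  # 0.25, i.e. below the 0.30 threshold

# Toy screen (assumed data): 100 compounds, 5 actives, 2 of them ranked first.
scores = [100 - i for i in range(100)]
labels = [0] * 100
for idx in (0, 1, 10, 50, 90):
    labels[idx] = 1
ef2 = enrichment_factor(scores, labels, fraction=0.02)
# Both top-ranked compounds are active against a 5% base rate, so EF2% is 20.
```

In this sketch any similarity score can be passed as `scores`, whether TC, an embedding cosine similarity, or a learned probability such as the one BSI produces, which is what makes EF2% usable for comparing them on the same ranked list.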