AUTHOR=Zhang Zihan , Wang Yixuan 

TITLE=BioSemAF-BiLSTM: a protein sequence feature extraction framework based on semantic and evolutionary information

JOURNAL=Frontiers in Genetics

VOLUME=Volume 16 - 2025

YEAR=2025

URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2025.1616880

DOI=10.3389/fgene.2025.1616880

ISSN=1664-8021

ABSTRACT=S-sulfenylation is a critical post-translational modification that plays an important role in regulating protein function, redox signaling, and maintaining cellular homeostasis. Accurate identification of S-sulfenylation sites is essential for understanding its biological significance and relevance to disease. However, the exclusive detection of S-sulfenylation sites through experimental methods remains challenging, as these approaches are often time-consuming and costly. Motivated by this issue, the present work proposed a deep learning-based computational framework, named BioSemAF-BiLSTM, which integrated evolutionary and semantic features to improve the prediction performance of S-sulfenylation sites. The framework employed fastText to generate subword-based sequence embeddings that captured local contextual information, and employed position-specific scoring matrices (PSSMs) to extract evolutionary conservation features. Importantly, we also quantitatively evaluated feature sufficiency at the protein sequence level using a sequence compression-based measure approximating Kolmogorov complexity, revealing an 11% information loss rate in predictive modeling using these features. These representations were subsequently fed into a bidirectional long short-term memory (BiLSTM) network to model long-range dependencies, and were further refined via an adaptive feature fusion module to enhance feature interaction. Experimental results on a benchmark dataset demonstrated that the model significantly outperformed conventional machine learning methods and current state-of-the-art deep learning approaches, achieving an accuracy of 89.32% on an independent test. It demonstrated improved sensitivity and specificity, effectively bridging the gap between bioinformatics and deep learning, and offered a robust computational tool for predicting post-translational modification sites.