AUTHOR=Baranwal Sanskriti, Sanchez Ricardo Avila, Edet Clement-Andi, Chastain Erick, Toby Inimary
  
TITLE=Optimizing clustering of CDR3 sequences using natural language processing, Word2Vec, and KMeans
  
JOURNAL=Frontiers in Bioinformatics
  
VOLUME=Volume 5 - 2025
  
YEAR=2025
  
URL=https://www.frontiersin.org/journals/bioinformatics/articles/10.3389/fbinf.2025.1623488
  
DOI=10.3389/fbinf.2025.1623488
  
ISSN=2673-7647
  
ABSTRACT=T-cell receptor (TCR) sequencing has emerged as a powerful tool for understanding adaptive immune responses, yet challenges persist in deciphering the immense diversity of Complementarity-Determining Region 3 (CDR3) sequences. This study presents a novel natural language processing (NLP)-based pipeline to cluster CDR3 sequences from TCR β-chain repertoires using Word2Vec embeddings, principal component analysis (PCA), and KMeans clustering. Focusing on Acute Respiratory Distress Syndrome (ARDS), a life-threatening inflammatory lung condition, we trained Word2Vec models on healthy controls and applied unsupervised clustering across ARDS, non-ARDS, and control datasets. Dimensionality-reduced embeddings revealed clear distinctions in repertoire structure: control samples exhibited tight, low-diversity clusters; ARDS patients showed high dispersion and numerous diffuse clusters indicative of repertoire disruption; and non-ARDS samples displayed intermediate organization. These differences suggest that immune activation states are embedded in the structural topology of the CDR3 space. Our framework successfully captured these latent patterns, offering a scalable approach to biomarker discovery. This study not only reinforces the utility of NLP in immunological analysis but also paves the way for data-driven immune monitoring in critical care and personalized diagnostics.