AUTHOR=Kasani Payam Hosseinzadeh , Yun Cheol-Heui , Cho Kee Hyun , Jeong Su Jin TITLE=Neonatal gut microbiota stratification and identification of SCFA-associated microbial subgroups using unsupervised clustering and machine learning classification JOURNAL=Frontiers in Microbiology VOLUME=Volume 16 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2025.1668451 DOI=10.3389/fmicb.2025.1668451 ISSN=1664-302X ABSTRACT=BackgroundThe neonatal gut microbiome plays a critical role in infant health through the production of short-chain fatty acids (SCFAs). However, the organization of SCFAs-producing microbial communities in neonates remains poorly characterized. This study applied unsupervised clustering and machine learning to classify microbial subgroups associated with SCFAs production, providing insight into their composition and metabolic potential.MethodsThis study recruited 71 mother-infant pairs from Kangwon National University Hospital and Bundang CHA Hospital, collecting meconium samples within five days postpartum. Microbial diversity was analyzed by 16S rRNA gene sequencing (V3–V4 region) at the genus level, and SCFAs were quantified from the same samples. To identify functionally distinct microbial subgroups, K-Means, Agglomerative, Spectral, and Gaussian Mixture Model clustering were applied. Clustering validity was assessed using Silhouette Score, Calinski-Harabasz Index, Davies-Bouldin Index, and Prediction Strength Validation, with t-distributed Stochastic Neighbor Embedding (t-SNE) visualization to evaluate cluster separation. SCFAs distributions across clusters were compared, while random forest and logistic regression models were used to classify SCFAs-associated microbial clusters through Receiver Operating Characteristic curves (ROC).ResultsThe clustering analysis identified distinct microbial subgroups linked to SCFAs production, with Agglomerative clustering outperforming K-Means in capturing functionally relevant structures. Cluster 1 had higher SCFAs levels, enriched in Bacteroides, Prevotella, and Enterococcus, while Cluster 2 exhibited lower SCFAs concentrations with a more heterogeneous composition. The introduction of a third cluster in multi-class analysis revealed an intermediate metabolic profile, suggesting a continuum in microbial metabolic function. Classification analysis confirmed random forest model superiority, achieving ROC score of 91.05% (Agglomerative) and 87.74% (K-Means) in binary classification, and 92.98% (Agglomerative) and 89.84% (K-Means) in multi-class classification, demonstrating RF’s strong predictive ability for SCFAs-based clusters.ConclusionUnsupervised clustering combined with classification analysis effectively predict SCFAs-associated subgroups and paving the way for future research on longitudinal tracking and functional genomic integration in early-life metabolic health.