AUTHOR=Noor Saleha, Hussain Zamir, Hamdan Qurrat Ulain, Zaman Mehwish, Paracha Rehan Zafar, Zahra Shamsi Syeda Aneela TITLE=Leveraging data augmentation for machine learning models in predicting depression and anxiety using the Revised Child Anxiety and Depression Scale clinical reports JOURNAL=Frontiers in Psychiatry VOLUME=Volume 16 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/psychiatry/articles/10.3389/fpsyt.2025.1672178 DOI=10.3389/fpsyt.2025.1672178 ISSN=1664-0640 ABSTRACT=Objective: An estimated 15 million people are affected by depression and anxiety in Pakistan. However, there are relatively few government mental health facilities and certified psychiatrists. This highlights the need for efficient assessments to implement intervention strategies and address these challenges. This study aims to utilize machine learning with the Revised Child Anxiety and Depression Scale (RCADS) to maximize the use of current healthcare resources and facilitate depression and anxiety screening. Methods: The dataset included 138 cases, of which 89 were retained after cleaning, with the 47 RCADS items as features. Based on RCADS-47 T-scores, cases were classified as normal, borderline, or clinical, with 55% in the normal, 7% in the borderline, and 38% in the clinical range. Three feature selection methods were applied: the Chi-square test of independence, Spearman’s correlation, and Random Forest-Recursive Feature Elimination. Data augmentation used the probability distribution of the existing data to generate hybrid-synthetic, correlated, discrete multinomial variants of each RCADS item. 
Six commonly employed ML algorithms (Decision Tree, Random Forest, Support Vector Machine, Logistic Regression, Naive Bayes, and K-Nearest Neighbor) were trained on the original dataset; the three best-performing models were then evaluated on the augmented datasets, and the best of these was further validated on an external dataset. Results: Item 05 of the RCADS had a weak correlation with the evaluation of depression and anxiety in the study population. Augmenting the data to four times its original size was found to be the optimal ratio for our dataset, as Random Forest yielded the best overall results, with up to 81% macro-average accuracy, precision, recall, and F1 score when tested on these data. Conclusion: The findings suggest that the Random Forest algorithm using 46 features suits the data well and has the potential to be developed further as a decision support system for the professionals concerned, improving on the usual approach to screening for anxiety and depression in children and adolescents.
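
The augmentation step described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `augment_by_item_distribution`, the toy data, and the 0 to 3 item scale are assumptions made for illustration. The sketch also samples each item independently from its empirical distribution within a class, which is a simplification; the abstract describes correlated multinomial variants, and preserving inter-item correlation would need an additional step not shown here.

```python
import random
from collections import Counter


def augment_by_item_distribution(rows, labels, factor, seed=0):
    """Return the real rows plus synthetic rows, `factor` times the input size.

    Hypothetical helper for illustration only. Each synthetic row takes the
    label of a randomly chosen real row, then draws every item from the
    empirical distribution of that item among real rows with the same label.
    Items are drawn independently (a simplifying assumption).
    """
    rng = random.Random(seed)
    n_items = len(rows[0])

    # Count how often each response value occurs, per class and per item.
    counts = {}
    for row, label in zip(rows, labels):
        per_item = counts.setdefault(label, [Counter() for _ in range(n_items)])
        for j, value in enumerate(row):
            per_item[j][value] += 1

    # Keep the real cases, so the result is a hybrid of real and synthetic data.
    out_rows = [list(r) for r in rows]
    out_labels = list(labels)

    n_synthetic = (factor - 1) * len(rows)
    for _ in range(n_synthetic):
        # Sampling a label from the real labels keeps class proportions
        # roughly similar to the original data.
        label = rng.choice(labels)
        synthetic = []
        for counter in counts[label]:
            values = list(counter.keys())
            weights = list(counter.values())
            synthetic.append(rng.choices(values, weights=weights, k=1)[0])
        out_rows.append(synthetic)
        out_labels.append(label)
    return out_rows, out_labels


# Toy data: three items scored 0 to 3 (assumed scale), two classes.
rows = [[0, 1, 0], [1, 0, 0], [3, 2, 3], [2, 3, 3]]
labels = ["normal", "normal", "clinical", "clinical"]

# Augment to four times the original size, the ratio reported as optimal.
aug_rows, aug_labels = augment_by_item_distribution(rows, labels, factor=4)
print(len(aug_rows), Counter(aug_labels))
```

Because synthetic values are drawn only from responses already observed within a class, the augmented set cannot contain item scores that never occurred in the real data for that class. Any evaluation on such data should keep the test cases separate from the cases used to build the sampling distributions.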