AUTHOR=Dianati-Nasab Mostafa , Salimifard Khodakaram , Mohammadi Reza , Saadatmand Sara , Fararouei Mohammad , Hosseini Kosar S. , Jiavid-Sharifi Behshid , Chaussalet Thierry , Dehdar Samira TITLE=Machine learning algorithms to uncover risk factors of breast cancer: insights from a large case-control study JOURNAL=Frontiers in Oncology VOLUME=Volume 13 - 2023 YEAR=2024 URL=https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2023.1276232 DOI=10.3389/fonc.2023.1276232 ISSN=2234-943X ABSTRACT=This large case-control study explored the application of machine learning models to identify risk factors for primary invasive incident breast cancer (BC) in the Iranian population. This study serves as a bridge towards improved BC prevention, early detection, and management through the identification of modifiable and unmodifiable risk factors. The dataset includes 1,009 cases and 1,009 controls, with comprehensive data on lifestyle, health-behavior, reproductive, and sociodemographic factors. Different machine learning models, namely Random Forest (RF), Neural Networks (NN), Bootstrap Aggregating Classification and Regression Trees (Bagged CART), and Extreme Gradient Boosting Tree (XGBoost), were employed to analyze the data. The findings highlight the significance of a chest X-ray history, deliberate weight loss, abortion history, and post-menopausal status as predictors. Factors such as second-hand smoking, lower education, menarche age (>14), occupation (employed), first delivery age (18-23), and breastfeeding duration (>42 months) were also identified as important predictors in multiple models. Receiver Operating Characteristic (ROC) analysis showed that the RF model had the highest Area Under the Curve (AUC) at 0.9, followed by Bagged CART (0.89) and XGBoost (0.78), while the NN model had the lowest AUC at 0.74.
In terms of classification performance, the RF model achieved an accuracy of 83.9% and a Kappa coefficient of 0.678, while the XGBoost model achieved a lower accuracy of 82.5% and a lower Kappa coefficient of 0.6. This study could inform targeted preventive measures, based on the main risk factors for BC, among high-risk women.