AUTHOR=Lu Qingfeng, Chen Fengxia, Li Qianyue, Chen Lihong, Tong Ling, Tian Geng, Zhou Xiaohong TITLE=A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data JOURNAL=Frontiers in Oncology VOLUME=Volume 12 - 2022 YEAR=2022 URL=https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2022.832567 DOI=10.3389/fonc.2022.832567 ISSN=2234-943X ABSTRACT=Cancers of unknown primary site (CUP) are a heterogeneous group of cancers whose site of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3-5% of all human malignancies. The traditional treatment for CUP is broad-spectrum chemotherapy, but this approach is relatively inefficient and costly. Accurately detecting the primary site of CUP is therefore important to clinical cancer research. We downloaded microarray-based gene expression profiles of 59385 genes for 5708 samples from The Cancer Genome Atlas (TCGA) and of 6364 genes for 3101 samples from the Gene Expression Omnibus (GEO). We developed an XGBoost framework to trace the primary site of CUP. We selected 800 and 500 genes from the TCGA and GEO datasets, respectively, to train an XGBoost model for identifying the primary site of CUP. The overall 5-fold cross-validation accuracy of our method was 97.3% and 97.1% on the TCGA and GEO training sets, respectively, while the macro-precision on the independent data sets reached 98.78% and 99%, respectively. Our XGBoost framework can reduce the cost of tracing a cancer's primary site while remaining highly efficient, and it is promising for use in clinical cancer research practice.
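
The abstract reports macro-precision, which is the unweighted mean of the per-class precision, so each primary site counts equally regardless of how many samples it has. The following is a minimal sketch of that metric in plain Python. The function name `macro_precision` and the tissue labels are hypothetical illustrations, not the authors' code or data, and scoring a class that is never predicted as 0.0 is an assumed convention.

```python
from collections import defaultdict


def macro_precision(y_true, y_pred):
    """Return the unweighted mean of per-class precision.

    Classes are taken from the union of true and predicted labels.
    A class that is never predicted contributes 0.0 (assumed convention).
    """
    true_pos = defaultdict(int)   # correct predictions, keyed by predicted class
    false_pos = defaultdict(int)  # wrong predictions, keyed by predicted class
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            true_pos[pred] += 1
        else:
            false_pos[pred] += 1

    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in classes:
        predicted = true_pos[label] + false_pos[label]
        per_class.append(true_pos[label] / predicted if predicted else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical primary-site labels, for illustration only.
y_true = ["lung", "lung", "breast", "colon", "colon", "colon"]
y_pred = ["lung", "breast", "breast", "colon", "colon", "lung"]
score = macro_precision(y_true, y_pred)
print(f"macro-precision: {score:.4f}")
```

In this toy example the per-class precisions are 0.5 (breast), 1.0 (colon), and 0.5 (lung), so the macro-precision is their mean, 2/3.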