AUTHOR=Ma Zebing , Liu Yibing , Zhang Ziyan , Chen Rui , Fan Huayu , Cao Xiangyang , Ni Lili TITLE=Clinical applications of large language models in knee osteoarthritis: a systematic review JOURNAL=Frontiers in Medicine VOLUME=Volume 12 - 2025 YEAR=2025 URL=https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2025.1670824 DOI=10.3389/fmed.2025.1670824 ISSN=2296-858X ABSTRACT=Background and aimsKnee osteoarthritis (KOA) is a common chronic degenerative disease that significantly impacts patients’ quality of life. With the rapid advancement of artificial intelligence, large language models (LLMs) have demonstrated potential in supporting medical information extraction, clinical decision-making, and patient education through their natural language processing capabilities. However, the current landscape of LLM applications in the KOA domain, along with their methodological quality, has yet to be systematically reviewed. Therefore, this systematic review aims to comprehensively summarize existing clinical studies on LLMs in KOA, evaluate their performance and methodological rigor, and identify current challenges and future research directions.MethodsFollowing the PRISMA guidelines, a systematic search was conducted in PubMed, Cochrane Library, Embase databases and Web of science for literature published up to June 2025. The protocol was preregistered on the OSF platform. Studies were screened using standardized inclusion and exclusion criteria. Key study characteristics and performance evaluation metrics were extracted. Methodological quality was assessed using tools such as Cochrane RoB, STROBE, STARD, and DISCERN. Additionally, the CLEAR-LLM and CliMA-10 frameworks were applied to provide complementary evaluations of quality and performance.ResultsA total of 16 studies were included, covering various LLMs such as ChatGPT, Gemini, and Claude. Application scenarios encompassed text generation, imaging diagnostics, and patient education. Most studies were observational in nature, and overall methodological quality ranged from moderate to high. Based on CliMA-10 scores, LLMs exhibited upper-moderate performance in KOA-related tasks. The ChatGPT-4 series consistently outperformed other models, especially in structured output generation, interpretation of clinical terminology, and content accuracy. Key limitations included insufficient sample representativeness, inconsistent control over hallucinated content, and the lack of standardized evaluation tools.ConclusionLarge language models show notable potential in the KOA field, but their clinical application is still exploratory and limited by issues such as sample bias and methodological heterogeneity. Model performance varies across tasks, underscoring the need for improved prompt design and standardized evaluation frameworks. With real-world data and ethical oversight, LLMs may contribute more significantly to personalized KOA management.Systematic review registrationhttps://osf.io/jy4kz, identifier 10.17605/OSF.IO/479R8.