<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="brief-report">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2025.1504805</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Perspective</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Bridging the gap: a practical step-by-step approach to warrant safe implementation of large language models in healthcare</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Workum</surname> <given-names>Jessica D.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2854841/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>van de Sande</surname> <given-names>Davy</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2857699/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Gommers</surname> <given-names>Diederik</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1718456/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>van Genderen</surname> <given-names>Michel E.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Adult Intensive Care, Erasmus MC University Medical Center</institution>, <addr-line>Rotterdam</addr-line>, <country>Netherlands</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Intensive Care, Elisabeth-TweeSteden Hospital</institution>, <addr-line>Tilburg</addr-line>, <country>Netherlands</country></aff>
<aff id="aff3"><sup>3</sup><institution>Erasmus MC Datahub, Erasmus MC University Medical Center</institution>, <addr-line>Rotterdam</addr-line>, <country>Netherlands</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Steffen Pauws, Tilburg University, Netherlands</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Giacomo Rossettini, University of Verona, Italy</p>
<p>Paolo Marcheschi, Gabriele Monasterio Tuscany Foundation (CNR), Italy</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Michel E. van Genderen <email>m.vangenderen&#x00040;erasmusmc.nl</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>27</day>
<month>01</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>8</volume>
<elocation-id>1504805</elocation-id>
<history>
<date date-type="received">
<day>01</day>
<month>10</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>01</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2025 Workum, van de Sande, Gommers and van Genderen.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Workum, van de Sande, Gommers and van Genderen</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Large Language Models (LLMs) offer considerable potential to enhance various aspects of healthcare, from aiding with administrative tasks to clinical decision support. However, despite the growing use of LLMs in healthcare, a critical gap persists in clear, actionable guidelines available to healthcare organizations and providers to ensure their responsible and safe implementation. In this paper, we propose a practical step-by-step approach to bridge this gap and support healthcare organizations and providers in warranting the responsible and safe implementation of LLMs into healthcare. The recommendations in this manuscript include protecting patient privacy, adapting models to healthcare-specific needs, adjusting hyperparameters appropriately, ensuring proper medical prompt engineering, distinguishing between clinical decision support (CDS) and non-CDS applications, systematically evaluating LLM outputs using a structured approach, and implementing a solid model governance structure. We furthermore propose the ACUTE mnemonic; a structured approach for assessing LLM responses based on Accuracy, Consistency, semantically Unaltered outputs, Traceability, and Ethical considerations. Together, these recommendations aim to provide healthcare organizations and providers with a clear pathway for the responsible and safe implementation of LLMs into clinical practice.</p></abstract>
<kwd-group>
<kwd>large language models</kwd>
<kwd>responsible AI</kwd>
<kwd>artificial intelligence</kwd>
<kwd>health care quality</kwd>
<kwd>access and evaluation</kwd>
<kwd>disruptive technology</kwd>
</kwd-group>
<counts>
<fig-count count="1"/>
<table-count count="2"/>
<equation-count count="0"/>
<ref-count count="35"/>
<page-count count="8"/>
<word-count count="5796"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Medicine and Public Health</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Large language models (LLMs) are artificial intelligence (AI) systems with the inherent capability of processing and interpreting natural language (Thirunavukarasu et al., <xref ref-type="bibr" rid="B27">2023</xref>). LLMs show promise in transforming healthcare, offering a newfound flexibility in that, like a Swiss army knife, one single tool can be used for various applications, including administrative support and clinical decision-making (Schoonbeek et al., <xref ref-type="bibr" rid="B25">2024</xref>; Levra et al., <xref ref-type="bibr" rid="B12">2024</xref>). For example, LLMs can aid clinicians by efficiently summarizing medical records and crafting discharge documents. A recent study by Schoonbeek et al. demonstrated that the GPT-4 model was as complete and correct as the clinician in summarizing clinical notes in preparation for outpatient visits, while being 28 times faster (Schoonbeek et al., <xref ref-type="bibr" rid="B25">2024</xref>). Furthermore, LLMs have been shown to offer a level of empathy in responding to patient questions that could surpass human clinicians (Ayers et al., <xref ref-type="bibr" rid="B2">2023</xref>; Luo et al., <xref ref-type="bibr" rid="B13">2024</xref>). Beyond these administrative or documentation tasks, the application of LLMs in healthcare can be expanded to clinical decision support. For example, when comparing the performance of an LLM to medical-journal readers in diagnosing complex real-world cases, the LLM outperformed its human counterparts with 57% vs. 36% correct diagnoses (Eriksen et al., <xref ref-type="bibr" rid="B5">2023</xref>). These examples represent a mere subset of potential applications of LLMs in healthcare, with the scope continuously expanding at a rapid pace.</p>
<p>When used for clinical decision support (CDS), LLMs are likely to be considered a medical device and thus have to adhere to strict legislation, requiring thorough assessment to ensure quality standards (Keutzer and Simonsson, <xref ref-type="bibr" rid="B11">2020</xref>; Jackups, <xref ref-type="bibr" rid="B9">2023</xref>). However, for non-CDS applications, there is a lack of robust frameworks and regulatory oversight to ensure high quality output and responsible use of these models in clinical settings. Furthermore, existing legislation provides little guidance on responsible and safe implementation of LLMs from the healthcare organization or provider&#x00027;s perspective. This problem has also recently been identified by the World Health Organization in their report on Ethics and Governance of AI for Health (World Health Organization, <xref ref-type="bibr" rid="B30">2024</xref>). Existing frameworks remain largely abstract and provide limited practical guidance (Raza et al., <xref ref-type="bibr" rid="B23">2024</xref>). Thus, despite the growing use of LLMs in healthcare, a critical gap persists in clear, actionable guidelines for healthcare organizations and providers to ensure their responsible and safe implementation. In this paper, we propose a practical step-by-step approach, combined with an evaluation framework, to bridge this gap and support healthcare organizations and providers in warranting the responsible and safe implementation of LLMs into healthcare (<xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>A step-by-step approach <bold>(A)</bold> and a structured evaluation approach (ACUTE) <bold>(B)</bold> for the responsible and safe use and development of LLMs in healthcare. CDS, Clinical Decision Support.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-08-1504805-g0001.tif"/>
</fig>
</sec>
<sec id="s2">
<title>(1) Protect patient privacy</title>
<p>LLMs have the potential to inadvertently reveal sensitive information to third parties if this information has been previously used as an input in the LLM (Open et al., <xref ref-type="bibr" rid="B19">2023</xref>). Currently, protective measures (i.e., safeguards) to prevent such data leaks are inconsistent, leaving gaps in privacy protection (Yao et al., <xref ref-type="bibr" rid="B32">2024</xref>). Importantly, in the context of healthcare, adherence to legal frameworks designed to safeguard personal data, such as the General Data Protection Regulation (GDPR) in Europe and the Health Insurance Portability and Accountability Act (HIPAA) in the U.S., is crucial. These regulations mandate that patient information must not be disclosed to third parties, including the developers or hosts of LLMs. Consequently, publicly available LLMs, which typically log user interactions for the purpose of model improvement (retraining), are not viable for healthcare applications due to the risk of data exposure and misuse. To improve user acceptance, LLMs should ideally be integrated into healthcare Information Technology (IT) ecosystems that host these models locally or on secure hospital-owned cloud servers (Nazari-Shirkouhi et al., <xref ref-type="bibr" rid="B16">2023</xref>). This approach guarantees that patient data are securely maintained within the digital infrastructure of the hospital, thereby reinforcing the confidentiality and privacy of patient data. However, this is often not feasible due to high costs and infrastructure demands. Furthermore, the best performing general-purpose LLMs are proprietary and therefore cannot be deployed on local servers. Open-source or smaller language models might be considered, but their performance can be inferior to proprietary LLMs (Wu et al., <xref ref-type="bibr" rid="B31">2024</xref>).</p>
<p>It is paramount that if third-party hosts of LLMs are used in healthcare, patient privacy is protected by establishing a secure way of data transmission and guaranteeing that the data are not retained and the model is not retrained with user data. As such, an application programming interface (API) can serve as a secure connection between the hospital and the third-party LLM host by implementing robust encryption protocols. Importantly, healthcare providers must be aware that they should establish strict contractual agreements with third-party LLM hosts to prevent data retention and ensure that user or patient data are not utilized for model retraining.</p></sec>
<sec id="s3">
<title>(2) Consider healthcare-specific model adaptations</title>
<p>General-purpose LLMs still face performance limitations and may not suffice for complex and specialized healthcare tasks without modifications (Mao et al., <xref ref-type="bibr" rid="B14">2023</xref>). Therefore, specific use cases might benefit from integrating medical domain knowledge in the language model. There are two main ways of doing so: by creating a healthcare specific language model or by adapting an existing LLM with medical domain knowledge, either through retraining or by giving it access to a database with specific medical knowledge.</p>
<p>Benefits of creating healthcare-specific models are that they could address challenges such as fairness, transparency, and data-inconsistency and might perform better for very specific medical domain knowledge (He et al., <xref ref-type="bibr" rid="B8">2023</xref>). An additional benefit is that these models are typically smaller in size, leaving the possibility of running these models locally. However, it appears that the development of general-purpose LLMs is advancing more rapidly than that of healthcare-specific models, likely due to broader investment and scalability. By adapting an existing general-purpose LLM with medical domain knowledge, the performance of LLMs within the medical field increases dramatically (Ferber et al., <xref ref-type="bibr" rid="B6">2024</xref>). This can be achieved either by periodically retraining the model with medical domain knowledge, or through Retrieval Augmented Generation (RAG), a technique that integrates an external knowledge database with an LLM through a pre-constructed index (Ng et al., <xref ref-type="bibr" rid="B17">2025</xref>). Comparing both techniques to a human writer: with retraining, the memory of the writer has been expanded, and with RAG, the writer has continuous access to an up-to-date library of information. With RAG, the LLM is combined with a database of specific medical domain knowledge. The LLM draws information from this database when formulating a response, similar to a search engine. This ensures its responses are aligned with the latest medical knowledge while reducing the risk of hallucinations (Zakka et al., <xref ref-type="bibr" rid="B33">2024</xref>). RAG significantly improves the performance of LLMs for healthcare-specific applications. 
For example, when connecting a RAG framework to international oncology guidelines, the LLM&#x00027;s response improved from 57% to 84% in answering questions correctly regarding the management of oncology patients (Ferber et al., <xref ref-type="bibr" rid="B6">2024</xref>). Due to its flexibility, RAG is particularly beneficial in fields where knowledge evolves rapidly, such as medicine.</p></sec>
<sec id="s4">
<title>(3) Consider adjusting hyperparameter settings</title>
<p>Another way of improving an LLM&#x00027;s output is by adjusting its hyperparameters, particularly its temperature setting. The temperature controls the randomness of the generated responses. Higher temperatures generate more variability, while lower temperatures result in more predictable and consistent responses, adhering more closely to the provided prompts (Pugh et al., <xref ref-type="bibr" rid="B22">2024</xref>). Therefore, lower temperatures are generally recommended when consistency is important, whereas higher temperatures might be useful in addressing ambiguity. However, despite the rationale for adjusting temperature settings based on the specific demands of a clinical use case, recent evidence suggests that adjustment of temperature has no significant effect on the consistency of performance for various LLMs across different clinical tasks, possibly rendering this step obsolete in the future (Patel et al., <xref ref-type="bibr" rid="B21">2024</xref>).</p></sec>
<sec id="s5">
<title>(4) Ensure adequate prompt engineering</title>
<p>An LLM&#x00027;s output is highly determined by the quality of the instructions or input to the model (prompt). Prompt engineering refers to the practice of designing and implementing prompts and is considered a new discipline within the field of AI. Advanced prompt engineering techniques improve the quality of the response of the model significantly (Zhang X. et al., <xref ref-type="bibr" rid="B34">2024</xref>). Examples of advanced prompt engineering techniques are Few-Shot prompting and Chain-of-Thought (CoT) prompting. In Few-Shot prompting, the prompt includes a small number of examples to guide the model&#x00027;s understanding of the task. By providing these task-specific examples, the model is able to produce more accurate responses, even in scenarios where it has not been extensively trained. For example, in answering sample exam questions for the United States Medical Licensing Examination, 5-shot prompting improved the performance for the GPT-4 model from 84% to 87% correct (Nori et al., <xref ref-type="bibr" rid="B18">2023</xref>). In CoT prompting, the model is instructed to engage in step-by-step reasoning by breaking down complex questions into smaller steps (Wei et al., <xref ref-type="bibr" rid="B29">2022</xref>). This structured approach helps the model reason through tasks more effectively, improving coherence and accuracy of the outputs. CoT is especially useful for tasks requiring logical progression, making this technique of particular interest in CDS applications (Miao et al., <xref ref-type="bibr" rid="B15">2024</xref>). However, various other prompt optimization approaches exist, reflecting the rapid evolution of this new discipline (Chang et al., <xref ref-type="bibr" rid="B3">2024</xref>).</p>
<p>Currently, healthcare professionals rely heavily on extensive experimentation using LLMs, with a limited theoretical understanding of why a specific phrasing or formulation of a task is more sensible than others. Inadequate prompt engineering in medicine without strict constraints could lead to undesired outputs, such as (erroneous) medical advice. It is therefore vital that prompts in the medical field be created by experts in medical prompt engineering (Chen et al., <xref ref-type="bibr" rid="B4">2024</xref>).</p></sec>
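<p>As an illustration of how Few-Shot and CoT prompting differ from a plain prompt, the following minimal Python sketch assembles a prompt from a task description, optional worked examples, and an optional step-by-step instruction. It is not taken from the cited studies: the names <italic>Example</italic>, <italic>build_prompt</italic>, and <italic>COT_INSTRUCTION</italic>, the instruction wording, and the placeholder questions are all assumptions made for this example, and the prompt layout is only one of many possible formats.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One worked question/answer pair used as a few-shot demonstration."""
    question: str
    answer: str


# Hypothetical instruction wording; not taken from the cited studies.
COT_INSTRUCTION = "Reason step by step before giving the final answer."


def build_prompt(task: str, question: str, examples=(), chain_of_thought=False) -> str:
    """Assemble a plain-text prompt: task, optional examples, optional CoT cue, question."""
    parts = [task.strip()]
    # Few-Shot prompting: prepend a small number of worked examples.
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i}\nQuestion: {ex.question}\nAnswer: {ex.answer}")
    # Chain-of-Thought prompting: ask the model to reason in explicit steps.
    if chain_of_thought:
        parts.append(COT_INSTRUCTION)
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Placeholder content only; real examples would be curated by domain experts.
examples = [
    Example("Placeholder question 1", "Placeholder answer 1"),
    Example("Placeholder question 2", "Placeholder answer 2"),
]
prompt = build_prompt(
    task="Answer the multiple-choice question.",
    question="Placeholder question 3",
    examples=examples,
    chain_of_thought=True,
)
print(prompt)
```
</preformat>
<p>Calling <italic>build_prompt</italic> without examples and with <italic>chain_of_thought</italic> left at its default yields a plain zero-shot prompt, which makes it straightforward to compare the variants on the same set of questions.</p>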
<sec id="s6">
<title>(5) Distinguish between CDS and non-CDS applications</title>
<p>Due to regulatory oversight that warrants safe use of innovations in healthcare, such as the Medical Device Regulation (MDR) in the European Union (EU) and the Food and Drug Administration (FDA) in the United States, it is important to differentiate between Clinical Decision Support (CDS) and non-CDS for the specific applications of LLMs. This differentiation strongly indicates whether the application is considered to be a medical device, and thus would fall under these specific regulations. CDS is generally understood to be any tool that assists clinicians in diagnostics or treatment decisions, and when it is used to inform clinical decisions that directly impact patient care, it is considered a medical device and would fall under these laws. In contrast, software that only provides supplementary information without driving clinical decisions is not considered clinical decision support (non-CDS) and thus may not be classified as a medical device.</p>
<p>Consequently, an LLM that supports diagnostic or treatment processes would be classified as a medical device under, for example, the MDR. This prohibits the use of the tool until it has undergone a thorough assessment to ensure that it meets MDR-related quality standards, such as providing clinical evidence of its safety and effectiveness. This process may be time-consuming, possibly limiting the adoption of LLMs for CDS in healthcare.</p>
<p>Unlike traditional medical devices or AI-models, LLMs are inherently multi-purpose, capable of addressing diverse clinical and non-clinical queries. Subjecting LLMs to regulatory approval for each specific clinical purpose is impractical due to the immense effort and cost required. Their rapid evolution, with frequent updates in data, methods, and architectures, further complicates regulation. Regulatory sandboxes offer a supervised setting to explore regulatory requirements and evaluate LLM performance iteratively, providing a flexible pathway to address these challenges.</p>
<sec>
<title>CDS applications</title>
<p>The use of LLMs for CDS seems very promising. When presented with United States Medical Licensing Examination (USMLE) sample exam questions, the GPT-4 model correctly answered 87% without any healthcare-specific adaptations (Nori et al., <xref ref-type="bibr" rid="B18">2023</xref>). Additionally, on various publicly available benchmark datasets, such as the MedQA and the medical components of the Massive Multitask Language Understanding (MMLU), the GPT-4 model performed outstandingly well, answering over 80% correct for each benchmark (Nori et al., <xref ref-type="bibr" rid="B18">2023</xref>). This indicates that general medical knowledge is inherently present in these models. Fewer studies have researched the capabilities of LLMs for specialized medical knowledge within clinical subdomains. For example, in a recent study, the GPT-4 model was able to correctly answer nephrology questions with a score of 73%, without healthcare-specific adaptations or advanced prompt engineering techniques, indicating its potential for highly specialized fields (Wu et al., <xref ref-type="bibr" rid="B31">2024</xref>). When compared to human physicians, the performance of the GPT-4 model exhibited variation across medical specialties, although the model consistently met or exceeded the examination threshold in the majority of cases (Katz et al., <xref ref-type="bibr" rid="B10">2024</xref>).</p>
<p>When implemented into healthcare, CDS will most likely require the use of healthcare-specific model adaptations, utilizing techniques such as RAG, to improve the accuracy of responses. Relevant references should be linked so that the source of the information can be checked.</p>
<p>However, the progression toward CDS necessitates more than the mere capability to answer clinical questions, as clinical decision-making encompasses a combination of medical knowledge, clinical reasoning, multidisciplinary collaboration, evidence-based practice and communication skills. Current advancements in LLMs, aimed at improving logical reasoning, bring the use of LLMs for CDS closer to fruition. Nevertheless, due to the potential significant impact on clinical decision-making, implementing LLMs for CDS demands tremendous diligence.</p>
</sec>
<sec>
<title>Non-CDS applications</title>
<p>The majority of non-CDS applications aim to reduce the administrative load for healthcare providers. Various examples are currently being implemented, such as composing draft responses to patient messages and creating summaries of the patient chart (Schoonbeek et al., <xref ref-type="bibr" rid="B25">2024</xref>; van Veen et al., <xref ref-type="bibr" rid="B28">2023</xref>; Garcia et al., <xref ref-type="bibr" rid="B7">2024</xref>; Tai-Seale et al., <xref ref-type="bibr" rid="B26">2024</xref>). If a use case is not considered CDS, there are currently no laws or guidelines in place to ensure responsible and safe use of LLMs. Given the swift development and adoption of LLMs in society, it is likely that additional non-CDS applications of LLMs are coming to healthcare rapidly.</p>
<p>While legal frameworks such as the EU AI Act, GDPR, and HIPAA establish important baseline requirements for data protection and accountability, they do not address the unique challenges posed by LLMs in clinical settings. For example, they lack requirements for clinical validation, i.e., objectively assessing whether outputs are sufficient for clinical use while accounting for risks like hallucinations, missing information and misinterpretations. These challenges underscore the need for healthcare-specific validation processes to complement existing legal frameworks.</p></sec></sec>
<sec id="s7">
<title>(6) Evaluate using a structured approach</title>
<p>To ensure the responses of the LLM remain accurate, consistent, and aligned with clinical standards over time, a structured approach to evaluate their responses is essential. As LLMs are probabilistic by nature, their performance can vary, making continuous and systematic evaluation critical for maintaining quality and preventing errors, especially in high-stakes environments such as healthcare. Abbasian et al. proposed an extensive set of intrinsic and extrinsic evaluation metrics for assessing the performance of healthcare chatbots, including evaluating the quality of their response (Abbasian et al., <xref ref-type="bibr" rid="B1">2024</xref>). However, their comprehensiveness limits their practicality in clinical settings. To balance comprehensiveness and simplicity, we have identified five key points that should be addressed when evaluating the response of an LLM in clinical settings: the response should be accurate, consistent, semantically unaltered, traceable, and ethical. The mnemonic &#x0201C;ACUTE&#x0201D; (<xref ref-type="fig" rid="F1">Figure 1B</xref>) could be used as a helpful tool.</p>
<p>Accuracy encompasses three domains. The first is substantive accuracy, meaning that responses are factually correct and contextually appropriate within the medical field, even for non-clinical decision support (non-CDS) applications. When determining if a response is substantively accurate, it is important to determine if the response is complete (i.e., determine if there is any information missing) and correct (i.e., determine if there are any factual errors). The second domain is linguistic accuracy, particularly for languages other than English. As foundational models are predominantly trained on English data, responses may exhibit reduced accuracy in other languages. Linguistic accuracy should be tested rigorously by adjusting the prompt. Frequently, writing the prompt in English and asking the LLM to translate yields better results. The third domain is local accuracy, which means ensuring that the responses reflect each hospital&#x00027;s own policies and communication preferences.</p>
<p>When deployed in clinical practice, LLM responses need to be reproducible and stable over time, ensuring reliability in their outputs. As such, consistency is another key criterion. If the LLM provides inconsistent results, try adjusting the temperature settings or the prompt. If the inconsistency remains, try a different LLM for this specific clinical task.</p>
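<p>One simple way to put a number on consistency is to send the same prompt several times and compare the responses pairwise. The Python sketch below does this with the standard-library <italic>difflib.SequenceMatcher</italic>. It is an illustration under stated assumptions rather than a validated metric: the function names, the example responses, and the threshold of 0.8 are hypothetical, and the score measures surface-level text overlap only, so two responses with the same clinical meaning but different wording would score low.</p>
<preformat>
```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def consistency_score(responses) -> float:
    """Mean pairwise text similarity (0.0 to 1.0) between repeated responses to one prompt."""
    cleaned = [normalize(r) for r in responses]
    pairs = list(combinations(cleaned, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent with itself
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Hypothetical threshold; a real value would have to be validated per use case.
THRESHOLD = 0.8

# Made-up responses, used only to exercise the function.
identical = ["No known allergies.", "no known  allergies."]
divergent = ["No known allergies.", "Allergic to penicillin; see note of 12 May."]

print(consistency_score(identical) >= THRESHOLD)
print(consistency_score(divergent) >= THRESHOLD)
```
</preformat>
<p>In practice, a semantic comparison or a clinician review of the flagged cases would be needed on top of such a lexical score.</p>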
<p>The responses should also be semantically Unaltered. The response of the LLM should accurately reflect the information presented in the patient chart without introducing extraneous content (hallucinations). Furthermore, the responses should be Traceable, making it clear where the LLM obtained its information, ideally by providing a reference to the source. For example, when utilizing RAG, the source of the information should be cited, and when summarizing notes in the patient chart, after each claim, the original note should be linked.</p>
<p>Lastly, the Ethical dimension mandates the responsible and fair use of LLMs in clinical practice and aims to ensure that LLM responses do not perpetuate biases or harmful stereotypes. LLMs are typically trained on large datasets that include publicly available text, which often contains inherent biases reflective of societal inequalities. Studies have shown that these biases can persist in LLM outputs, leading to disparities in diagnosis and treatment across different demographic groups. The Benchmark of Clinical Bias in Large Language Models (CLIMB benchmark) highlights how LLMs may exhibit these biases, resulting in unequal diagnostic accuracy across populations (Zhang Y. et al., <xref ref-type="bibr" rid="B35">2024</xref>). Similarly, another study found that LLMs could reinforce harmful stereotypes, such as underdiagnosing conditions like smoking in young males and obesity in middle-aged females (Pal et al., <xref ref-type="bibr" rid="B20">2023</xref>). This emphasizes the need for careful oversight to prevent biased decision-making in clinical practice. Ideally, each new use case should be clinically tested against its gold standard, which is generally the performance of the clinician.</p>
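<p>To show how the five ACUTE criteria could be recorded per response, the following Python sketch models a single review as a small data structure. It is a hypothetical illustration, not part of the framework itself: the class and method names, the short criterion labels, and in particular the rule that a response is acceptable only when all five criteria have been assessed and none has failed are assumptions made for this example.</p>
<preformat>
```python
from dataclasses import dataclass, field

# Short labels for the five ACUTE criteria (labels chosen for this sketch).
ACUTE_CRITERIA = ("Accuracy", "Consistency", "Unaltered", "Traceable", "Ethical")


@dataclass
class AcuteReview:
    """A reviewer's pass/fail judgment of one LLM response on each ACUTE criterion."""
    response_id: str
    judgments: dict = field(default_factory=dict)

    def record(self, criterion: str, passed: bool, note: str = "") -> None:
        """Store the judgment for one criterion, rejecting unknown labels."""
        if criterion not in ACUTE_CRITERIA:
            raise ValueError(f"Unknown criterion: {criterion}")
        self.judgments[criterion] = (passed, note)

    def missing(self) -> list:
        """Criteria the reviewer has not yet assessed."""
        return [c for c in ACUTE_CRITERIA if c not in self.judgments]

    def failed(self) -> list:
        """Criteria that were assessed and did not pass."""
        return [c for c, (ok, _) in self.judgments.items() if not ok]

    def acceptable(self) -> bool:
        """Assumed rule: every criterion assessed and none failed."""
        return not self.missing() and not self.failed()


# Placeholder review of a hypothetical response identifier.
review = AcuteReview("summary-001")
for criterion in ACUTE_CRITERIA[:4]:
    review.record(criterion, True)
review.record("Ethical", False, "placeholder note")
print(review.failed(), review.acceptable())
```
</preformat>
<p>Collecting such records for LLM-generated and clinician-written outputs alike would allow the two to be compared criterion by criterion.</p>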
<p>In contrast to existing frameworks that provide broad, cross-sectoral guidelines, the ACUTE framework offers a specialized and practical approach tailored to the unique requirements of healthcare, focusing specifically on evaluating LLM outputs for clinical relevance and patient safety. We believe that using the ACUTE mnemonic as a structured approach balances simplicity and comprehensiveness for the evaluation of LLM responses and remains practical for real-world clinical use while still adequately addressing key challenges in LLM evaluation and deployment. Comparative analyses utilizing the ACUTE framework should be performed between LLM-generated outputs and clinician outputs for clinical validation.</p></sec>
<sec id="s8">
<title>(7) Implement a model governance structure</title>
<p>Ultimately, it is crucial to ensure high quality performance and output of the LLM over time, and therefore a system for regular monitoring and continuous evaluation should be in place. This is of particular importance, as an LLM&#x00027;s performance can change when the model is retrained or updated to a new version. Thus, establishing a governance framework to monitor the LLM&#x00027;s performance over time and implement adaptive maintenance strategies is crucial. In addition to model governance, robust data governance is essential, ensuring transparent data management and controlled access. Governance principles for both data and models should be traceable, securely stored, and readily accessible to notified bodies and competent authorities to support regulatory compliance. A dedicated team comprising medical and AI experts should be established to collect and evaluate user feedback, interpret model quality and outputs, and implement appropriate actions accordingly. Adaptive maintenance strategies could include periodic audits of LLM outputs and robust fallback mechanisms, such as maintaining access to legacy versions and options for model switching. By incorporating these measures, the governance structure will remain robust and futureproof, safeguarding both safety and reliability over time. The ACUTE framework mentioned in step 6 could offer such guidance.</p></sec>
<sec id="s9">
<title>Connect all the steps</title>
<p>To move toward the safe and responsible development and implementation of LLMs in both administrative tasks and clinical decision support in healthcare, connecting all the steps is essential. For example, combining different prompt engineering techniques with healthcare-specific model adaptations such as RAG significantly improves the overall performance of an LLM on medical board examinations, which highlights the importance of considering all the steps outlined in this manuscript (Samaan et al., <xref ref-type="bibr" rid="B24">2024</xref>). As a practical aid, we have transformed the recommendations into &#x0201C;critical questions&#x0201D; in <xref ref-type="table" rid="T1">Table 1</xref>, and the ACUTE framework into a checklist in <xref ref-type="table" rid="T2">Table 2</xref>. The critical questions are designed to assess readiness for responsible LLM implementation in healthcare. If they cannot be answered adequately, a significant gap remains that must be addressed before LLMs are used in healthcare. The ACUTE checklist helps to systematically evaluate the performance of an LLM application and to highlight potential weaknesses.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Critical questions to guide responsible LLM implementation in healthcare with actionable steps.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#8f9496;color:#ffffff">
<th valign="top" align="left"><bold>Recommendation</bold></th>
<th valign="top" align="left"><bold>Critical questions</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1. Protect patient privacy</td>
<td valign="top" align="left">How is patient data securely transmitted and stored?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Are third-party agreements in place to prevent data retention or model retraining?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Are the LLMs hosted on secure, hospital-controlled infrastructure?</td>
</tr> <tr>
<td valign="top" align="left">2. Consider healthcare-specific model adaptations</td>
<td valign="top" align="left">Is medical domain knowledge paramount to the specific use case for which an LLM is deployed?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Has the LLM been adapted or validated for the specific healthcare tasks it will perform? If so, how?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Does the application use retrieval-augmented generation (RAG) to integrate up-to-date medical knowledge?</td>
</tr> <tr>
<td valign="top" align="left">3. Consider adjusting hyperparameter settings</td>
<td valign="top" align="left">Have hyperparameters, such as temperature, been adjusted to align with the specific clinical use case?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Has the impact of hyperparameter adjustments been adequately evaluated?</td>
</tr> <tr>
<td valign="top" align="left">4. Ensure adequate prompt engineering</td>
<td valign="top" align="left">Who is responsible for writing and maintaining the prompts?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Have medical professionals been involved in designing and testing the prompts?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Have the prompts been tested and refined in an iterative manner to minimize errors and undesired outputs?</td>
</tr> <tr>
<td valign="top" align="left">5. Distinguish between CDS and non-CDS applications</td>
<td valign="top" align="left">Is the application clearly categorized as either Clinical Decision Support (CDS) or non-CDS?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">For CDS applications, does the LLM comply with potentially relevant medical device regulations (e.g., EU MDR, FDA requirements)?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">For non-CDS applications, are barriers in place to prevent unintended use as a medical device?</td>
</tr> <tr>
<td valign="top" align="left">6. Evaluate using a structured approach</td>
<td valign="top" align="left">Are LLM outputs evaluated using a structured framework, such as the ACUTE criteria (Accuracy, Consistency, Unaltered meaning, Traceability, Ethical considerations)?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Is there a process for documenting evaluation results and using them to guide improvements?</td>
</tr> <tr>
<td valign="top" align="left">7. Implement a model governance structure</td>
<td valign="top" align="left">Is there a dedicated team in place to monitor and oversee LLM performance over time?</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Are evaluations performed regularly to ensure ongoing alignment with clinical standards?</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Are fallback mechanisms established to ensure continuity?</td>
</tr></tbody>
</table>
</table-wrap>
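<p>The fallback question under recommendation 7 in <xref ref-type="table" rid="T1">Table 1</xref> can be made concrete with a small routing layer. The Python sketch below is illustrative only and is not part of any validated system: the function name <monospace>generate_with_fallback</monospace>, the model labels, and the two stand-in model functions are hypothetical placeholders for whatever interfaces a hospital actually uses. It assumes that each model version can be called as a function that takes a prompt and returns text.</p>
<preformat>
```python
def generate_with_fallback(prompt, models):
    """Return (label, output) from the first model that answers.

    `models` is an ordered list of (label, callable) pairs, for example the
    current version first and a pinned legacy version second. Both the
    labels and the callables are hypothetical placeholders.
    """
    failures = []
    for label, call_model in models:
        try:
            output = call_model(prompt)
        except Exception as exc:  # a real deployment would catch narrower errors
            failures.append(f"{label}: {exc}")
            continue
        # Returning the label lets the audit trail show which version responded.
        return label, output
    raise RuntimeError("all configured models failed: " + "; ".join(failures))


if __name__ == "__main__":
    def current_model(prompt):
        raise TimeoutError("service unavailable")  # simulated outage

    def legacy_model(prompt):
        return "draft reply for: " + prompt

    print(generate_with_fallback(
        "summarise the discharge letter",
        [("current", current_model), ("legacy", legacy_model)],
    ))
```
</preformat>
<p>Recording which version produced each output is a design choice that supports the traceability expected of a governance structure; whether automatic switching is appropriate for a given use case remains a decision for the governance team.</p>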
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Checklist for the ACUTE framework, designed to evaluate LLM outputs in healthcare and ensure that each criterion is addressed effectively to minimize risks and enhance reliability.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#8f9496;color:#ffffff">
<th valign="top" align="left"><bold>Dimension</bold></th>
<th valign="top" align="left"><bold>Criteria</bold></th>
<th valign="top" align="left"><bold>Focus</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Accuracy</td>
<td valign="top" align="left">Are responses factually correct and complete?</td>
<td valign="top" align="left">Substantive accuracy</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Are responses grammatically correct and clear, even in non-English languages?</td>
<td valign="top" align="left">Linguistic accuracy</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Do responses align with hospital policies and preferences?</td>
<td valign="top" align="left">Local accuracy</td>
</tr> <tr>
<td valign="top" align="left">Consistency</td>
<td valign="top" align="left">Are responses consistent across repeated prompts?</td>
<td valign="top" align="left">Reproducibility</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Are responses stable across different sessions and model versions?</td>
<td valign="top" align="left">Stability over time</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Are inconsistencies addressed effectively through prompt refinements?</td>
<td valign="top" align="left">Mitigation of inconsistencies</td>
</tr> <tr>
<td valign="top" align="left">Unaltered</td>
<td valign="top" align="left">Do responses avoid adding erroneous or fabricated information?</td>
<td valign="top" align="left">Hallucinations</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">Do responses accurately reflect the input data, such as patient charts?</td>
<td valign="top" align="left">Reflection of source data</td>
</tr> <tr>
<td valign="top" align="left">Traceability</td>
<td valign="top" align="left">If applicable, are claims and recommendations clearly linked to credible sources?</td>
<td valign="top" align="left">Source attribution</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">If applicable, are external references provided when RAG or other systems are used?</td>
<td valign="top" align="left">Use of retrieval systems</td>
</tr> <tr>
<td valign="top" align="left">Ethical</td>
<td valign="top" align="left">Do responses avoid perpetuating harmful biases or stereotypes?</td>
<td valign="top" align="left">Bias avoidance</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Are sensitive topics handled responsibly and respectfully?</td>
<td valign="top" align="left">Sensitive topics</td>
</tr></tbody>
</table>
</table-wrap>
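<p>As a rough illustration of the reproducibility criterion under Consistency in <xref ref-type="table" rid="T2">Table 2</xref>, the Python sketch below scores how similar repeated responses to the same prompt are. It is a minimal sketch under stated assumptions, not a validated metric: the function names and the review threshold of 0.8 are hypothetical, and character-level similarity from the standard-library <monospace>difflib</monospace> module is only a crude lexical proxy that cannot establish clinical equivalence.</p>
<preformat>
```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def consistency_score(responses):
    """Mean pairwise lexical similarity (0.0 to 1.0) of repeated responses."""
    if len(responses) == 0:
        raise ValueError("no responses to compare")
    if len(responses) == 1:
        return 1.0
    # Normalise case and whitespace so trivial differences are ignored.
    cleaned = [" ".join(text.lower().split()) for text in responses]
    pairs = combinations(cleaned, 2)
    return mean(
        SequenceMatcher(None, first, second, autojunk=False).ratio()
        for first, second in pairs
    )


def flag_for_review(responses, threshold=0.8):
    """True when agreement falls below the threshold (an assumed value)."""
    return threshold > consistency_score(responses)


if __name__ == "__main__":
    # Hypothetical outputs collected by sending one prompt three times.
    repeated = [
        "Start metoprolol 25 mg twice daily.",
        "Start metoprolol 25 mg twice daily.",
        "Consider starting a beta blocker.",
    ]
    print(consistency_score(repeated), flag_for_review(repeated))
```
</preformat>
<p>Responses flagged in this way would still need review by a clinician, since two differently worded answers can be clinically equivalent and two similar-looking answers can differ in a critical detail such as a dose.</p>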
<p>Ultimately, we must bridge the gap between technological AI model development and trustworthy, responsible AI adoption in clinical settings. Despite the growing use of LLMs, healthcare organizations and providers still lack clear, actionable guidelines to ensure their responsible and safe implementation. A step-by-step approach combined with a practical evaluation framework could address this gap. By balancing simplicity with comprehensiveness, these recommendations could lower AI hesitancy, improve clinical implementation, and help unlock the full potential of AI in improving healthcare. Future research should validate the proposed framework across diverse clinical scenarios. Advancing the responsible implementation of LLMs in healthcare will require a collective effort from healthcare organizations, providers, researchers, and policymakers to ensure robust validation, responsible use, and adequate monitoring of LLMs in clinical practice. The recommendations outlined in this manuscript provide a practical starting point for this collaborative effort.</p></sec>
</body>
<back>
<sec sec-type="data-availability" id="s10">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec sec-type="author-contributions" id="s11">
<title>Author contributions</title>
<p>JW: Conceptualization, Investigation, Visualization, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. DS: Conceptualization, Writing &#x02013; original draft. DG: Supervision, Writing &#x02013; review &#x00026; editing. MG: Conceptualization, Supervision, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="funding-information" id="s12">
<title>Funding</title>
<p>The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s13">
<title>Generative AI statement</title>
<p>The author(s) declare that Gen AI was used in the creation of this manuscript. Paperpal (version 3.209.2, source: <email>paperpal.com</email>) was used solely to enhance the language.</p></sec>
<sec sec-type="disclaimer" id="s14">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abbasian</surname> <given-names>M.</given-names></name> <name><surname>Khatibi</surname> <given-names>E.</given-names></name> <name><surname>Azimi</surname> <given-names>I.</given-names></name> <name><surname>Oniani</surname> <given-names>D.</given-names></name> <name><surname>Shakeri Hossein Abad</surname> <given-names>Z.</given-names></name> <name><surname>Thieme</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI</article-title>. <source>NPJ Digit. Med</source>. <volume>7</volume>:<fpage>82</fpage>. <pub-id pub-id-type="doi">10.1038/s41746-024-01074-z</pub-id><pub-id pub-id-type="pmid">38553625</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ayers</surname> <given-names>J. W.</given-names></name> <name><surname>Poliak</surname> <given-names>A.</given-names></name> <name><surname>Dredze</surname> <given-names>M.</given-names></name> <name><surname>Leas</surname> <given-names>E. C.</given-names></name> <name><surname>Zhu</surname> <given-names>Z.</given-names></name> <name><surname>Kelley</surname> <given-names>J. B.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum</article-title>. <source>JAMA Intern. Med</source>. <volume>183</volume>, <fpage>589</fpage>&#x02013;<lpage>596</lpage>. <pub-id pub-id-type="doi">10.1001/jamainternmed.2023.1838</pub-id><pub-id pub-id-type="pmid">37115527</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>K.</given-names></name> <name><surname>Xu</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Luo</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Xiao</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Efficient prompting methods for large language models: a survey</article-title>. <source>arXiv [Preprint].</source> arXiv:2404.01077. <pub-id pub-id-type="doi">10.48550/arXiv.2404.01077</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>S.</given-names></name> <name><surname>Guevara</surname> <given-names>M.</given-names></name> <name><surname>Moningi</surname> <given-names>S.</given-names></name> <name><surname>Hoebers</surname> <given-names>F.</given-names></name> <name><surname>Elhalawani</surname> <given-names>H.</given-names></name> <name><surname>Kann</surname> <given-names>B. H.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>The effect of using a large language model to respond to patient messages</article-title>. <source>Lancet Digital Health</source>. <volume>6</volume>, <fpage>e379</fpage>&#x02013;<lpage>e381</lpage>. <pub-id pub-id-type="doi">10.1016/S2589-7500(24)00060-8</pub-id><pub-id pub-id-type="pmid">38664108</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eriksen</surname> <given-names>A. V.</given-names></name> <name><surname>M&#x000F6;ller</surname> <given-names>S.</given-names></name> <name><surname>Ryg</surname> <given-names>J.</given-names></name></person-group> (<year>2023</year>). <article-title>Use of GPT-4 to diagnose complex clinical cases</article-title>. <source>NEJM AI</source>. <volume>1</volume>, <fpage>2023</fpage>&#x02013;<lpage>2025</lpage>. <pub-id pub-id-type="doi">10.1056/AIp2300031</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ferber</surname> <given-names>D.</given-names></name> <name><surname>Wiest</surname> <given-names>I. C.</given-names></name> <name><surname>W&#x000F6;lflein</surname> <given-names>G.</given-names></name> <name><surname>Ebert</surname> <given-names>M. P.</given-names></name> <name><surname>Beutel</surname> <given-names>G.</given-names></name> <name><surname>Eckardt</surname> <given-names>J. N.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>GPT-4 for information retrieval and comparison of medical oncology guidelines</article-title>. <source>NEJM AI</source> <volume>1</volume>:<fpage>235</fpage>. <pub-id pub-id-type="doi">10.1056/AIcs2300235</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garcia</surname> <given-names>P.</given-names></name> <name><surname>Ma</surname> <given-names>S. P.</given-names></name> <name><surname>Shah</surname> <given-names>S.</given-names></name> <name><surname>Smith</surname> <given-names>M.</given-names></name> <name><surname>Jeong</surname> <given-names>Y.</given-names></name> <name><surname>Devon-Sand</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Artificial intelligence&#x02013;generated draft replies to patient inbox messages</article-title>. <source>JAMA Netw. Open</source> <volume>7</volume>:<fpage>e243201</fpage>. <pub-id pub-id-type="doi">10.1001/jamanetworkopen.2024.3201</pub-id><pub-id pub-id-type="pmid">38506805</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Mao</surname> <given-names>R.</given-names></name> <name><surname>Lin</surname> <given-names>Q.</given-names></name> <name><surname>Ruan</surname> <given-names>Y.</given-names></name> <name><surname>Lan</surname> <given-names>X.</given-names></name> <name><surname>Feng</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.2139/ssrn.4809363</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jackups</surname> <given-names>R.</given-names></name></person-group> (<year>2023</year>). <article-title>FDA regulation of laboratory clinical decision support software: is it a medical device?</article-title> <source>Clin. Chem</source>. <volume>69</volume>, <fpage>327</fpage>&#x02013;<lpage>329</lpage>. <pub-id pub-id-type="doi">10.1093/clinchem/hvad011</pub-id><pub-id pub-id-type="pmid">36806588</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Katz</surname> <given-names>U.</given-names></name> <name><surname>Cohen</surname> <given-names>E.</given-names></name> <name><surname>Shachar</surname> <given-names>E.</given-names></name> <name><surname>Somer</surname> <given-names>J.</given-names></name> <name><surname>Fink</surname> <given-names>A.</given-names></name> <name><surname>Morse</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>GPT versus resident physicians &#x02014; a benchmark based on official board scores</article-title>. <source>NEJM AI</source> <volume>1</volume>:<fpage>192</fpage>. <pub-id pub-id-type="doi">10.1056/AIdbp2300192</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keutzer</surname> <given-names>L.</given-names></name> <name><surname>Simonsson</surname> <given-names>U. S. H.</given-names></name></person-group> (<year>2020</year>). <article-title>Medical device apps: an introduction to regulatory affairs for developers</article-title>. <source>JMIR Mhealth Uhealth</source>. <volume>8</volume>:<fpage>e17567</fpage>. <pub-id pub-id-type="doi">10.2196/17567</pub-id><pub-id pub-id-type="pmid">32589154</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Levra</surname> <given-names>A. G.</given-names></name> <name><surname>Gatti</surname> <given-names>M.</given-names></name> <name><surname>Mene</surname> <given-names>R.</given-names></name> <name><surname>Shiffer</surname> <given-names>D.</given-names></name> <name><surname>Costantino</surname> <given-names>G.</given-names></name> <name><surname>Solbiati</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>A large language model-based clinical decision support system for syncope recognition in the emergency department: a framework for clinical workflow integration</article-title>. <source>Eur. J. Intern. Med.</source> <volume>131</volume>, <fpage>113</fpage>&#x02013;<lpage>120</lpage>. <pub-id pub-id-type="doi">10.1016/j.ejim.2024.09.017</pub-id><pub-id pub-id-type="pmid">39341748</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Luo</surname> <given-names>M.</given-names></name> <name><surname>Warren</surname> <given-names>C. J.</given-names></name> <name><surname>Cheng</surname> <given-names>L.</given-names></name> <name><surname>Abdul-Muhsin</surname> <given-names>H. M.</given-names></name> <name><surname>Banerjee</surname> <given-names>I.</given-names></name></person-group> (<year>2024</year>). <article-title>Assessing empathy in large language models with real-world physician-patient interactions</article-title>. <source>arXiv [Preprint].</source> arXiv:2405.16402. <pub-id pub-id-type="doi">10.48550/arXiv.2405.16402</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Mao</surname> <given-names>R.</given-names></name> <name><surname>Chen</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Guerin</surname> <given-names>F.</given-names></name> <name><surname>Cambria</surname> <given-names>E.</given-names></name></person-group> (<year>2023</year>). <source>GPTEval: A Survey on Assessments of ChatGPT and GPT-4</source>. Available at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2308.12488">http://arxiv.org/abs/2308.12488</ext-link> (accessed 23 August 2023).</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miao</surname> <given-names>J.</given-names></name> <name><surname>Thongprayoon</surname> <given-names>C.</given-names></name> <name><surname>Suppadungsuk</surname> <given-names>S.</given-names></name> <name><surname>Krisanapan</surname> <given-names>P.</given-names></name> <name><surname>Radhakrishnan</surname> <given-names>Y.</given-names></name> <name><surname>Cheungpasitporn</surname> <given-names>W.</given-names></name></person-group> (<year>2024</year>). <article-title>Chain of thought utilization in large language models and application in nephrology</article-title>. <source>Medicina (Lithuania)</source> <volume>60</volume>:<fpage>148</fpage>. <pub-id pub-id-type="doi">10.3390/medicina60010148</pub-id><pub-id pub-id-type="pmid">38256408</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nazari-Shirkouhi</surname> <given-names>S.</given-names></name> <name><surname>Badizadeh</surname> <given-names>A.</given-names></name> <name><surname>Dashtpeyma</surname> <given-names>M.</given-names></name> <name><surname>Ghodsi</surname> <given-names>R.</given-names></name></person-group> (<year>2023</year>). <article-title>A model to improve user acceptance of e-services in healthcare systems based on technology acceptance model: an empirical study</article-title>. <source>J. Ambient Intell. Humaniz. Comput.</source> <volume>14</volume>, <fpage>7919</fpage>&#x02013;<lpage>7935</lpage>. <pub-id pub-id-type="doi">10.1007/s12652-023-04601-0</pub-id><pub-id pub-id-type="pmid">37228695</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ng</surname> <given-names>K. K. Y.</given-names></name> <name><surname>Matsuba</surname> <given-names>I.</given-names></name> <name><surname>Zhang</surname> <given-names>P. C.</given-names></name></person-group> (<year>2025</year>). <article-title>RAG in health care: a novel framework for improving communication and decision-making by addressing LLM limitations</article-title>. <source>NEJM AI</source> <volume>2</volume>:<fpage>380</fpage>. <pub-id pub-id-type="doi">10.1056/AIra2400380</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Nori</surname> <given-names>H.</given-names></name> <name><surname>King</surname> <given-names>N.</given-names></name> <name><surname>McKinney</surname> <given-names>S. M.</given-names></name> <name><surname>Carignan</surname> <given-names>D.</given-names></name> <name><surname>Horvitz</surname> <given-names>E.</given-names></name></person-group> (<year>2023</year>). <source>Capabilities of GPT-4 on Medical Challenge Problems</source>. Available at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2303.13375v2">https://arxiv.org/abs/2303.13375v2</ext-link> (accessed 28 September 2024).</citation>
</ref>
<ref id="B19">
<citation citation-type="web"><person-group person-group-type="author"><collab>OpenAI</collab> <name><surname>Achiam</surname> <given-names>J.</given-names></name> <name><surname>Adler</surname> <given-names>S.</given-names></name> <name><surname>Agarwal</surname> <given-names>S.</given-names></name> <name><surname>Ahmad</surname> <given-names>L.</given-names></name> <name><surname>Akkaya</surname> <given-names>I.</given-names></name> <etal/></person-group>. (<year>2023</year>). <source>GPT-4 Technical Report</source>. Available at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2303.08774">http://arxiv.org/abs/2303.08774</ext-link> (accessed 15 March 2023).</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pal</surname> <given-names>R.</given-names></name> <name><surname>Garg</surname> <given-names>H.</given-names></name> <name><surname>Patel</surname> <given-names>S.</given-names></name> <name><surname>Sethi</surname> <given-names>T.</given-names></name></person-group> (<year>2023</year>). <article-title>Bias amplification in intersectional subpopulations for clinical phenotyping by large language models</article-title>. <source>medRxiv [Preprint]</source>. <pub-id pub-id-type="doi">10.1101/2023.03.22.23287585</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Patel</surname> <given-names>D.</given-names></name> <name><surname>Timsina</surname> <given-names>P.</given-names></name> <name><surname>Raut</surname> <given-names>G.</given-names></name> <name><surname>Freeman</surname> <given-names>R.</given-names></name> <name><surname>Levin</surname> <given-names>M. A.</given-names></name> <name><surname>Nadkarni</surname> <given-names>G. N.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Exploring temperature effects on large language models across various clinical tasks</article-title>. <source>medRxiv [Preprint]</source>. <pub-id pub-id-type="doi">10.1101/2024.07.22.24310824</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pugh</surname> <given-names>S. L.</given-names></name> <name><surname>Chandler</surname> <given-names>C.</given-names></name> <name><surname>Cohen</surname> <given-names>A. S.</given-names></name> <name><surname>Diaz-Asper</surname> <given-names>C.</given-names></name> <name><surname>Elvev&#x000E5;g</surname> <given-names>B.</given-names></name> <name><surname>Foltz</surname> <given-names>P. W.</given-names></name></person-group> (<year>2024</year>). <article-title>Assessing dimensions of thought disorder with large language models: the tradeoff of accuracy and consistency</article-title>. <source>Psychiatry Res</source>. <volume>341</volume>:<fpage>116119</fpage>. <pub-id pub-id-type="doi">10.1016/j.psychres.2024.116119</pub-id><pub-id pub-id-type="pmid">39226873</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Raza</surname> <given-names>M. M.</given-names></name> <name><surname>Venkatesh</surname> <given-names>K. P.</given-names></name> <name><surname>Kvedar</surname> <given-names>J. C.</given-names></name></person-group> (<year>2024</year>). <article-title>Generative AI and large language models in health care: pathways to implementation</article-title>. <source>NPJ Digit. Med</source>. <volume>7</volume>:<fpage>62</fpage>. <pub-id pub-id-type="doi">10.1038/s41746-023-00988-4</pub-id><pub-id pub-id-type="pmid">38454007</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Samaan</surname> <given-names>J. S.</given-names></name> <name><surname>Margolis</surname> <given-names>S.</given-names></name> <name><surname>Srinivasan</surname> <given-names>N.</given-names></name> <name><surname>Srinivasan</surname> <given-names>A.</given-names></name> <name><surname>Yeo</surname> <given-names>Y. H.</given-names></name> <name><surname>Anand</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Multimodal large language model passes specialty board examination and surpasses human test-taker scores: a comparative analysis examining the stepwise impact of model prompting strategies on performance</article-title>. <source>medRxiv [Preprint]</source>. <pub-id pub-id-type="doi">10.1101/2024.07.27.24310809</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Schoonbeek</surname> <given-names>R. C.</given-names></name> <name><surname>Workum</surname> <given-names>J. D.</given-names></name> <name><surname>Schuit</surname> <given-names>S. C. E.</given-names></name> <name><surname>Doornberg</surname> <given-names>J. N.</given-names></name> <name><surname>Van Der Laan</surname> <given-names>T. P.</given-names></name> <name><surname>Bootsma-Robroeks</surname> <given-names>C. M. H. H. T.</given-names></name></person-group> (<year>2024</year>). <source>Completeness, Correctness and Conciseness of Physician-written versus Large Language Model Generated Patient Summaries Integrated in Electronic Health Records</source>. SSRN. Available at: <ext-link ext-link-type="uri" xlink:href="https://ssrn.com/abstract=4835935">https://ssrn.com/abstract=4835935</ext-link></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tai-Seale</surname> <given-names>M.</given-names></name> <name><surname>Baxter</surname> <given-names>S. L.</given-names></name> <name><surname>Vaida</surname> <given-names>F.</given-names></name> <name><surname>Walker</surname> <given-names>A.</given-names></name> <name><surname>Sitapati</surname> <given-names>A. M.</given-names></name> <name><surname>Osborne</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>AI-generated draft replies integrated into health records and physicians&#x00027; electronic communication</article-title>. <source>JAMA Netw. Open</source> <volume>7</volume>:<fpage>e246565</fpage>. <pub-id pub-id-type="doi">10.1001/jamanetworkopen.2024.6565</pub-id><pub-id pub-id-type="pmid">38619840</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thirunavukarasu</surname> <given-names>A. J.</given-names></name> <name><surname>Ting</surname> <given-names>D. S. J.</given-names></name> <name><surname>Elangovan</surname> <given-names>K.</given-names></name> <name><surname>Gutierrez</surname> <given-names>L.</given-names></name> <name><surname>Tan</surname> <given-names>T. F.</given-names></name> <name><surname>Ting</surname> <given-names>D. S. W.</given-names></name></person-group> (<year>2023</year>). <article-title>Large language models in medicine</article-title>. <source>Nat. Med.</source> <volume>29</volume>, <fpage>1930</fpage>&#x02013;<lpage>1940</lpage>. <pub-id pub-id-type="doi">10.1038/s41591-023-02448-8</pub-id><pub-id pub-id-type="pmid">37460753</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>van Veen</surname> <given-names>D.</given-names></name> <name><surname>van Uden</surname> <given-names>C.</given-names></name> <name><surname>Blankemeier</surname> <given-names>L.</given-names></name> <name><surname>Delbrouck</surname> <given-names>J.-B.</given-names></name> <name><surname>Aali</surname> <given-names>A.</given-names></name> <name><surname>Bluethgen</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Adapted large language models can outperform medical experts in clinical text summarization</article-title>. <source>Nat. Med.</source> <volume>30</volume>, <fpage>1134</fpage>&#x02013;<lpage>1142</lpage>. <pub-id pub-id-type="doi">10.1038/s41591-024-02855-5</pub-id><pub-id pub-id-type="pmid">38413730</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Schuurmans</surname> <given-names>D.</given-names></name> <name><surname>Bosma</surname> <given-names>M.</given-names></name> <name><surname>Ichter</surname> <given-names>B.</given-names></name> <name><surname>Xia</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>. <source>arXiv [Preprint].</source> arXiv:2201.11903. <pub-id pub-id-type="doi">10.48550/arXiv.2201.11903</pub-id><pub-id pub-id-type="pmid">39637822</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><collab>World Health Organization</collab></person-group> (<year>2024</year>). <source>Ethics and Governance of Artificial Intelligence for Health: Guidance on Large Multi-modal Models</source>. <publisher-loc>Geneva</publisher-loc>: <publisher-name>World Health Organization</publisher-name>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>S.</given-names></name> <name><surname>Koo</surname> <given-names>M.</given-names></name> <name><surname>Blum</surname> <given-names>L.</given-names></name> <name><surname>Black</surname> <given-names>A.</given-names></name> <name><surname>Kao</surname> <given-names>L.</given-names></name> <name><surname>Fei</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology</article-title>. <source>NEJM AI</source> <volume>1</volume>, <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1056/AIdbp2300092</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yao</surname> <given-names>Y.</given-names></name> <name><surname>Duan</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>K.</given-names></name> <name><surname>Cai</surname> <given-names>Y.</given-names></name> <name><surname>Sun</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name></person-group> (<year>2024</year>). <article-title>A survey on large language model (LLM) security and privacy: the good, the bad, and the ugly</article-title>. <source>High-Confid. Comput.</source> <volume>4</volume>:<fpage>100211</fpage>. <pub-id pub-id-type="doi">10.1016/j.hcc.2024.100211</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zakka</surname> <given-names>C.</given-names></name> <name><surname>Shad</surname> <given-names>R.</given-names></name> <name><surname>Chaurasia</surname> <given-names>A.</given-names></name> <name><surname>Dalal</surname> <given-names>A. R.</given-names></name> <name><surname>Kim</surname> <given-names>J. L.</given-names></name> <name><surname>Moor</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Almanac&#x02014;retrieval-augmented language models for clinical medicine</article-title>. <source>NEJM AI</source> <volume>1</volume>:<fpage>68</fpage>. <pub-id pub-id-type="doi">10.1056/AIoa2300068</pub-id><pub-id pub-id-type="pmid">38343631</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Talukdar</surname> <given-names>N.</given-names></name> <name><surname>Vemulapalli</surname> <given-names>S.</given-names></name> <name><surname>Ahn</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Meng</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Comparison of prompt engineering and fine-tuning strategies in large language models in the classification of clinical notes</article-title>. <source>medRxiv [Preprint]</source>. <pub-id pub-id-type="doi">10.1101/2024.02.07.24302444</pub-id><pub-id pub-id-type="pmid">38370673</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Hou</surname> <given-names>S.</given-names></name> <name><surname>Ma</surname> <given-names>M. D.</given-names></name> <name><surname>Wang</surname> <given-names>W.</given-names></name> <name><surname>Chen</surname> <given-names>M.</given-names></name> <name><surname>Zhao</surname> <given-names>J.</given-names></name></person-group> (<year>2024</year>). <source>CLIMB: a benchmark of clinical bias in large language models</source>. Available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/">https://github.com/</ext-link></citation>
</ref>
</ref-list>
</back>
</article> 