<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Digit. Health</journal-id>
<journal-title>Frontiers in Digital Health</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Digit. Health</abbrev-journal-title>
<issn pub-type="epub">2673-253X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdgth.2025.1535168</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Digital Health</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>ChestX-Transcribe: a multimodal transformer for automated radiology report generation from chest x-rays</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author"><name><surname>Singh</surname><given-names>Prateek</given-names></name><uri xlink:href="https://loop.frontiersin.org/people/2907137/overview"/><role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/><role content-type="https://credit.niso.org/contributor-roles/methodology/"/><role content-type="https://credit.niso.org/contributor-roles/software/"/><role content-type="https://credit.niso.org/contributor-roles/visualization/"/><role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/></contrib>
<contrib contrib-type="author" corresp="yes"><name><surname>Singh</surname><given-names>Sudhakar</given-names></name>
<xref ref-type="corresp" rid="cor1">&#x002A;</xref><uri xlink:href="https://loop.frontiersin.org/people/2577769/overview" /><role content-type="https://credit.niso.org/contributor-roles/methodology/"/><role content-type="https://credit.niso.org/contributor-roles/supervision/"/><role content-type="https://credit.niso.org/contributor-roles/visualization/"/><role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/></contrib>
</contrib-group>
<aff><institution>Biomedical Engineering Department, School of Bioengineering and Biosciences, Lovely Professional University</institution>, <addr-line>Punjab</addr-line>, <country>India</country></aff>
<author-notes>
<fn fn-type="edited-by"><p><bold>Edited by:</bold> Adnan Haider, Dongguk University Seoul, Republic of Korea</p></fn>
<fn fn-type="edited-by"><p><bold>Reviewed by:</bold> Francisco Maria Calisto, University of Lisbon, Portugal</p>
<p>Atif Latif, Utah State University, United States</p></fn>
<corresp id="cor1"><label>&#x002A;</label><bold>Correspondence:</bold> Sudhakar Singh <email>sudhakarsingh86@gmail.com</email></corresp>
</author-notes>
<pub-date pub-type="epub"><day>21</day><month>01</month><year>2025</year></pub-date>
<pub-date pub-type="collection"><year>2025</year></pub-date>
<volume>7</volume><elocation-id>1535168</elocation-id>
<history>
<date date-type="received"><day>27</day><month>11</month><year>2024</year></date>
<date date-type="accepted"><day>06</day><month>01</month><year>2025</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 Singh and Singh.</copyright-statement>
<copyright-year>2025</copyright-year><copyright-holder>Singh and Singh</copyright-holder><license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License (CC BY)</ext-link>. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Radiology departments are under increasing pressure to meet the demand for timely and accurate diagnostics, especially with chest x-rays, a key modality for pulmonary condition assessment. Producing comprehensive and accurate radiological reports is a time-consuming process prone to errors, particularly in high-volume clinical environments. Automated report generation plays a crucial role in alleviating radiologists&#x0027; workload, improving diagnostic accuracy, and ensuring consistency. This paper introduces <italic>ChestX-Transcribe</italic>, a multimodal transformer model that combines the Swin Transformer for extracting high-resolution visual features with DistilGPT for generating clinically relevant, semantically rich medical reports. Trained on the Indiana University Chest x-ray dataset, <italic>ChestX-Transcribe</italic> demonstrates state-of-the-art performance across BLEU, ROUGE, and METEOR metrics, outperforming prior models in producing clinically meaningful reports. However, the reliance on the Indiana University dataset introduces potential limitations, including selection bias, as the dataset is collected from specific hospitals within the Indiana Network for Patient Care. This may result in underrepresentation of certain demographics or conditions not prevalent in those healthcare settings, potentially skewing model predictions when applied to more diverse populations or different clinical environments. Additionally, the ethical implications of handling sensitive medical data, including patient privacy and data security, are considered. Despite these challenges, <italic>ChestX-Transcribe</italic> shows promising potential for enhancing real-world radiology workflows by automating the creation of medical reports, reducing diagnostic errors, and improving efficiency. 
The findings highlight the transformative potential of multimodal transformers in healthcare, with future work focusing on improving model generalizability and optimizing clinical integration.</p>
</abstract>
<kwd-group>
<kwd>medical report generation</kwd>
<kwd>multimodal transformers</kwd>
<kwd>swin transformer</kwd>
<kwd>DistilGPT</kwd>
<kwd>vision-language models</kwd>
<kwd>radiology workflow</kwd>
</kwd-group><counts>
<fig-count count="2"/>
<table-count count="3"/><equation-count count="15"/><ref-count count="40"/><page-count count="11"/><word-count count="0"/></counts><custom-meta-wrap><custom-meta><meta-name>section-at-acceptance</meta-name><meta-value>Health Technology Implementation</meta-value></custom-meta></custom-meta-wrap>
</article-meta>
</front>
<body><sec id="s1" sec-type="intro"><title>Introduction</title>
<p>Chest x-rays remain one of the most widely used diagnostic tools in healthcare for assessing pulmonary disorders. However, interpreting these images and generating accurate, detailed reports is a time-consuming and subjective task, particularly in high-volume clinical environments. Automating this process with deep learning holds the potential to streamline diagnostic workflows and reduce the burden on radiologists. A significant challenge in healthcare is the prevalence of diagnostic errors, with studies (<xref ref-type="bibr" rid="B1">1</xref>) indicating that nearly everyone will experience a diagnostic error at least once in their lifetime. Automating the report generation process can mitigate these errors, enhancing diagnostic accuracy and consistency. By relying on automated systems for report preparation, healthcare professionals can ensure more reliable interpretations of chest x-rays. Recent advancements in artificial intelligence (AI) have been driven by transformer architectures, which have revolutionized natural language processing (NLP) and computer vision. Vision transformers, such as Swin Transformer (<xref ref-type="bibr" rid="B2">2</xref>), excel at capturing intricate spatial patterns in medical images, while language models like GPT have shown remarkable ability in generating coherent, contextually accurate text.</p>
<p>However, these technologies have only recently begun to be explored in the context of automated clinical workflows, especially for medical image captioning. In this study, we introduce ChestX-Transcribe, a multimodal sequence-to-sequence transformer model that combines the strengths of DistilGPT (<xref ref-type="bibr" rid="B3">3</xref>) for generating precise radiology reports with the Swin Transformer (<xref ref-type="bibr" rid="B4">4</xref>) for extracting high-resolution visual features from chest x-rays. By seamlessly integrating both vision and language transformers, ChestX-Transcribe offers an innovative approach to automating medical report generation. This model enhances diagnostic workflows by producing reliable, contextually appropriate medical reports, reducing the cognitive load on radiologists, and supporting faster diagnosis.</p>
<p>Key contributions of this study include:
<list list-type="simple">
<list-item><label>&#x2022;</label>
<p><bold>Multimodal Transformer Architecture:</bold> ChestX-Transcribe integrates a pre-trained Swin Transformer for high-resolution visual feature extraction from chest x-rays with DistilGPT, a distilled version of GPT-2, for language generation. This combination enables the model to effectively handle both local and global dependencies in visual data while generating coherent text, offering a robust solution for medical report generation in healthcare.</p></list-item>
<list-item><label>&#x2022;</label>
<p><bold>Efficient and Scalable Model Design:</bold> By leveraging DistilGPT, a smaller and faster variant of GPT-2, we achieve notable improvements in model efficiency without sacrificing the quality of the generated reports. The reduced computational complexity makes the model more scalable for clinical applications where real-time processing and resource efficiency are essential.</p></list-item>
<list-item><label>&#x2022;</label>
<p><bold>Dataset Utilization and Performance Evaluation:</bold> We evaluate the model using the widely recognized Indiana University Chest x-ray dataset (<xref ref-type="bibr" rid="B5">5</xref>), facilitating performance comparisons with existing state-of-the-art methods. Preliminary results demonstrate a marked improvement in BLEU and ROUGE scores, highlighting ChestX-Transcribe&#x0027;s ability to generate clinically relevant text, indicating its potential in real-world medical settings.</p></list-item>
<list-item><label>&#x2022;</label>
<p><bold>Projection Layer for Cross-Modality Fusion:</bold> A key innovation in this work is the introduction of a projection layer that bridges the gap between the high-dimensional visual features from the Swin Transformer and the language model, ensuring smooth integration of image embeddings into the text generation pipeline. This layer significantly enhances the model&#x0027;s ability to correlate image features with accurate medical descriptions, setting our approach apart from existing methods.</p></list-item>
</list></p>
<p>This work contributes novel insights by combining advanced visual and language transformers in a cohesive model for automated medical report generation, showing the potential to improve both the efficiency and accuracy of radiology workflows.</p>
</sec>
<sec id="s2"><title>Literature review</title>
<p>Automated radiological report generation is a technique for characterizing the clinical aspects of chest x-ray images. It is an influential field that combines natural language processing with computer vision. Earlier approaches to report writing relied on description retrieval, template filling, and manually developed NLP techniques. Later, automated medical report creation saw several developments, but the fundamental idea behind all of them was to use an image encoder to transform CXR images into a latent space, which a decoder then used to produce medical reports. The problem was generally framed as an image-to-sequence task. The reviewed literature encompasses a wide spectrum of methodologies utilized in automated radiological report generation, including CNN-based models, attention-driven mechanisms, and hybrid approaches combining reinforcement learning with encoder-decoder architectures. Such diversity highlights the progression and innovation in this domain, reflecting current research trends and addressing complex challenges effectively. Allaouzi et al. (<xref ref-type="bibr" rid="B6">6</xref>) introduced the idea of using a CNN-RNN architecture to automatically produce medical reports from images. As research in the field progressed, the attention layer (<xref ref-type="bibr" rid="B2">2</xref>) was added in several experiments, and models such as (<xref ref-type="bibr" rid="B7">7</xref>) began fusing the standard CNN-RNN architecture with the attention mechanism to project multi-view visual features at the sentence level in a late-fusion fashion. A dynamic graph paired with contrastive learning in transformers was proposed by Li et al. (<xref ref-type="bibr" rid="B8">8</xref>), which enhanced textual and visual representation for medical report creation. Jing et al. 
(<xref ref-type="bibr" rid="B9">9</xref>) presented a technique that combines multi-task learning and a co-attention mechanism to identify aberrant patches in medical images. The authors then overcame the challenge of creating long paragraph-level reports by using an LSTM-based hierarchical decoder to generate comprehensive clinical imaging reports with visual attention and labels. In their model for automatic report generation from chest x-ray images, Hou et al. (<xref ref-type="bibr" rid="B10">10</xref>) designed an architecture consisting of three key components: an encoder, a decoder, and a reward module. The encoder features two branches: a CNN that extracts visual features from the input images and a multi-label classification (MLC) branch that predicts common medical concepts and findings. These predictions are embedded as vectors and passed to the decoder. The decoder is a hierarchical LSTM with multi-level attention that generates reports in two stages&#x2014;first, a sentence LSTM produces topic vectors to outline the content of each sentence, followed by a word LSTM that generates the specific words for each sentence based on the topic vectors. To enhance report quality, a reward module with two discriminators provides feedback by evaluating the generated report&#x0027;s quality, and this feedback is used to train the generator via reinforcement learning. The reward and decoder modules are trained adversarially in alternating iterations. An iterative decoder with visual attention was developed by Xue et al. (<xref ref-type="bibr" rid="B11">11</xref>) to ensure coherence between generated sentences. Jianbo et al. (<xref ref-type="bibr" rid="B12">12</xref>) utilized the multi-view information of the IU-Xray dataset by using a ResNet152 model trained on the CheXpert dataset (<xref ref-type="bibr" rid="B13">13</xref>) to extract visual features and tag predictions from the patient&#x0027;s frontal and lateral images. 
Hierarchical LSTMs were then used to generate the report. Lovelace and Mortazavi (<xref ref-type="bibr" rid="B14">14</xref>) put forth a transformer-based neural machine translation model that used a fine-tuning technique to extract clinical data from the generated reports and enhance clinical consistency. To create reports from the IU-Xray dataset, a customized transformer and an additional relational memory unit were also utilized by Chen et al. (<xref ref-type="bibr" rid="B15">15</xref>). A visual extractor uses trained models such as VGG and ResNet to extract a set of visual attributes from the frontal and lateral chest images; these features are then sent to an encoder and decoder to produce reports. A framework for generating medical reports from chest x-rays was presented by Pino et al. (<xref ref-type="bibr" rid="B16">16</xref>). It primarily relies on a CNN-based Template-based Report Generation (CNN-TRG) model, which states abnormalities found in the x-ray images using preset templates and fixed wording. By using templates, CNN-TRG takes a more straightforward and systematic approach than many other deep learning-based Natural Language Generation (NLG) techniques, making it easier to guarantee clinical accuracy in the produced reports. Variational Topic Inference (VTI), a novel method for automating the creation of medical image reports&#x2014;a crucial task in clinical practice&#x2014;was presented by Najdenkoska et al. (<xref ref-type="bibr" rid="B17">17</xref>). VTI successfully handles the issue of differing report formats written by radiologists with varying degrees of expertise. This method makes use of conditional variational inference and deep learning techniques, with a primary emphasis on latent topics that inform sentence construction. These latent topics help align the visual and textual modalities in a shared latent space. 
VTI comprises three key components: the visual prior net, which encodes local visual signals from input images; the language posterior net, which captures the associations between word embeddings in the generated sentences; and the sentence generator net. Akbar et al. (<xref ref-type="bibr" rid="B18">18</xref>) used DenseNet121 to extract image features and applied dropout regularization of 20&#x0025; during training. For medical report generation, they used the default embedding layer of Keras and fed both the image vector and the text embeddings to the model for training. Lee et al. (<xref ref-type="bibr" rid="B19">19</xref>) presented a model comprising a Global-Local Visual Extractor (GLVE) and a Cross Encoder-Decoder Transformer (CEDT). The GLVE captured global characteristics such as organ size and bone shape, while the CEDT employed multi-level encoding features. Chen et al. (<xref ref-type="bibr" rid="B20">20</xref>) improved the generation process through the integration of cross-modal memory networks, allowing interactions between text and visuals, among other modalities. Han et al. (<xref ref-type="bibr" rid="B21">21</xref>) provide a framework for combining reinforcement learning (RL) with diffusion probabilistic models to generate chest x-rays (CXRs) conditioned on diagnostic reports. Using Reinforcement Learning with Comparative Feedback (RLCF), the model refines image generation through comparative rewards, ensuring accurate posture alignment, diagnostic detail, and report-image consistency. Additionally, learnable adaptive condition embeddings (ACE) enhance the generator&#x0027;s ability to capture subtle medical features, leading to pathologically realistic CXRs. Parres et al. 
(<xref ref-type="bibr" rid="B22">22</xref>) introduced a two-stage vision encoder-decoder (VED) architecture for radiology report generation (RRG), combining negative log-likelihood (NLL) training and reinforcement learning (RL) optimization. Text augmentation (TA) is proposed to enhance data diversity by reorganizing phrases in reference reports, addressing data scarcity and improving report quality and variability. Pan et al. (<xref ref-type="bibr" rid="B23">23</xref>) proposed a chest radiology report generation method using cross-modal multi-scale feature fusion. It incorporates an auxiliary labeling module to focus on lesion regions, a channel attention network to enhance disease and location feature representation, and a cross-modal feature fusion module that aligns multi-scale visual features with textual features through memory matrices for fine-grained integration.</p>
<p>Similarly, Iqra et al. (<xref ref-type="bibr" rid="B24">24</xref>) introduced a Conditional Self-Attention Memory-Driven Transformer model for radiology report generation. The process involved two phases: first, a ResNet152 v2-based multi-label classification model was used for feature extraction and multi-disease diagnosis. Next, the Conditional Self-Attention Memory-Driven Transformer acted as a decoder, leveraging memory-driven self-attention mechanisms to generate textual reports. Sharma et al. (<xref ref-type="bibr" rid="B25">25</xref>) introduced the MAIRA-Seg framework, a segmentation-aware multimodal large language model (MLLM) for radiology report generation. They trained expert segmentation models to obtain pseudolabels for radiology-specific structures in chest x-rays (CXRs). Building upon the MAIRA architecture, they integrated a trainable segmentation tokens extractor that leverages these segmentation masks and employed mask-aware prompting to generate radiology reports.</p>
<p>Tanno et al. (<xref ref-type="bibr" rid="B26">26</xref>) developed the <italic>Flamingo-CXR</italic> system for automated chest radiograph report generation, which was evaluated by board-certified radiologists. Their study found that AI-generated reports were deemed preferable or equivalent to clinician reports in 56.1&#x0025; of intensive care unit cases and 77.7&#x0025; for in/outpatient x-rays. Despite errors in both AI and human reports, the research highlights the potential for clinician-AI collaboration to improve radiology report quality.</p>
<p>In the clinical environment, the integration of explainable AI (XAI) systems has become increasingly important for fostering trust and ensuring that clinicians understand the rationale behind AI-driven decisions. XAI allows for transparent reporting, providing meaningful explanations for the recommendations made by AI systems, which is crucial for enhancing clinical decision-making and improving patient care. The use of a modality-specific lexicon plays a key role in ensuring that AI-generated reports are detailed, contextually relevant, and interpretable. In the context of breast cancer diagnosis, Bastos et al. (<xref ref-type="bibr" rid="B27">27</xref>) developed a system that incorporates semantic annotation into medical image analysis to generate clearer, more comprehensive explanations of findings, allowing clinicians to better understand AI predictions. This approach not only enhances the transparency of AI models but also ensures they are aligned with clinicians&#x0027; needs, ultimately facilitating the adoption of AI systems in real-world clinical settings. Bluethgen et al. (<xref ref-type="bibr" rid="B28">28</xref>) developed a domain-adaptation strategy for large vision-language models to overcome distributional shifts when generating medical images. By leveraging publicly available chest x-ray datasets and corresponding radiology reports, they adapted a latent diffusion model to generate diverse, visually plausible synthetic chest x-ray images, controlled by free-form medical text prompts. This approach offers a viable alternative to using real medical images for training and fine-tuning AI models.</p>
</sec>
<sec id="s3"><title>Methodology</title>
<p>The model architecture consists of a language model, a projection layer, and a vision model, as illustrated in <xref ref-type="fig" rid="F1">Figure&#x00A0;1</xref>. This architecture enables the model to produce comprehensive medical reports conditioned on the visual features extracted from the medical images.</p>
<fig id="F1" position="float"><label>Figure 1</label>
<caption><p>ChestX-Transcribe architecture.</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-g001.tif"/>
</fig>
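<p>The end-to-end flow through these three components can be sketched as follows. This is a minimal shape-level illustration with random stand-ins for the pretrained networks; the dimensions (1024 for the visual embedding, 768 for the language model, and a GPT-2-style vocabulary of 50,257 tokens) are assumptions for exposition, not the exact configuration used here.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions for exposition, not the paper's exact config)
D_IMAGE, D_LANG, VOCAB = 1024, 768, 50257

def vision_model(image):
    """Stand-in for the Swin Transformer: image -> pooled visual embedding."""
    return rng.standard_normal(D_IMAGE)

def projection(z_image, W, b):
    """Projection layer (Equation 4): align visual embedding with the language model."""
    return z_image @ W + b

def language_model(prefix, token_embeddings):
    """Stand-in for DistilGPT: prepend the visual prefix, return next-token logits."""
    sequence = np.vstack([prefix[None, :], token_embeddings])
    W_out = rng.standard_normal((D_LANG, VOCAB)) * 0.01
    return sequence[-1] @ W_out  # logits for the next report token

image = rng.random((224, 224))                 # dummy chest x-ray
W = rng.standard_normal((D_IMAGE, D_LANG)) * 0.01
b = np.zeros(D_LANG)
tokens = rng.standard_normal((5, D_LANG))      # embeddings of the report generated so far

z_image = vision_model(image)
z_proj = projection(z_image, W, b)
logits = language_model(z_proj, tokens)
```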
<sec id="s3a"><title>Visual model (Swin transformer)</title>
<p>A Swin transformer model is used as the initial processing step to extract visual features from the input chest x-rays. The image is partitioned into patches and passed through the transformer blocks, which operate hierarchically to capture both local and global image information. The extracted visual features are then transformed into a high-dimensional embedding vector.</p>
<sec id="s3a1"><title>Working</title>
<list list-type="simple">
<list-item><label>1.</label>
<p>Patch Partitioning and Linear Embedding: The input image <italic>I</italic> is split into non-overlapping patches of size <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="IM1"><mml:mi>M</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>M</mml:mi></mml:math></inline-formula>. The patches are then flattened, and a linear embedding layer projects each flattened patch into a feature space of dimension <italic>C</italic>. Each patch now becomes a &#x201C;token&#x201D; with a feature vector. The number of tokens <italic>N</italic> (<xref ref-type="disp-formula" rid="disp-formula1">Equation&#x00A0;1</xref>) is then calculated as:<disp-formula id="disp-formula1"><label>(1)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="DM1"><mml:mi>N</mml:mi><mml:mspace width="0.25em"/><mml:mo>=</mml:mo><mml:mspace width="0.25em"/><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:mi>H</mml:mi><mml:mspace width="0.25em"/><mml:mo>&#x00D7;</mml:mo><mml:mspace width="0.25em"/><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mi>M</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math></disp-formula>where <italic>H</italic> and <italic>W</italic> are the height and width of the input image.</p></list-item>
<list-item><label>2.</label>
<p>Window-based Multi-head Self-Attention (W-MSA): Swin transformers use window-based multi-head self-attention, where self-attention is computed within non-overlapping windows. The computational complexity for global multi-head self-attention (MSA) (<xref ref-type="disp-formula" rid="disp-formula2">Equation&#x00A0;2</xref>) and window-based self-attention (W-MSA) (<xref ref-type="disp-formula" rid="disp-formula3">Equation&#x00A0;3</xref>) can be described as:<disp-formula id="disp-formula2"><label>(2)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="DM2"><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>A</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mspace width="0.25em"/><mml:mo>=</mml:mo><mml:mspace width="0.25em"/><mml:mn>4</mml:mn><mml:mi>h</mml:mi><mml:mi>w</mml:mi><mml:msup><mml:mi>C</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mspace width="0.25em"/><mml:mo>+</mml:mo><mml:mspace width="0.25em"/><mml:mn>2</mml:mn><mml:mo stretchy="false">(</mml:mo><mml:mi>h</mml:mi><mml:mi>w</mml:mi><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup><mml:mi>C</mml:mi></mml:math></disp-formula><disp-formula id="disp-formula3"><label>(3)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="DM3"><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>W</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>M</mml:mi><mml:mi>S</mml:mi><mml:mi>A</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mspace width="0.25em"/><mml:mo>=</mml:mo><mml:mspace width="0.25em"/><mml:mn>4</mml:mn><mml:mi>h</mml:mi><mml:mi>w</mml:mi><mml:msup><mml:mi>C</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mspace width="0.25em"/><mml:mo>+</mml:mo><mml:mspace width="0.25em"/><mml:mn>2</mml:mn><mml:msup><mml:mi>M</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>h</mml:mi><mml:mi>w</mml:mi><mml:mi>C</mml:mi></mml:math></disp-formula>where <italic>h</italic> and <italic>w</italic> are the dimensions of the input feature map, <italic>M</italic> is the window size, and <italic>C</italic> is the channel dimension. The first term represents the cost of computing the queries, keys, and values, while the second term captures the attention computation.</p></list-item>
<list-item><label>3.</label>
<p>Shifted Window-based Multi-head Self-Attention (SW-MSA): SW-MSA is a feature introduced by Swin Transformer that enables token interactions across windows. The windows are shifted by <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="IM2"><mml:mrow><mml:mfrac><mml:mi>M</mml:mi><mml:mn>2</mml:mn></mml:mfrac></mml:mrow></mml:math></inline-formula> pixels between successive layers. This enables information exchange across windows.</p></list-item>
<list-item><label>4.</label>
<p>Patch Merging: After every stage, patch merging is used to downsample the resolution of the feature map while increasing the feature dimension.</p></list-item>
</list>
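<p>To make Equations 1&#x2013;3 concrete, the short calculation below computes the token count and compares the cost of global attention with window-based attention. The numbers used (a 224&#x00D7;224 input, patch size 4, window size <italic>M</italic> = 7, <italic>C</italic> = 96) are illustrative assumptions drawn from the standard Swin configuration, not values reported in this paper.</p>

```python
# Equation 1: number of tokens after patch partitioning (patch size 4)
H, W, patch = 224, 224, 4
N = (H * W) // patch**2                          # 3136 tokens

# Equations 2 and 3: attention cost on the resulting h x w feature map
h, w, C, M = H // patch, W // patch, 96, 7       # h = w = 56, window size M = 7

msa = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C    # global MSA: quadratic in h*w
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C  # W-MSA: linear in h*w

ratio = msa / w_msa                              # roughly 14x cheaper at this resolution
```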
</sec>
</sec>
<sec id="s3b"><title>Projection layer</title>
<p>After the image embeddings (<xref ref-type="fig" rid="F1">Figure&#x00A0;1</xref>) are extracted from the vision model, a projection layer aligns them with the input dimension expected by the language model. This ensures that the image representations can be effectively combined with token embeddings from the language model for joint processing. The projection layer applies a linear transformation (<xref ref-type="disp-formula" rid="disp-formula4">Equation&#x00A0;4</xref>) to map the image embeddings from the dimensionality <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="IM3"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to <inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="IM4"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. 
This can be expressed as:<disp-formula id="disp-formula4"><label>(4)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="DM4"><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mspace width=".1em"/><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="0.25em"/><mml:mo>=</mml:mo><mml:mspace width="0.25em"/><mml:msub><mml:mi>z</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mspace width=".1em"/><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="0.25em"/><mml:mo>+</mml:mo><mml:mspace width="0.25em"/><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mspace width=".1em"/><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>Where:
<list list-type="simple">
<list-item><label>&#x2022;</label>
<p><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="IM5"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mspace width=".1em"/><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mspace width="0.25em"/><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x00D7;</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is the weight matrix of the projection layer.</p></list-item>
<list-item><label>&#x2022;</label>
<p><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="IM6"><mml:msub><mml:mi>b</mml:mi><mml:mrow><mml:mspace width=".1em"/><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula> is the bias vector.</p></list-item>
</list></p>
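<p>The affine projection in Equation 4 can be sketched in a few lines of NumPy. The dimensions and variable names below are illustrative assumptions for the sketch, not values taken from the released implementation:</p>

```python
import numpy as np

# Illustrative dimensions (assumed for this sketch; not taken from the paper's code).
d_image, d_lang = 1024, 768            # vision-encoder output dim -> language-model dim

rng = np.random.default_rng(0)
W_proj = rng.normal(scale=0.02, size=(d_image, d_lang))   # learned weight matrix W_proj
b_proj = np.zeros(d_lang)                                 # learned bias vector b_proj

z_image = rng.normal(size=(1, d_image))    # pooled image embedding from the vision encoder
z_proj = z_image @ W_proj + b_proj         # Equation (4): affine projection into language space

print(z_proj.shape)   # (1, 768)
```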
</sec>
<sec id="s3c"><title>Language model (DistilGPT)</title>
<p>The language model is based on DistilGPT, which consists of 12 transformer layers. It takes token embeddings as input and predicts the next token in the sequence to generate a coherent medical report. The token embeddings are generated from the medical report text and processed through the model&#x0027;s transformer layers, which include masked multi-head self-attention, layer normalization, and feed-forward layers. The model also integrates the transformed image embeddings with the token embeddings at the beginning of the input sequence, conditioning report generation on both visual and textual information. The output dimension of the vision model (Swin Transformer) is set to 768, which matches the dimensionality required for the input to the language model.</p>
<p>We selected DistilGPT for its balance of performance, computational efficiency, and adaptability to domain-specific text. While larger language models (e.g., GPT-3) offer superior generalization capabilities, their computational cost makes them less feasible for clinical deployment. DistilGPT retains 97&#x0025; of GPT&#x0027;s performance while being significantly faster and more lightweight, making it well suited to generating radiology reports in real-world, high-volume settings.</p>
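<p>The conditioning step described above, prepending the projected image embedding to the report&#x0027;s token embeddings, can be sketched as follows (all shapes are illustrative assumptions):</p>

```python
import numpy as np

d_lang = 768
rng = np.random.default_rng(1)

# Assumed shapes for illustration: one projected image embedding and the
# embeddings of a 20-token report, both already in the language-model space.
image_prefix = rng.normal(size=(1, 1, d_lang))      # (batch, 1, d_lang)
token_embeds = rng.normal(size=(1, 20, d_lang))     # (batch, seq_len, d_lang)

# Place the visual embedding at the start of the input sequence, so every
# generated token can attend to it through causal self-attention.
inputs_embeds = np.concatenate([image_prefix, token_embeds], axis=1)

print(inputs_embeds.shape)   # (1, 21, 768)
```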
</sec>
<sec id="s3d"><title>Training details</title>
<p>The dataset used in this study is the Indiana University Chest x-ray Dataset, which consists of 7,430 images of frontal and lateral chest x-rays belonging to 3,825 patients. Each image is paired with corresponding radiology reports that provide detailed findings regarding the patients&#x0027; conditions. This dataset serves as the foundation for training the model to generate textual descriptions based on visual inputs.</p>
<sec id="s3d1"><title>Image preprocessing</title>
<p>Each x-ray image undergoes a series of transformations included in the Swin Transformer preprocessing pipeline. First, the image is resized to the standard input dimensions expected by the model, ensuring uniformity across all input samples. Next, pixel values are normalized to the range [0, 1].</p>
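<p>A minimal sketch of these two steps, assuming a grayscale image array and nearest-neighbour resampling for brevity (the actual Swin Transformer pipeline uses its own image processor):</p>

```python
import numpy as np

def preprocess(image, size=224):
    """Resize a grayscale x-ray to size x size and scale pixels to [0, 1].

    Nearest-neighbour resampling keeps the sketch dependency-free; the real
    Swin Transformer pipeline uses its own resampling.
    """
    h, w = image.shape
    rows = np.arange(size) * h // size          # source row index per output row
    cols = np.arange(size) * w // size          # source column index per output column
    resized = image[rows][:, cols]              # uniform target dimensions
    return resized.astype(np.float32) / 255.0   # normalize 8-bit pixels to [0, 1]

xray = np.random.default_rng(2).integers(0, 256, size=(2048, 2500))
out = preprocess(xray)
print(out.shape, float(out.min()), float(out.max()))
```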
</sec>
<sec id="s3d2"><title>Text preprocessing</title>
<p>The textual findings from the radiology reports are tokenized using the GPT-2 tokenizer. This process involves encoding the findings into a sequence of token IDs representing the words or subwords in the text. These token IDs allow the model to interpret and process the textual information effectively. To ensure compatibility with the model&#x0027;s input requirements, tokenized sequences are constrained to a specified maximum length.</p>
<p>If a sequence exceeds this length, it is truncated to fit the required size, preventing overflow during processing. Additionally, an end-of-sequence token ID is appended to mark the conclusion of the text sequence, signaling the model when to stop generating output. In this study, the dataset contains a total of 155,837 tokens, the cumulative count across all text sequences. To maintain consistency and optimize model performance, generation is restricted to a maximum of 100 new tokens beyond the input prompt. The total output length is therefore carefully controlled, ensuring that the generated sequence is neither too short nor too long and can be evaluated consistently across experiments.</p>
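<p>The truncation and end-of-sequence handling can be sketched in plain Python. The token IDs and the names <monospace>MAX_LEN</monospace> and <monospace>EOS_ID</monospace> are ours for illustration; in the actual pipeline the IDs come from the GPT-2 tokenizer, whose end-of-text token ID is 50256:</p>

```python
# Dummy token IDs stand in for GPT-2 tokenizer output; 50256 is the GPT-2
# end-of-text token ID.
EOS_ID = 50256
MAX_LEN = 100

def prepare_sequence(token_ids, max_len=MAX_LEN, eos_id=EOS_ID):
    # Truncate so the appended end-of-sequence token still fits, then append
    # it to signal where the report ends.
    return token_ids[: max_len - 1] + [eos_id]

short_seq = prepare_sequence([101, 2003, 3893])     # short report: EOS appended
long_seq = prepare_sequence(list(range(500)))       # long report: truncated to 100
print(len(short_seq), len(long_seq), long_seq[-1])  # 4 100 50256
```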
</sec>
<sec id="s3d3"><title>Training and validation loss</title>
<p>The model was trained over 5 epochs with both training and validation losses being tracked to monitor the model&#x0027;s performance and prevent overfitting. <xref ref-type="fig" rid="F2">Figure&#x00A0;2</xref> below shows the Training Loss vs. Validation Loss over 5 epochs.</p>
<fig id="F2" position="float"><label>Figure 2</label>
<caption><p>Training vs. Validation Loss (Trends Observed Over 5 Epochs).</p></caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-g002.tif"/>
</fig>
<p>The steady drop in training loss across epochs shows that the model is learning from the data. Initially, the validation loss decreases along with the training loss, indicating that the model is improving its generalization to unseen data. However, after approximately 2 epochs, the validation loss plateaus, suggesting that further training may lead to overfitting.</p>
<p>The model was trained with the Adam optimizer, whose adaptive learning rate method works well for tasks involving large datasets and many parameters. To maximize performance, two different learning rates were applied to different model components. The language model&#x0027;s parameters were updated with a learning rate of 5e-5, so that fine-tuning proceeds gradually without causing significant changes.</p>
<p>This allowed the model to preserve its pre-trained knowledge while adapting to the task-specific data. In contrast, because the projection layers are more task-specific and require more substantial updates, their parameters were trained with a higher learning rate of 2e-4. By maintaining this balance throughout training, the differential learning rate strategy helps keep the model from underperforming or overfitting prematurely.</p>
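<p>This differential learning rate scheme amounts to optimizer parameter groups with distinct step sizes. The sketch below applies one Adam-style update per group in NumPy; the assignment of the higher rate to the task-specific projection layers follows the stated rationale, and all parameter shapes and gradients are illustrative stand-ins:</p>

```python
import numpy as np

def adam_step(param, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `state` carries the running moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Two parameter groups, each with its own learning rate (shapes illustrative).
groups = [
    {"name": "language_model", "lr": 5e-5, "param": np.ones(4)},   # gentle fine-tuning
    {"name": "projection",     "lr": 2e-4, "param": np.ones(4)},   # larger task-specific updates
]
for g in groups:
    state = {"t": 0, "m": np.zeros_like(g["param"]), "v": np.zeros_like(g["param"])}
    grad = np.full_like(g["param"], 0.5)                 # stand-in gradient
    g["param"] = adam_step(g["param"], grad, state, g["lr"])

# For the same gradient, the projection group takes the larger step.
print(groups[0]["param"][0], groups[1]["param"][0])
```

<p>In PyTorch, the same effect is obtained by passing multiple parameter groups, each with its own <monospace>lr</monospace>, to <monospace>torch.optim.Adam</monospace>.</p>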
</sec>
</sec>
<sec id="s3e"><title>Evaluation metrics</title>
<list list-type="simple">
<list-item><label>1.</label>
<p><bold>BLEU Score:</bold> Bilingual Evaluation Understudy (BLEU) (<xref ref-type="bibr" rid="B29">29</xref>) (<xref ref-type="disp-formula" rid="disp-formula5">Equation&#x00A0;5</xref>) is a widely used metric in natural language processing for evaluating the quality of model-generated text. It measures the overlap between the generated text and the reference text by comparing n-grams (contiguous sequences of words). A higher BLEU score indicates better alignment with the ground-truth reports, demonstrating the model&#x0027;s ability to generate coherent and relevant descriptions.<disp-formula id="disp-formula5"><label>(5)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="DM5"><mml:mi>B</mml:mi><mml:mi>L</mml:mi><mml:mi>E</mml:mi><mml:mi>U</mml:mi><mml:mspace width="0.25em"/><mml:mo>=</mml:mo><mml:mspace width="0.25em"/><mml:mi>B</mml:mi><mml:mi>P</mml:mi><mml:mo>.</mml:mo><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mspace width="0.2em"/><mml:msub><mml:mi>w</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mspace width="0.25em"/><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:msub><mml:mi>p</mml:mi><mml:mi>n</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p></list-item>
</list>
<p>Where:
<list list-type="simple">
<list-item><label>&#x2022;</label>
<p><italic>BP</italic> is the brevity penalty, which penalizes shorter sentences to encourage longer, more complete outputs.</p></list-item>
<list-item><label>&#x2022;</label>
<p><italic>p<sub>n</sub></italic> is the precision for n-grams of order n.</p></list-item>
<list-item><label>&#x2022;</label>
<p><italic>w<sub>n</sub></italic> is the weight assigned to the precision of each n-gram level.</p></list-item>
<list-item><label>&#x2022;</label>
<p><italic>N</italic> is the maximum length of the n-grams.</p></list-item>
<list-item><label>2.</label>
<p><bold>ROUGE-L:</bold> The ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence) (<xref ref-type="bibr" rid="B30">30</xref>) metric measures the Longest Common Subsequence (LCS) between a generated text and a reference text. It focuses on capturing the sequence similarity while maintaining word order. ROUGE-L computes precision (<xref ref-type="disp-formula" rid="disp-formula6">Equation&#x00A0;6</xref>), recall (<xref ref-type="disp-formula" rid="disp-formula7">Equation&#x00A0;7</xref>), and F1-score based on the length of the LCS between the generated sequence <italic>C</italic> and the reference sequence <italic>R</italic>.<disp-formula id="disp-formula111"><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="UDM1"><mml:mtable columnalign="right left" rowspacing=".5em" columnspacing="thickmathspace" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mi>C</mml:mi><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>R</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:mspace width=".1em"/><mml:mrow><mml:mi mathvariant="normal">Length</mml:mi></mml:mrow><mml:mspace width="0.25em"/><mml:mrow><mml:mi mathvariant="normal">of</mml:mi></mml:mrow><mml:mspace width="0.25em"/><mml:mrow><mml:mi mathvariant="normal">the</mml:mi></mml:mrow><mml:mspace width="0.25em"/><mml:mrow><mml:mi mathvariant="normal">Longest</mml:mi></mml:mrow><mml:mspace width="0.25em"/><mml:mrow><mml:mi mathvariant="normal">Common</mml:mi></mml:mrow><mml:mspace width="0.25em"/><mml:mrow><mml:mi mathvariant="normal">Subsequence</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="normal">between</mml:mi></mml:mrow><mml:mspace width="0.25em"/><mml:mi>C</mml:mi><mml:mspace width="0.25em"/><mml:mrow><mml:mi mathvariant="normal">and</mml:mi></mml:mrow><mml:mspace 
width="0.25em"/><mml:mi>R</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula><disp-formula id="disp-formula6"><label>(6)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="DM6"><mml:mi>P</mml:mi><mml:mspace width="0.25em"/><mml:mo>=</mml:mo><mml:mspace width="0.25em"/><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:mi>L</mml:mi><mml:mi>C</mml:mi><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>R</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mi>C</mml:mi><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math></disp-formula><disp-formula id="disp-formula7"><label>(7)</label><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="DM7"><mml:mi>R</mml:mi><mml:mspace width="0.25em"/><mml:mo>=</mml:mo><mml:mspace width="0.25em"/><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:mi>L</mml:mi><mml:mi>C</mml:mi><mml:mi>S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mi>R</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mi>R</mml:mi><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle></mml:math></disp-formula></p></list-item>
<list-item><label>&#x2022;</label>
<p><italic>P</italic> is the Precision.</p></list-item>
<list-item><label>&#x2022;</label>
<p><italic>R</italic> is the Recall.</p></list-item>
<list-item><label>&#x2022;</label>
<p><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="IM7"><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mi>C</mml:mi><mml:mo fence="false" stretchy="false">|</mml:mo></mml:math></inline-formula> is the length of the generated text.</p></list-item>
<list-item><label>&#x2022;</label>
<p><inline-formula><mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" id="IM8"><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mi>R</mml:mi><mml:mo fence="false" stretchy="false">|</mml:mo></mml:math></inline-formula> is the length of the reference text.</p></list-item>
<list-item><label>3.</label>
<p><bold>ROUGE-1:</bold> The ROUGE-1 metric evaluates the unigram overlap between the generated text and the reference text. It captures the presence of individual words from the reference in the generated sequence, measuring precision, recall, and F1-score.</p></list-item>
<list-item><label>4.</label>
<p><bold>ROUGE-2:</bold> The ROUGE-2 metric evaluates the bigram overlap, focusing on the accuracy of consecutive word pairs in the generated text compared to the reference. It calculates precision, recall, and F1-score based on bigram matches.</p></list-item>
<list-item><label>5.</label>
<p><bold>METEOR:</bold> Metric for Evaluation of Translation with Explicit Ordering (METEOR) (<xref ref-type="bibr" rid="B31">31</xref>) is a metric for evaluating the quality of machine-generated translations by comparing them with human-generated reference translations. Unlike precision-focused metrics such as BLEU, METEOR places more emphasis on recall and incorporates additional linguistic features such as stemming, synonym matching, and paraphrase matching.</p></list-item>
</list></p>
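<p>The BLEU and ROUGE-L definitions above can be made concrete with a short, self-contained sketch. This is a single-reference, unsmoothed implementation with uniform weights <italic>w<sub>n</sub></italic>&#x2009;=&#x2009;1/<italic>N</italic>; production evaluations typically use smoothed, corpus-level library implementations:</p>

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """BLEU (Equation 5) with uniform weights w_n = 1/N and brevity penalty BP."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())        # clipped n-gram matches
        if overlap == 0:
            return 0.0                              # unsmoothed: an empty order zeroes the score
        log_sum += (1.0 / max_n) * math.log(overlap / sum(cand.values()))
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_sum)

def rouge_l(candidate, reference):
    """ROUGE-L precision and recall from the longest common subsequence (Equations 6 and 7)."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]      # LCS dynamic-programming table
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if candidate[i] == reference[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    return lcs / m, lcs / n                         # P = LCS/|C|, R = LCS/|R|

ref = "the lungs are clear with no pleural effusion".split()
gen = "the lungs are clear no pleural effusion seen".split()
p, r = rouge_l(gen, ref)
print(round(bleu(gen, ref), 3), p, r)   # 0.5 0.875 0.875
```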
</sec>
</sec>
<sec id="s4" sec-type="results"><title>Results</title>
<p>In this section, we present the results of evaluating our proposed model on the Indiana University Chest x-ray Dataset using various evaluation metrics, including BLEU, ROUGE, and METEOR. These metrics offer a thorough evaluation of the quality of the generated text in terms of recall and precision.</p>
<p>A comparison of our model&#x0027;s output with several state-of-the-art methods for generating medical reports from chest x-ray images is shown in <xref ref-type="table" rid="T1">Table&#x00A0;1</xref>. Our model performs better on all evaluation criteria, including the ROUGE-L, METEOR, and BLEU scores. A BLEU-1 score of 0.675 is noteworthy, indicating that the model captures unigrams relevant to the medical context and outperforming previous models such as Alqahtani et al. (<xref ref-type="bibr" rid="B35">35</xref>) and Singh et al. (<xref ref-type="bibr" rid="B36">36</xref>). Our BLEU-2, BLEU-3, and BLEU-4 scores of 0.585, 0.523, and 0.472, respectively, demonstrate strong coherence and relevance in longer generated sequences. With a METEOR score of 0.382, our model surpasses many others in producing linguistically diverse text, highlighting its effectiveness in capturing semantic nuances. Additionally, our ROUGE-L score of 0.698 indicates a high degree of structural similarity between the generated reports and the reference texts, showing that our approach excels at preserving sentence-level organization. Overall, the results in <xref ref-type="table" rid="T1">Table&#x00A0;1</xref> confirm that our model performs noticeably better than current approaches, underscoring its efficiency in automatically generating high-quality medical reports.</p>
<table-wrap id="T1" position="float"><label>Table 1</label>
<caption><p>Performance metrics of state-of-the-art models across BLEU, METEOR, and ROUGE scores.</p></caption>
<table frame="hsides" rules="groups">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
</colgroup>
<thead>
<tr>
<th valign="top" align="left">S. no</th>
<th valign="top" align="left">Works</th>
<th valign="top" align="center">BL-1</th>
<th valign="top" align="center">BL-2</th>
<th valign="top" align="center">BL-3</th>
<th valign="top" align="center">BL-4</th>
<th valign="top" align="center">MTR</th>
<th valign="top" align="center">RG-1</th>
<th valign="top" align="center">RG-2</th>
<th valign="top" align="center">RG-L</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1.</td>
<td valign="top" align="left">Niksaz et al. (<xref ref-type="bibr" rid="B32">32</xref>) (ResNeXt&#x2009;&#x002B;&#x2009;BioBert)</td>
<td valign="top" align="center">0.178</td>
<td valign="top" align="center">0.146</td>
<td valign="top" align="center">0.135</td>
<td valign="top" align="center">0.102</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">&#x2013;</td>
</tr>
<tr>
<td valign="top" align="left">2.</td>
<td valign="top" align="left">Junior et al. (<xref ref-type="bibr" rid="B33">33</xref>)</td>
<td valign="top" align="center">0.377</td>
<td valign="top" align="center">0.239</td>
<td valign="top" align="center">0.168</td>
<td valign="top" align="center">0.124</td>
<td valign="top" align="center">0.322</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">0.300</td>
</tr>
<tr>
<td valign="top" align="left">3.</td>
<td valign="top" align="left">Yelure et al. (<xref ref-type="bibr" rid="B34">34</xref>) (Encoder-Decoder)</td>
<td valign="top" align="center">0.11</td>
<td valign="top" align="center">0.23</td>
<td valign="top" align="center">0.32</td>
<td valign="top" align="center">0.38</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">&#x2013;</td>
</tr>
<tr>
<td valign="top" align="left">4.</td>
<td valign="top" align="left">Yelure et al. (<xref ref-type="bibr" rid="B34">34</xref>) (Encoder-Decoder with Attention)</td>
<td valign="top" align="center">0.11</td>
<td valign="top" align="center">0.32</td>
<td valign="top" align="center">0.46</td>
<td valign="top" align="center">0.56</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">&#x2013;</td>
</tr>
<tr>
<td valign="top" align="left">5.</td>
<td valign="top" align="left">Alqahtani et al. (<xref ref-type="bibr" rid="B35">35</xref>)</td>
<td valign="top" align="center">0.479</td>
<td valign="top" align="center">0.363</td>
<td valign="top" align="center">0.261</td>
<td valign="top" align="center">0.173</td>
<td valign="top" align="center">0.188</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">0.354</td>
</tr>
<tr>
<td valign="top" align="left">6.</td>
<td valign="top" align="left">Singh et al. (<xref ref-type="bibr" rid="B36">36</xref>) (ResNet-101,CNN&#x2009;&#x002B;&#x2009;Transformer)</td>
<td valign="top" align="center">0.311</td>
<td valign="top" align="center">0.196</td>
<td valign="top" align="center">0.131</td>
<td valign="top" align="center">0.091</td>
<td valign="top" align="center">0.136</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">0.264</td>
</tr>
<tr>
<td valign="top" align="left">7.</td>
<td valign="top" align="left">Shaikh et al. (<xref ref-type="bibr" rid="B37">37</xref>)</td>
<td valign="top" align="center">0.465</td>
<td valign="top" align="center">0.300</td>
<td valign="top" align="center">0.220</td>
<td valign="top" align="center">0.172</td>
<td valign="top" align="center">0.185</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">0.361</td>
</tr>
<tr>
<td valign="top" align="left">8.</td>
<td valign="top" align="left">Alfarghaly et al. (<xref ref-type="bibr" rid="B38">38</xref>) (CDGPT2)</td>
<td valign="top" align="center">0.387</td>
<td valign="top" align="center">0.245</td>
<td valign="top" align="center">0.166</td>
<td valign="top" align="center">0.111</td>
<td valign="top" align="center">0.164</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">0.289</td>
</tr>
<tr>
<td valign="top" align="left">9.</td>
<td valign="top" align="left">Akbar et al. (<xref ref-type="bibr" rid="B18">18</xref>)</td>
<td valign="top" align="center">0.558</td>
<td valign="top" align="center">0.463</td>
<td valign="top" align="center">0.311</td>
<td valign="top" align="center">0.097</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">0.448</td>
</tr>
<tr>
<td valign="top" align="left">10.</td>
<td valign="top" align="left">Raminedi et al. (<xref ref-type="bibr" rid="B39">39</xref>) (ViGPT2)</td>
<td valign="top" align="center">0.571</td>
<td valign="top" align="center">0.385</td>
<td valign="top" align="center">0.291</td>
<td valign="top" align="center">0.226</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center"/>
<td valign="top" align="center"/>
<td valign="top" align="center">0.433</td>
</tr>
<tr>
<td valign="top" align="left">11.</td>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">0.675</td>
<td valign="top" align="center">0.585</td>
<td valign="top" align="center">0.523</td>
<td valign="top" align="center">0.472</td>
<td valign="top" align="center">0.382</td>
<td valign="top" align="center">0.72</td>
<td valign="top" align="center">0.55</td>
<td valign="top" align="center">0.698</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The failure cases outlined in <xref ref-type="table" rid="T2">Table&#x00A0;2</xref> highlight key limitations of the dataset and model, particularly the underrepresentation of rare or subtle conditions. For instance, the omission of calcified granulomas and degenerative changes underscores the challenge of detecting less common or subtle abnormalities that may not be adequately represented in the training data. Similarly, the model&#x0027;s failure to capture the nuanced description of acute bony findings points to difficulties in handling ambiguous or borderline cases. While the general findings were correctly identified, minor stylistic differences in phrasing reflect inconsistencies in reporting, though these do not affect clinical accuracy. These failures emphasize the need for a more diverse dataset, with a better balance between common and rare conditions, to ensure the model can generalize effectively. Future improvements, such as data augmentation, synthetic data generation, and class balancing, will help address these gaps and enhance the model&#x0027;s ability to accurately detect a wider range of clinical findings, ultimately improving its robustness and applicability in real-world clinical settings.</p>
<table-wrap id="T2" position="float"><label>Table 2</label>
<caption><p>Limitations and failure cases.</p></caption>
<table frame="hsides" rules="groups">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th valign="top" align="left">S. no</th>
<th valign="top" align="center">Case</th>
<th valign="top" align="center">Sample report findings</th>
<th valign="top" align="center">Generated report findings</th>
<th valign="top" align="center">Issue</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1.</td>
<td valign="top" align="left">Calcified granuloma</td>
<td valign="top" align="left">Large calcified granuloma within the medial right lung base</td>
<td valign="top" align="left">Granuloma not mentioned</td>
<td valign="top" align="left">Omission of rare finding (calcified granuloma)</td>
</tr>
<tr>
<td valign="top" align="left">2.</td>
<td valign="top" align="left">Degenerative changes</td>
<td valign="top" align="left">Mild degenerative changes at the lower thoracic spine</td>
<td valign="top" align="left">No mention of degenerative changes</td>
<td valign="top" align="left">Omission of subtle abnormality (degenerative changes)</td>
</tr>
<tr>
<td valign="top" align="left">3.</td>
<td valign="top" align="left">Bony abnormalities</td>
<td valign="top" align="left">Convincing acute bony findings</td>
<td valign="top" align="left">No acute bony abnormality</td>
<td valign="top" align="left">Ambiguous terminology; failed to capture nuanced description of bony findings</td>
</tr>
<tr>
<td valign="top" align="left">4.</td>
<td valign="top" align="left">General findings</td>
<td valign="top" align="left">Clear lungs, normal heart size, no pleural effusion or pneumothorax</td>
<td valign="top" align="left">Similar findings</td>
<td valign="top" align="left">Minor phrasing differences, no false analysis; just style variance</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s5" sec-type="discussion"><title>Discussion</title>
<p>This work demonstrates promising results using the ChestX-Transcribe model, but several limitations related to both the dataset and the model itself must be considered. The Indiana University Chest x-ray (IU CXR) dataset, while valuable, may introduce selection bias due to its specific origins within the Indiana Network for Patient Care. This can affect the model&#x0027;s generalizability if certain patient demographics or less common medical conditions are underrepresented, limiting its applicability to a wider, more diverse population, including rural or international healthcare settings. Additionally, the dataset may not capture the full spectrum of conditions, such as rare findings like calcified granulomas or degenerative changes, which could result in omissions or misclassifications in model output. Regarding the model, DistilGPT was chosen for its balance between computational efficiency and coherence, but more advanced models such as GPT-4 or T5, fine-tuned for medical data, could provide more accurate and context-sensitive reports. However, these models come with higher computational costs, which could hinder scalability in real-world clinical applications, where timely report generation is crucial. Furthermore, since the model was trained on a single dataset, its performance on other datasets with differing characteristics&#x2014;such as those containing rare conditions or subtle findings like bony abnormalities&#x2014;remains uncertain. This underscores the need for further validation on diverse datasets to assess the model&#x0027;s robustness and generalizability. These limitations highlight the need for data diversity, improved model efficiency, and cross-dataset validation to enhance the model&#x0027;s practicality in real-world clinical settings.</p>
</sec>
<sec id="s6" sec-type="conclusions"><title>Conclusion</title>
<p>In this study, we aimed to develop a multimodal sequence-to-sequence transformer model for generating accurate medical reports from chest x-ray images, which addresses the critical need for automated systems in radiology. Our proposed model demonstrated superior performance across various evaluation metrics, achieving BLEU, ROUGE, and METEOR scores that outperformed several state-of-the-art models. The integration of a Swin Transformer for visual feature extraction and DistilGPT for text generation proved effective in producing coherent and contextually relevant medical narratives. The findings (<xref ref-type="table" rid="T3">Table&#x00A0;3</xref>) suggest that our method can markedly improve the efficiency of medical reporting, potentially assisting healthcare practitioners in delivering prompt and precise diagnoses. This automation could alleviate the workload on radiologists, allowing them to focus more on patient care.</p>
<table-wrap id="T3" position="float"><label>Table 3</label>
<caption><p>Example predictions of ChestX-transcribe.</p></caption>
<table frame="hsides" rules="groups">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th valign="top" align="left">Input image</th>
<th valign="top" align="center">Ground truth</th>
<th valign="top" align="center">Predicted</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-i001.tif"/></td>
<td valign="top" align="left">
There is a subtle left medial base opacity. Cardiomediastinal silhouette is normal. Pulmonary vasculature and XXXX are normal. No pneumothorax or large pleural effusion. Osseous structures and soft tissues are normal.</td>
<td valign="top" align="left">XXXX are unchanged. No focal airspace consolidation. No pleural effusion or pneumothorax. Heart size and mediastinal contour are within normal limits.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-i002.tif"/></td>
<td valign="top" align="left">
Cardiomediastinal silhouette and pulmonary vasculature are within normal limits. Lungs are clear. No pneumothorax or pleural effusion. No acute osseous findings.</td>
<td valign="top" align="left">C and lateral views of the chest. The cardiomediastinal silhouette is normal in size and contour. No focal consolidation, pneumothorax or large pleural effusion. No acute bony abnormality.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-i003.tif"/></td>
<td valign="top" align="left">
The cardiac contours are normal. The lungs are clear. Thoracic spondylosis.</td>
<td valign="top" align="left">Heart size and mediastinal contours. The lungs are clear. Thoracic spondylosis.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-i004.tif"/></td>
<td valign="top" align="left">
The heart size is normal. The mediastinal contour is within normal limits. The lungs are free of any focal infiltrates. There are no nodules or masses. No visible pneumothorax. No visible pleural fluid. The XXXX are grossly normal. There is no visible free intraperitoneal air under the diaphragm. Surgical clips are seen the right upper quadrant.</td>
<td valign="top" align="left">Heart size is normal. Lungs are clear.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-i005.tif"/></td>
<td valign="top" align="left">
The cardiac silhouette and mediastinal contours are within normal limits. There are low lung volumes with bronchovascular crowding. Otherwise the lungs are clear. There is no pneumothorax. No large pleural effusion.</td>
<td valign="top" align="left">Low lung volumes. Heart size within normal limits. No pneumothorax or pleural effusion.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-i006.tif"/></td>
<td valign="top" align="left">
The heart and lungs have XXXX XXXX in the interval. Both lungs are clear and expanded. Heart and mediastinum normal.</td>
<td valign="top" align="left">XXXX opacities XXXX representing chronic lung disease. Lungs are clear. No pleural effusions or pneumothoraces. Heart and mediastinum of normal size and contour. Degenerative changes in the spine.</td>
</tr>
<tr>
<td valign="top" align="left"><inline-graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="fdgth-07-1535168-i007.tif"/></td>
<td valign="top" align="left">
Right lower lobe XXXX calcified granuloma. Heart size within normal limits. No pleural effusions. No evidence of pneumothorax. Degenerative changes thoracic spine.</td>
<td valign="top" align="left">Right lower lobe XXXX calcified granuloma. Heart size and pulmonary vascularity within normal limits. No focal consolidation, pneumothorax or pleural effusion.</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>While our model performs well on the evaluated dataset, it is essential to recognize that the results may vary with different datasets or imaging modalities. Further exploration is needed to generalize the model&#x0027;s performance across various clinical contexts. As noted by Pan et al. (<xref ref-type="bibr" rid="B40">40</xref>), integrating large AI models into radiology workflows presents challenges, including data privacy concerns, ethical considerations, and compatibility with existing hospital infrastructure. Future work should focus on refining the model by incorporating larger and more diverse datasets to enhance its robustness. Additionally, exploring hybrid architectures or integrating attention mechanisms may yield further performance improvements.</p>
<p>Several steps should be considered to transition this model into real-world clinical systems. A key requirement is protecting patient information in compliance with healthcare regulations. Integration with existing hospital infrastructure, including compatibility with radiology workstations and electronic health records, is essential for seamless deployment, and implementing real-time processing capabilities to enable timely report generation remains a practical challenge. Ethical considerations surrounding the use of AI in healthcare, such as transparency, accountability, and bias mitigation, must also be addressed to ensure responsible adoption. Ultimately, this research lays the groundwork for intelligent systems that not only improve the accuracy of medical reporting but also pave the way for innovative applications in automated healthcare diagnostics.</p>
</sec>
</body>
<back>
<sec id="s7" sec-type="data-availability"><title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.</p>
</sec>
<sec id="s8" sec-type="author-contributions"><title>Author contributions</title>
<p>PS: Conceptualization, Methodology, Software, Visualization, Writing &#x2013; original draft. SS: Methodology, Supervision, Visualization, Writing &#x2013; review &#x0026; editing.</p>
</sec>
<sec id="s9" sec-type="funding-information"><title>Funding</title>
<p>The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.</p>
</sec>
<ack><title>Acknowledgments</title>
<p>All authors are thankful to Lovely Professional University for providing the laboratory facilities used to carry out this work.</p>
</ack>
<sec id="s10" sec-type="COI-statement"><title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s12" sec-type="ai-statement"><title>Generative AI statement</title>
<p>The author(s) declare that no Generative AI was used in the creation of this manuscript.</p>
</sec>
<sec id="s11" sec-type="disclaimer"><title>Publisher&#x0027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list><title>References</title>
<ref id="B1"><label>1.</label><citation citation-type="book"><collab>Committee on Diagnostic Error in Health Care, Board on Health Care Services, Institute of Medicine, The National Academies of Sciences, Engineering, and Medicine</collab>. <article-title><italic>Improving diagnosis in health care</italic></article-title>. In: <person-group person-group-type="editor"><name><surname>Balogh</surname><given-names>EP</given-names></name><name><surname>Miller</surname><given-names>BT</given-names></name><name><surname>Ball</surname><given-names>JR</given-names></name></person-group>, <publisher-loc>Washington, DC</publisher-loc>: <publisher-name>National Academies Press</publisher-name> (<year>2015</year>). <comment>Available online at:</comment> <ext-link ext-link-type="uri" xlink:href="https://www.ncbi.nlm.nih.gov/books/NBK338586/">https://www.ncbi.nlm.nih.gov/books/NBK338586/</ext-link></citation></ref>
<ref id="B2"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vaswani</surname><given-names>A</given-names></name><name><surname>Shazeer</surname><given-names>N</given-names></name><name><surname>Parmar</surname><given-names>N</given-names></name><name><surname>Uszkoreit</surname><given-names>J</given-names></name><name><surname>Jones</surname><given-names>L</given-names></name><name><surname>Gomez</surname><given-names>AN</given-names></name><etal/></person-group> <article-title>Attention is all you need</article-title>. <source>Adv Neural Inf Process Syst</source>. (<year>2017</year>) <comment>arXiv:1706.03762v7</comment>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1706.03762v7">https://arxiv.org/abs/1706.03762v7</ext-link></citation></ref>
<ref id="B3"><label>3.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Li</surname><given-names>T</given-names></name><name><surname>El Mesbahi</surname><given-names>Y</given-names></name><name><surname>Kobyzev</surname><given-names>I</given-names></name><name><surname>Rashid</surname><given-names>A</given-names></name><name><surname>Mahmud</surname><given-names>A</given-names></name><name><surname>Anchuri</surname><given-names>N</given-names></name><etal/></person-group> <article-title>A short study on compressing decoder-based language models. <italic>arXiv preprint arXiv:2305.10698</italic></article-title> (<year>2023</year>).</citation></ref>
<ref id="B4"><label>4.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>Z</given-names></name><name><surname>Lin</surname><given-names>Y</given-names></name><name><surname>Cao</surname><given-names>Y</given-names></name><name><surname>Hu</surname><given-names>H</given-names></name><name><surname>Wei</surname><given-names>Y</given-names></name><name><surname>Zhang</surname><given-names>Z</given-names></name><etal/></person-group> <article-title>Swin transformer: hierarchical vision transformer using shifted windows</article-title>. <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</conf-name> (<year>2021</year>).</citation></ref>
<ref id="B5"><label>5.</label><citation citation-type="other"><article-title>Indiana University Chest x-ray dataset</article-title>. (n.d.). <comment>Available online at:</comment> <ext-link ext-link-type="uri" xlink:href="https://openi.nlm.nih.gov/detailedresult?img=CXR111_IM-0076-1001%26req=4">https://openi.nlm.nih.gov/detailedresult?img&#x003D;CXR111_IM-0076-1001&#x0026;req&#x003D;4</ext-link> (<comment>Accessed October, 2024</comment>).</citation></ref>
<ref id="B6"><label>6.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Allaouzi</surname><given-names>I</given-names></name><name><surname>Ben Ahmed</surname><given-names>M</given-names></name><name><surname>Benamrou</surname><given-names>B</given-names></name><name><surname>Ouardouz</surname><given-names>M</given-names></name></person-group>. <article-title>Automatic caption generation for medical images</article-title>. <conf-name>Proceedings of the 3rd International Conference on Smart City Applications</conf-name>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name> (<year>2018</year>). p. <fpage>1</fpage>&#x2013;<lpage>6</lpage></citation></ref>
<ref id="B7"><label>7.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Yuan</surname><given-names>J</given-names></name><name><surname>Liao</surname><given-names>H</given-names></name><name><surname>Luo</surname><given-names>R</given-names></name><name><surname>Luo</surname><given-names>J</given-names></name></person-group>. <article-title>Automatic radiology report generation based on multi-view image fusion and medical concept enrichment</article-title>. <conf-name>Medical Image Computing and Computer-Assisted Intervention &#x2013; MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13&#x2013;17, 2019, Proceedings, Part VI</conf-name> (<year>2019</year>).</citation></ref>
<ref id="B8"><label>8.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Li</surname><given-names>M</given-names></name><name><surname>Lin</surname><given-names>B</given-names></name><name><surname>Chen</surname><given-names>Z</given-names></name><name><surname>Lin</surname><given-names>H</given-names></name><name><surname>Liang</surname><given-names>X</given-names></name><name><surname>Chang</surname><given-names>X</given-names></name></person-group>. <article-title>Dynamic graph enhanced contrastive learning for chest x-ray report generation</article-title>. <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name> (<year>2023</year>). p. <fpage>3334</fpage>&#x2013;<lpage>43</lpage></citation></ref>
<ref id="B9"><label>9.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Jing</surname><given-names>B</given-names></name><name><surname>Xie</surname><given-names>P</given-names></name><name><surname>Xing</surname><given-names>E</given-names></name></person-group>. <article-title>On the automatic generation of medical imaging reports</article-title>. <conf-name>Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1 (Long Papers)</conf-name> (<year>2018</year>). p. <fpage>2577</fpage>&#x2013;<lpage>86</lpage></citation></ref>
<ref id="B10"><label>10.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Hou</surname><given-names>D</given-names></name><name><surname>Zhao</surname><given-names>Z</given-names></name><name><surname>Liu</surname><given-names>Y</given-names></name><name><surname>Chang</surname><given-names>F</given-names></name><name><surname>Hu</surname><given-names>S</given-names></name></person-group>. <article-title>Automatic report generation for chest x-ray images via adversarial reinforcement learning</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B11"><label>11.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Xue</surname><given-names>Y</given-names></name><name><surname>Xu</surname><given-names>T</given-names></name><name><surname>Long</surname><given-names>LR</given-names></name><name><surname>Xue</surname><given-names>Z</given-names></name><name><surname>Antani</surname><given-names>S</given-names></name><name><surname>Thoma</surname><given-names>GR</given-names></name><etal/></person-group> <article-title>Multimodal recurrent model with attention for automated radiology report generation</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B12"><label>12.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Yuan</surname><given-names>J</given-names></name><name><surname>Liao</surname><given-names>H</given-names></name><name><surname>Luo</surname><given-names>R</given-names></name><name><surname>Luo</surname><given-names>J</given-names></name></person-group>. <article-title>Automatic radiology report generation based on multi-view image fusion and medical concept enrichment</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B13"><label>13.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Irvin</surname><given-names>J</given-names></name><name><surname>Rajpurkar</surname><given-names>P</given-names></name><name><surname>Ko</surname><given-names>M</given-names></name><name><surname>Yu</surname><given-names>Y</given-names></name><name><surname>Ciurea-Ilcus</surname><given-names>S</given-names></name><name><surname>Chute</surname><given-names>C</given-names></name><etal/></person-group> <article-title>CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison</article-title> (<year>2019</year>).</citation></ref>
<ref id="B14"><label>14.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Lovelace</surname><given-names>J</given-names></name><name><surname>Mortazavi</surname><given-names>B</given-names></name></person-group>. <article-title>Learning to generate clinically coherent chest x-ray reports</article-title>. In: <person-group person-group-type="editor"><name><surname>Cohn</surname><given-names>T</given-names></name><name><surname>He</surname><given-names>Y</given-names></name><name><surname>Liu</surname><given-names>Y</given-names></name></person-group>, editors. <source>Findings of the Association for Computational Linguistics: EMNLP 2020</source>. <publisher-name>Association for Computational Linguistics. (ACL)</publisher-name> (<year>2020</year>). p. <fpage>1235</fpage>&#x2013;<lpage>43</lpage>.</citation></ref>
<ref id="B15"><label>15.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>Z</given-names></name><name><surname>Song</surname><given-names>Y</given-names></name><name><surname>Chang</surname><given-names>T</given-names></name><name><surname>Wan</surname><given-names>X</given-names></name></person-group>. <article-title>Generating radiology reports via memory-driven transformer</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B16"><label>16.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Pino</surname><given-names>P</given-names></name><name><surname>Parra</surname><given-names>D</given-names></name><name><surname>Besa</surname><given-names>C</given-names></name><name><surname>Lagos</surname><given-names>C</given-names></name></person-group>. <article-title>Clinically correct report generation from chest x-rays using templates</article-title> (<year>2021</year>).</citation></ref>
<ref id="B17"><label>17.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Najdenkoska</surname><given-names>I</given-names></name><name><surname>Zhen</surname><given-names>X</given-names></name><name><surname>Worring</surname><given-names>M</given-names></name><name><surname>Shao</surname><given-names>L</given-names></name></person-group>. <article-title>Uncertainty-aware report generation for chest x-rays by variational topic inference</article-title> (<year>2022</year>).</citation></ref>
<ref id="B18"><label>18.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Akbar</surname><given-names>W</given-names></name><name><surname>Haq</surname><given-names>MIU</given-names></name><name><surname>Soomro</surname><given-names>A</given-names></name><name><surname>Daudpota</surname><given-names>SM</given-names></name><name><surname>Imran</surname><given-names>AS</given-names></name><name><surname>Ullah</surname><given-names>M</given-names></name></person-group>. <article-title>Automated report generation: A GRU-based method for chest x-rays</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B19"><label>19.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Lee</surname><given-names>H</given-names></name><name><surname>Cho</surname><given-names>H</given-names></name><name><surname>Park</surname><given-names>J</given-names></name><name><surname>Chae</surname><given-names>J</given-names></name><name><surname>Kim</surname><given-names>J</given-names></name></person-group>. <article-title>Cross encoder-decoder transformer with global-local visual extractor for medical image captioning</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B20"><label>20.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>Y-J</given-names></name><name><surname>Shen</surname><given-names>W-H</given-names></name><name><surname>Chung</surname><given-names>H-W</given-names></name><name><surname>Chiu</surname><given-names>J-H</given-names></name><name><surname>Juan</surname><given-names>D-C</given-names></name><name><surname>Ho</surname><given-names>T-Y</given-names></name><etal/></person-group> <article-title>Representative image feature extraction via contrastive learning pretraining for chest x-ray report generation</article-title> (<year>2022</year>).</citation></ref>
<ref id="B21"><label>21.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Han</surname><given-names>W</given-names></name> <name><surname>Kim</surname><given-names>C</given-names></name> <name><surname>Ju</surname><given-names>D</given-names></name> <name><surname>Shim</surname><given-names>Y</given-names></name> <name><surname>Hwang</surname><given-names>SJ</given-names></name></person-group>. <article-title>Advancing text-driven chest x-ray generation with policy-based reinforcement learning</article-title>. <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention</conf-name>. <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer Nature Switzerland</publisher-name> (<year>2024</year>).</citation></ref>
<ref id="B22"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Parres</surname><given-names>D</given-names></name><name><surname>Albiol</surname><given-names>A</given-names></name><name><surname>Paredes</surname><given-names>R</given-names></name></person-group>. <article-title>Improving radiology report generation quality and diversity through reinforcement learning and text augmentation</article-title>. <source>Bioengineering</source>. (<year>2024</year>) <volume>11</volume>(<issue>4</issue>):<fpage>351</fpage>. <pub-id pub-id-type="doi">10.3390/bioengineering11040351</pub-id><pub-id pub-id-type="pmid">38671773</pub-id></citation></ref>
<ref id="B23"><label>23.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname><given-names>Y</given-names></name> <name><surname>Liu</surname><given-names>L-J</given-names></name> <name><surname>Yang</surname><given-names>X-B</given-names></name> <name><surname>Peng</surname><given-names>W</given-names></name> <name><surname>Huang</surname><given-names>Q-S</given-names></name></person-group>. <article-title>Chest radiology report generation based on cross-modal multi-scale feature fusion</article-title>. <source>J Radiat Res Appl Sci</source>. (<year>2024</year>) <volume>17</volume>(<issue>1</issue>):<fpage>100823</fpage>. <pub-id pub-id-type="doi">10.1016/j.jrras.2024.100823</pub-id></citation></ref>
<ref id="B24"><label>24.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shahzadi</surname><given-names>I</given-names></name><name><surname>Madni</surname><given-names>TM</given-names></name> <name><surname>Janjua</surname><given-names>UI</given-names></name> <name><surname>Batool</surname><given-names>G</given-names></name> <name><surname>Naz</surname><given-names>B</given-names></name> <name><surname>Ali</surname><given-names>MQ</given-names></name></person-group>. <article-title>CSAMDT: conditional self-attention memory-driven transformers for radiology report generation from chest x-ray</article-title>. <source>J Imaging Inform Med</source>. (<year>2024</year>) <volume>37</volume>:<fpage>1</fpage>&#x2013;<lpage>13</lpage>.</citation></ref>
<ref id="B25"><label>25.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Sharma</surname><given-names>H</given-names></name><name><surname>Salvatelli</surname><given-names>V</given-names></name> <name><surname>Srivastav</surname><given-names>S</given-names></name> <name><surname>Bouzid</surname><given-names>K</given-names></name> <name><surname>Bannur</surname><given-names>S</given-names></name> <name><surname>Castro</surname><given-names>DC</given-names></name><etal/></person-group> <article-title>MAIRA-Seg: enhancing radiology report generation with segmentation-aware multimodal large language models. <italic>arXiv preprint arXiv:2411.11362</italic></article-title> (<year>2024</year>).</citation></ref>
<ref id="B26"><label>26.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tanno</surname><given-names>R</given-names></name><name><surname>Barrett</surname><given-names>DGT</given-names></name><name><surname>Sellergren</surname><given-names>A</given-names></name><name><surname>Ghaisas</surname><given-names>S</given-names></name> <name><surname>Dathathri</surname><given-names>S</given-names></name> <name><surname>See</surname><given-names>A</given-names></name><etal/></person-group> <article-title>Collaboration between clinicians and vision&#x2013;language models in radiology report generation</article-title>. <source>Nat Med</source>. (<year>2024</year>). <pub-id pub-id-type="doi">10.1038/s41591-024-03302-1</pub-id></citation></ref>
<ref id="B27"><label>27.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bastos</surname><given-names>M</given-names></name><name><surname>Nascimento</surname><given-names>J</given-names></name><name><surname>Calisto</surname><given-names>FM</given-names></name></person-group>. <article-title>Human-centered design of a semantic annotation tool for breast cancer diagnosis</article-title>. (<year>2024</year>). <pub-id pub-id-type="doi">10.13140/RG.2.2.14982.38722</pub-id></citation></ref>
<ref id="B28"><label>28.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bluethgen</surname><given-names>C</given-names></name><name><surname>Chambon</surname><given-names>P</given-names></name><name><surname>Delbrouck</surname><given-names>JB</given-names></name><name><surname>van der Sluijs</surname><given-names>R</given-names></name><name><surname>Po&#x0142;acin</surname><given-names>M</given-names></name> <name><surname>Zambrano Chaves</surname><given-names>JM</given-names></name><etal/></person-group> <article-title>A vision&#x2013;language foundation model for the generation of realistic chest x-ray images</article-title>. <source>Nat Biomed Eng</source>. (<year>2024</year>) <fpage>1</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1038/s41551-024-01246-y</pub-id><pub-id pub-id-type="pmid">38253670</pub-id></citation></ref>
<ref id="B29"><label>29.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Papineni</surname><given-names>K</given-names></name><name><surname>Roukos</surname><given-names>S</given-names></name><name><surname>Ward</surname><given-names>T</given-names></name><name><surname>Zhu</surname><given-names>W-J</given-names></name></person-group>. <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>. <conf-name>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</conf-name> (<year>2002</year>). p. <fpage>311</fpage>&#x2013;<lpage>8</lpage></citation></ref>
<ref id="B30"><label>30.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Lin</surname><given-names>C-Y</given-names></name></person-group>. <article-title>ROUGE: a package for automatic evaluation of summaries</article-title>. <conf-name>Text Summarization Branches Out: Proceedings of the ACL-04 Workshop</conf-name> (<year>2004</year>). p. <fpage>74</fpage>&#x2013;<lpage>81</lpage></citation></ref>
<ref id="B31"><label>31.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Denkowski</surname><given-names>M</given-names></name><name><surname>Lavie</surname><given-names>A</given-names></name></person-group>. <article-title>METEOR 1.3: automatic metric for reliable optimization and evaluation of machine translation systems</article-title>. <conf-name>Proceedings of the Sixth Workshop on Statistical Machine Translation</conf-name> (<year>2011</year>).</citation></ref>
<ref id="B32"><label>32.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Niksaz</surname><given-names>S</given-names></name><name><surname>Ghasemian</surname><given-names>F</given-names></name></person-group>. <article-title>Improving chest x-ray report generation by leveraging text of similar images</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B33"><label>33.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Magalh&#x00E3;es</surname><given-names>GV</given-names><suffix>Junior</suffix></name><name><surname>de S. Santos</surname><given-names>RL</given-names></name><name><surname>Vogado</surname><given-names>LHS</given-names></name><name><surname>Cardoso de Paiva</surname><given-names>A</given-names></name><name><surname>dos Santos Neto</surname><given-names>PA</given-names></name></person-group>. <article-title>XRaySwinGen: Automatic medical reporting for x-ray exams with multimodal model</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B34"><label>34.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Yelure</surname><given-names>BS</given-names></name><name><surname>Kuchabal</surname><given-names>GS</given-names></name><name><surname>Chaure</surname><given-names>SH</given-names></name><name><surname>Bhingardive</surname><given-names>SR</given-names></name><name><surname>Patil</surname><given-names>GS</given-names></name><name><surname>Pati</surname><given-names>PS</given-names></name></person-group>. <article-title>Summarized automatic medical report generation on chest x-ray using deep learning</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B35"><label>35.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Alqahtani</surname><given-names>FF</given-names></name><name><surname>Mohsan</surname><given-names>MM</given-names></name><name><surname>Alshamrani</surname><given-names>K</given-names></name><name><surname>Zeb</surname><given-names>J</given-names></name><name><surname>Alhamami</surname><given-names>S</given-names></name><name><surname>Alqarni</surname><given-names>D</given-names></name></person-group>. <article-title>CNX-B2: A novel CNN-transformer approach for chest x-ray medical report generation</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B36"><label>36.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Singh</surname><given-names>S</given-names></name></person-group>. <article-title>Clinical context-aware radiology report generation from medical images using transformers</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B37"><label>37.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Shaikh</surname><given-names>Z</given-names></name><name><surname>Bharti</surname><given-names>J</given-names></name></person-group>. <article-title>Transformer-based chest x-ray report generation model</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B38"><label>38.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Alfarghaly</surname><given-names>O</given-names></name><name><surname>Khaled</surname><given-names>R</given-names></name><name><surname>Elkorany</surname><given-names>A</given-names></name><name><surname>Helal</surname><given-names>M</given-names></name><name><surname>Fahmy</surname><given-names>A</given-names></name></person-group>. <article-title>Automated radiology report generation using conditioned transformers</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B39"><label>39.</label><citation citation-type="other"><person-group person-group-type="author"><name><surname>Raminedi</surname><given-names>S</given-names></name><name><surname>Shridevi</surname><given-names>S</given-names></name><name><surname>Won</surname><given-names>D</given-names></name></person-group>. <article-title>Multi-modal transformer architecture for medical image analysis and automated report generation</article-title> (<year>n.d.</year>).</citation></ref>
<ref id="B40"><label>40.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname><given-names>L</given-names></name><name><surname>Zhao</surname><given-names>Z</given-names></name><name><surname>Lu</surname><given-names>Y</given-names></name><name><surname>Tang</surname><given-names>K</given-names></name><name><surname>Fu</surname><given-names>L</given-names></name><name><surname>Liang</surname><given-names>Q</given-names></name><etal/></person-group> <article-title>Opportunities and challenges in the application of large artificial intelligence models in radiology</article-title>. <source>Meta-Radiology</source>. (<year>2024</year>) <volume>2</volume>(<issue>2</issue>):<fpage>100080</fpage>. <pub-id pub-id-type="doi">10.1016/j.metrad.2024.100080</pub-id></citation></ref></ref-list>
</back>
</article>