<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.3 20210610//EN" "JATS-journalpublishing1-3-mathml3.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Oncol.</journal-id>
<journal-title-group>
<journal-title>Frontiers in Oncology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Oncol.</abbrev-journal-title>
</journal-title-group>
<issn pub-type="epub">2234-943X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fonc.2025.1643504</article-id>
<article-version article-version-type="Version of Record" vocab="NISO-RP-8-2008"/>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Molecular-informed image classification for predicting drug sensitivity in cancer therapy</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Qu</surname><given-names>Chunmei</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>*</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/3094196/overview"/>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; original draft" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/">Writing &#x2013; original draft</role>
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="Writing &#x2013; review &amp; editing" vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-review-editing/">Writing &#x2013; review &amp; editing</role>
</contrib>
</contrib-group>
<aff id="aff1"><institution>Internet Academy, Anhui University</institution>, <city>Hefei</city>,&#xa0;<country country="cn">China</country></aff>
<author-notes>
<corresp id="c001"><label>*</label>Correspondence: Chunmei Qu, <email xlink:href="mailto:l5eolzo0@163.com">l5eolzo0@163.com</email></corresp>
</author-notes>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2026-01-12">
<day>12</day>
<month>01</month>
<year>2026</year>
</pub-date>
<pub-date publication-format="electronic" date-type="collection">
<year>2025</year>
</pub-date>
<volume>15</volume>
<elocation-id>1643504</elocation-id>
<history>
<date date-type="received">
<day>13</day>
<month>06</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>29</day>
<month>10</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2026 Qu.</copyright-statement>
<copyright-year>2026</copyright-year>
<copyright-holder>Qu</copyright-holder>
<license>
<ali:license_ref start_date="2026-01-12">https://creativecommons.org/licenses/by/4.0/</ali:license_ref>
<license-p>This is an open-access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License (CC BY)</ext-link>. The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</license-p>
</license>
</permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>Understanding and predicting drug sensitivity in cancer therapy demands innovative  approaches that integrate multi-modal data to enhance treatment efficacy. In alignment with the advancing scope of precision oncology and the molecularly informed therapeutic decision-making emphasized by contemporary cancer research, this work proposes a dynamic and structure-aware imaging framework for robust molecular-informed image classification. Traditional methodologies often suffer from rigid modeling assumptions and inadequate handling of complex, heterogeneous noise prevalent in biological imaging, which limits their predictive accuracy and generalizability.</p>
</sec>
<sec>
<title>Methods</title>
<p>To address these challenges, we introduce a novel dynamic structure-aware imaging network (DSINet) coupled with a progressive structure-guided optimization (PSGO) strategy. DSINet dynamically adapts spatial filters based on local molecular content, preserves critical biological structures through attention mechanisms, and incorporates uncertainty-aware fusion across multiple resolutions. PSGO further refines the reconstruction by progressively focusing optimization on high-confidence regions and adaptively restructuring feature graphs to enhance robustness against variable imaging artifacts.</p>
</sec>
<sec>
<title>Results and Discussion</title>
<p>Extensive experimental evaluations demonstrate that our method significantly outperforms techniques in classifying molecular patterns correlated with drug sensitivity, offering a reliable and interpretable foundation for advancing personalized cancer therapy strategies. This approach seamlessly integrates cutting-edge adaptive imaging models with the emerging needs of molecular-insight-driven therapeutic optimization, bridging critical gaps in current cancer informatics research.</p>
</sec>
</abstract>
<kwd-group>
<kwd>drug sensitivity prediction</kwd>
<kwd>molecular-informed imaging</kwd>
<kwd>adaptive imaging model</kwd>
<kwd>structure-aware optimization</kwd>
<kwd>cancer therapy classification</kwd>
</kwd-group>
<funding-group>
<funding-statement>The author(s) declared that financial support was not received for this work and/or its publication.</funding-statement>
</funding-group>
<counts>
<fig-count count="5"/>
<table-count count="13"/>
<equation-count count="44"/>
<ref-count count="44"/>
<page-count count="20"/>
<word-count count="13237"/>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Cancer Imaging and Image-directed Interventions</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>The prediction of drug sensitivity in cancer therapy has become a central focus in precision medicine, aiming to tailor treatment strategies to individual patient profiles. While traditional biomarkers such as genetic mutations provide valuable insights, they are often insufficient to fully explain variations in therapeutic outcomes. This limitation arises from the complex and heterogeneous nature of tumors, including diverse phenotypic traits and dynamic tumor microenvironments that can interfere with drug efficacy (<xref ref-type="bibr" rid="B1">1</xref>).</p>
<p>To address these challenges, molecular-informed image classification has emerged as a promising approach. By integrating histopathological imaging with molecular-level data (like gene expression, mutations), this method provides a more comprehensive view of tumor biology. It not only enhances the predictive accuracy of treatment outcomes but also facilitates the discovery of therapeutic targets and resistance mechanisms, thereby supporting more informed clinical decision-making and personalized therapy design (<xref ref-type="bibr" rid="B2">2</xref>).</p>
<p>Initial efforts in this field relied heavily on manual interpretation and expert-defined image descriptors. Researchers focused on predefined visual features such as nuclear size, tissue texture, and cellular organization (<xref ref-type="bibr" rid="B3">3</xref>). These handcrafted features were typically used in rule-based models guided by human expertise. Although interpretable and biologically grounded, such systems struggled to generalize across diverse cancer types (<xref ref-type="bibr" rid="B4">4</xref>) and lacked the flexibility to incorporate molecular-level variability, limiting their applicability for individualized prediction.</p>
<p>Subsequent advancements introduced computational models capable of learning patterns directly from labeled data, reducing dependence on manual feature engineering. Techniques such as support vector classifiers, decision forests, and boosting methods were applied to histopathological image analysis (<xref ref-type="bibr" rid="B5">5</xref>), demonstrating better generalization by leveraging statistical patterns in large datasets (<xref ref-type="bibr" rid="B6">6</xref>). These models also allowed for the integration of molecular profiles as auxiliary input features. However, with the increasing complexity of data&#x2014;including whole-slide images and high-dimensional omics information&#x2014;these approaches began to face challenges related to scalability and sensitivity (<xref ref-type="bibr" rid="B7">7</xref>).</p>
<p>More recently, deep learning models have gained prominence due to their ability to extract rich, hierarchical features from raw input data. Convolutional neural networks (CNNs) have been widely adopted for image-based tasks, while multi-branch architectures enable the simultaneous processing of molecular data (<xref ref-type="bibr" rid="B8">8</xref>). Joint learning frameworks further enhance the ability to model intricate associations between tissue morphology and molecular alterations, leading to improved performance in drug sensitivity prediction (<xref ref-type="bibr" rid="B9">9</xref>). Nevertheless, issues such as lack of interpretability, limited robustness across clinical settings, and the need for large, well-annotated datasets remain open challenges (<xref ref-type="bibr" rid="B10">10</xref>).</p>
<p>To overcome these limitations, this work presents a molecular-informed image classification framework that integrates multi-modal data through an efficient, interpretable, and robust architecture. The proposed method is designed to improve the accuracy of drug sensitivity prediction while maintaining adaptability to complex and heterogeneous biomedical data.</p>
<list list-type="bullet">
<list-item>
<p>Our method introduces a multi-modal transformer architecture that jointly models histopathological images and molecular data, capturing complex cross-modal relationships with minimal feature engineering.</p></list-item>
<list-item>
<p>It features a modular design that ensures adaptability across different cancer types, demonstrating high efficiency and generalizability in multi-scenario clinical settings.</p></list-item>
<list-item>
<p>Experimental results show that our model significantly outperforms state-of-the-art baselines in multiple benchmark datasets, achieving improved prediction accuracy, robustness, and interpretability.</p></list-item>
</list>
</sec>
<sec id="s2">
<label>2</label>
<title>Related work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Molecular representations in imaging</title>
<p>The integration of molecular information into medical imaging has emerged as a key direction for improving drug sensitivity prediction in cancer treatment. Traditional imaging methods primarily focus on visible tumor features, such as size, shape, and contrast enhancement patterns. While clinically useful, these visual cues often fail to reflect the molecular diversity that underlies variations in treatment response (<xref ref-type="bibr" rid="B10">10</xref>). Radiogenomics has established a foundational link between imaging features and molecular characteristics, enabling researchers to identify image-based biomarkers that correspond to specific biological pathways (<xref ref-type="bibr" rid="B11">11</xref>). In recent years, deep learning&#x2014;particularly convolutional neural networks (CNNs)&#x2014;has been increasingly used to automate this process. These models can learn complex patterns in imaging data that correlate with molecular traits, such as those linked to drug resistance (<xref ref-type="bibr" rid="B12">12</xref>). Public datasets like The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA) have supported the development of such integrated models. Studies have shown that incorporating gene expression data into imaging pipelines can significantly improve predictive accuracy (<xref ref-type="bibr" rid="B13">13</xref>). Moreover, integration has expanded beyond transcriptomics to include other molecular dimensions such as somatic mutations, copy number variations, and DNA methylation profiles, enriching image-based classification with multi-omics information (<xref ref-type="bibr" rid="B14">14</xref>). Despite this progress, several challenges remain. One major issue is the variability across datasets&#x2014;in both imaging modalities and molecular profiling platforms&#x2014;which complicates model training and generalization. Addressing this requires advanced normalization methods and domain adaptation strategies (<xref ref-type="bibr" rid="B15">15</xref>). Another ongoing research focus is model interpretability. It is critical to ensure that the image features used for prediction correspond to meaningful biological phenomena rather than artifacts or correlations without causation (<xref ref-type="bibr" rid="B16">16</xref>). Improving these aspects is essential to gain clinical trust and to enable the practical deployment of molecular-informed imaging models in precision oncology (<xref ref-type="bibr" rid="B17">17</xref>).</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Deep learning for drug response prediction</title>
<p>Deep learning models have emerged as essential tools for predicting drug response due to their capability to capture high-dimensional and non-linear relationships intrinsic to biomedical data (<xref ref-type="bibr" rid="B18">18</xref>). Traditional approaches to drug sensitivity prediction in cancer, which have predominantly utilized cell line assays or patient-derived xenografts, are constrained by substantial resource requirements and limited scalability (<xref ref-type="bibr" rid="B19">19</xref>). The availability of extensive pharmacogenomic datasets, including the Genomics of Drug Sensitivity in Cancer (GDSC) and the Cancer Cell Line Encyclopedia (CCLE), has enabled the development of deep learning architectures that map comprehensive molecular profiles to therapeutic outcomes (<xref ref-type="bibr" rid="B20">20</xref>). Autoencoders, graph neural networks, and multimodal deep learning frameworks have been deployed to integrate genomic, transcriptomic, and proteomic data with drug molecular characteristics to forecast efficacy (<xref ref-type="bibr" rid="B21">21</xref>). Incorporation of imaging modalities into predictive models has facilitated the learning of joint feature representations that simultaneously capture phenotypic traits and molecular determinants of drug response (<xref ref-type="bibr" rid="B22">22</xref>). Multimodal variational autoencoders (MVAEs) have been employed to simultaneously encode histopathology images and molecular profiles, resulting in enhanced prediction performance across heterogeneous cancer types (<xref ref-type="bibr" rid="B23">23</xref>). Optimization of these models frequently involves specialized loss functions, including contrastive loss and triplet loss, to ensure alignment between multimodal feature spaces and therapeutic responses (<xref ref-type="bibr" rid="B24">24</xref>). Despite advances, significant obstacles persist, notably the scarcity of labeled data, pronounced heterogeneity across cancer subtypes, and difficulties in achieving model generalization across diverse patient cohorts (<xref ref-type="bibr" rid="B25">25</xref>). Strategies such as transfer learning and few-shot learning are being actively explored to address these limitations, promoting the development of robust, scalable, and clinically translatable deep learning systems for drug response prediction (<xref ref-type="bibr" rid="B26">26</xref>).</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Multimodal data fusion techniques</title>
<p>The integration of multimodal data, encompassing imaging, molecular profiles, clinical metadata, and therapeutic outcomes, represents a pivotal approach for advancing drug sensitivity prediction in oncology (<xref ref-type="bibr" rid="B27">27</xref>). Multimodal data fusion techniques are typically categorized into early fusion, intermediate fusion, and late fusion, with each strategy offering distinct trade-offs regarding information preservation and model complexity (<xref ref-type="bibr" rid="B28">28</xref>). Early fusion methods involve the concatenation of raw features from disparate modalities prior to modeling, although they often encounter challenges associated with dimensionality explosion and modality dominance (<xref ref-type="bibr" rid="B29">29</xref>). Intermediate fusion approaches, which entail learning-modality-specific latent representations before their integration through mechanisms such as attention or latent alignment, have demonstrated a superior balance between modality fidelity and cross-modal interaction (<xref ref-type="bibr" rid="B30">30</xref>). Late fusion techniques independently model each modality and subsequently amalgamate predictions using ensemble strategies, enhancing robustness at the potential cost of synergistic feature utilization (<xref ref-type="bibr" rid="B31">31</xref>). The application of transformer architectures for multimodal fusion, leveraging cross-attention mechanisms to dynamically model inter-modal dependencies, has recently achieved notable success in enhancing predictive accuracy and interpretability (<xref ref-type="bibr" rid="B32">32</xref>). Cross-modal contrastive learning has further strengthened the ability of models to align heterogeneous modality representations within a unified embedding space, promoting generalization across diverse datasets (<xref ref-type="bibr" rid="B33">33</xref>). Concurrently, the incorporation of explainable artificial intelligence (XAI) techniques into multimodal fusion frameworks has facilitated the attribution of predictive outcomes to specific modalities, fostering transparency and clinical confidence in model outputs. As multimodal fusion methodologies continue to evolve, they offer transformative potential for personalizing cancer therapy through comprehensive and molecularly informed image-based drug response prediction.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Method</title>
<sec id="s3_1">
<label>3.1</label>
<title>Overview</title>
<p>This section systematically introduces the core components of the proposed framework for imaging problems. Section 3.2 establishes the fundamental concepts and formal notations, where the imaging model is defined, key mathematical abstractions are articulated, and the problem setting is formalized. The imaging formation process is modeled as <inline-formula>
<mml:math display="inline" id="im1"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:math></inline-formula>, where <italic>x</italic> denotes the latent clean image, <italic>y</italic> represents the observed degraded image, H is the degradation operator, and <italic>n</italic> denotes additive noise. The degradation operator H can encapsulate blurring, downsampling, or a mixture of complex distortions. This foundation ensures that subsequent developments are rooted in precise mathematical formulations and notational consistency. Section 3.3 presents the newly proposed imaging model, which addresses the complexities inherent in real-world visual degradations. Instead of adopting rigid assumptions on noise distribution or blur kernels, the model parameterizes the degradation process using adaptable structure that adapts to spatially varying conditions. Let the degradation operator be parameterized as H<italic><sub>&#x3b8;</sub></italic>, where <italic>&#x3b8;</italic> represents learnable parameters inferred from the degraded observation <italic>y</italic>. A spatially adaptive convolutional mechanism is embedded within H<italic><sub>&#x3b8;</sub></italic> to model heterogeneous distortions across the image domain. The noise <italic>n</italic> is treated as a realization from a location-dependent distribution <inline-formula>
<mml:math display="inline" id="im2"><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x223c;</mml:mo><mml:mi mathvariant="script">N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x3c3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula>
<mml:math display="inline" id="im3"><mml:mrow><mml:msup><mml:mi>&#x3c3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> varies with the underlying content. This formulation significantly enhances the model&#x2019;s capacity to handle diverse degradation types and non-stationary noise patterns.</p>
<p>Following the model construction, Section 3.4 proposes a progressive optimization strategy for model training. Departing from conventional single-pass or heuristic-guided approaches, the optimization unfolds over multiple refinement stages. At each stage <italic>t</italic>, the reconstruction <inline-formula>
<mml:math display="inline" id="im4"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is updated by selectively activating high-confidence regions, measured by a certainty map <inline-formula>
<mml:math display="inline" id="im5"><mml:mrow><mml:mi>C</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. The certainty map is derived from the posterior distribution over the latent image space and guides the learning objective to prioritize reliable regions before addressing more uncertain parts. Formally, the update rule incorporates a masked loss function <inline-formula>
<mml:math display="inline" id="im6"><mml:mrow><mml:msub><mml:mi mathvariant="script">L</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mi>C</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2299;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi mathvariant="script">H</mml:mi><mml:mi>&#x3b8;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msup><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula>, where <inline-formula>
<mml:math display="inline" id="im7"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> denotes element-wise multiplication. This progressive mechanism not only stabilizes convergence but also mitigates the risk of overfitting to corrupted regions during early stages. Throughout these sections, mathematical rigor and methodological clarity are emphasized. Operators, distributions, and functions are carefully defined to ensure transparent interpretations. Optimization objectives are designed to be both theoretically sound and computationally tractable. The methodology combines classical principles from inverse problems with recent advances in deep learning, resulting in a hybrid framework that leverages domain-specific priors while maintaining flexibility through data-driven learning. The degradation operator <inline-formula>
<mml:math display="inline" id="im8"><mml:mrow><mml:msub><mml:mi mathvariant="script">H</mml:mi><mml:mi>&#x3b8;</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, certainty map <inline-formula>
<mml:math display="inline" id="im9"><mml:mrow><mml:mi>C</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and noise modeling <italic>&#x3c3;</italic><sup>2</sup>(<italic>x</italic>) are seamlessly integrated into a unified architecture, allowing coherent end-to-end training. The overall approach decomposes the imaging reconstruction problem into three structured modules: degradation modeling, noise-adaptive regularization, and confidence-driven optimization. The degradation modeling module parameterizes distortions through a learnable convolutional structure, the noise-adaptive regularization module captures spatially varying uncertainty, and the confidence-driven optimization module progressively refines the reconstruction by exploiting spatial reliability. This decomposition not only improves interpretability but also yields significant empirical performance gains across multiple tasks.</p>
<p>The proposed method establishes a robust framework for a wide range of imaging tasks, including deblurring, denoising, and super-resolution. Extensive experimental results demonstrate that the method consistently surpasses state-of-the-art baselines across various benchmarks. The adaptability to non-uniform degradations and the progressive learning paradigm lead to improved reconstruction fidelity and robustness under both synthetic and real-world degradation scenarios. Each component introduced in the following sections is meticulously designed to interoperate, resulting in a coherent system that balances modeling accuracy, computational efficiency, and learning stability. Section 3.2 specifies the imaging problem setup and defines the notation used throughout the paper. Section 3.3 elaborates the detailed architecture and parameterization of the proposed degradation model. Section 3.4 describes the progressive optimization procedure, detailing how certainty maps are constructed and utilized to guide learning. This section provides a detailed exposition of the methodology, preparing the foundation for the theoretical analysis and experimental validation presented in subsequent parts. By carefully integrating model design, adaptive regularization, and progressive optimization, the proposed framework achieves superior performance with enhanced theoretical guarantees and practical effectiveness.</p>
<p>To provide a clearer understanding of the overall architecture, a system-level overview of the proposed framework is illustrated in <xref ref-type="fig" rid="f1"><bold>Figure&#xa0;1</bold></xref>. The model consists of two main components: DSINet, which performs spatially adaptive dynamic filtering and multi-resolution refinement, and PSGO, which carries out progressive structure-guided optimization. The entire pipeline begins with degradation modeling and noise-adaptive regularization, followed by a dual-stage reconstruction strategy that combines content-awareness and confidence-driven refinement. <xref ref-type="fig" rid="f2"><bold>Figure&#xa0;2</bold></xref> depicts the internal structure of DSINet. It dynamically generates spatially adaptive filters via a vision transformer and text-guided token sampler. These filters are applied in a context-aware manner to the input image. A structure-preserving attention mechanism ensures that biologically important regions are retained. The model further integrates multi-resolution refinement and uncertainty modeling to support robust image representation. <xref ref-type="fig" rid="f3"><bold>Figure&#xa0;3</bold></xref> shows the three stages of PSGO, namely: adaptive confidence-guided decomposition, where reliable and uncertain regions are separated using a learned confidence map; iterative restructuring with uncertainty adaptation, which updates graph structures using attention-based mechanisms; and dynamic graph-regularized propagation, which propagates information across spatially consistent regions to obtain a high-fidelity reconstruction. These components together enable the model to adaptively handle noise, structural variation, and uncertainty in biomedical images.</p>
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Overall architecture of the proposed DSINet + PSGO framework. The system begins with degradation modeling of a latent image and proceeds through noise-adaptive regularization and progressive optimization. Certainty maps guide confidence-driven refinement steps, and the entire process is iteratively updated until convergence. This unified pipeline enables robust reconstruction and classification in the presence of molecular and visual heterogeneity.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-15-1643504-g001.tif">
<alt-text content-type="machine-generated">Flowchart illustrating an image restoration process. It starts with a latent image undergoing noise-adaptive regularization and degradation modeling. Progressive and confidence-driven optimizations follow, feeding into a certainty map and refinement. The process leads to image reconstruction.</alt-text>
</graphic></fig>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>Overview of DSINet architecture for robust image reconstruction. The DSINet framework is composed of three main modules: spatially adaptive dynamic filtering, where informative tokens are extracted via a vision transformer and text-guided sampler to generate dynamic convolutional filters <italic>A<sub>&#x3b8;</sub></italic>(<italic>x</italic>); Structure-preserving attention modulation, which combines structural features <italic>s</italic> with adaptive weights <italic>&#x3b1;</italic> to retain biologically meaningful regions; and multi-resolution refinement and uncertainty modeling, which aggregates multi-scale features using attention maps and applies uncertainty-aware corrections via &#x3a3;(<italic>x</italic>). The final prediction is computed as <inline-formula>
<mml:math display="inline" id="im10"><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>&#x2dc;</mml:mo></mml:mover></mml:math></inline-formula> = <italic>x</italic> + <italic>&#x3b2;S</italic>(<italic>x</italic>), enabling structure-consistent and noise-robust reconstruction.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-15-1643504-g002.tif">
<alt-text content-type="machine-generated">Diagram illustrating a model for spatially-adaptive dynamic filtering and multi-resolution refinement. The process begins with a vision transformer and text-guided sampler creating informative tokens. These tokens are processed into structure-preserving attention modulation. The attention map and structure features are integrated for uncertainty modeling. Components like predictive uncertainty and structure extraction contribute to refining the output through various equations and models, resulting in modified outputs denoted by \(\tilde{x} = x + \beta\) and \(S(x)\). The diagram uses labeled boxes and arrows to show the flow of data between processes.</alt-text>
</graphic></fig>
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>This figure illustrates the overall workflow of the progressive structure-guided optimization (PSGO) framework. The system consists of three sequential components: adaptive confidence-guided decomposition, where the input domain is separated into reliable and uncertain regions based on a structural confidence map; iterative restructuring with uncertainty adaptation, which applies attention-based graph refinement and dynamically adjusts the confidence map over iterations; and dynamic graph-regularized propagation, where structural consistency is reinforced using feature-based graphs, reward-driven updates, and gradient normalization. The pipeline produces a high-fidelity reconstruction robust to spatial and annotation uncertainty.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-15-1643504-g003.tif">
<alt-text content-type="machine-generated">Flowchart depicting three interconnected processes: Adaptive Confidence-Guided Decomposition, Iterative Restructuring with Uncertainty Adaptation, and Dynamic Graph-Regularized Propagation. The flow begins with an Objective Function leading to a Confidence Map, then splits into Reliable and Uncertain Regions. Iterative Restructuring starts with Initial Estimate moving through Attention, Graph Update, and Structural Construction. Dynamic Propagation includes Feature-Based Graph, Reward Computation, and Gradient Normalization, culminating in High-Fidelity Image Reconstruction.</alt-text>
</graphic></fig>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Preliminaries</title>
<p>This section formalizes the imaging problem considered in this work by introducing the mathematical notations, degradation models, and key assumptions that underpin subsequent developments.</p>
<p>Let <inline-formula>
<mml:math display="inline" id="im11"><mml:mrow><mml:mi>x</mml:mi><mml:mtext>&#xa0;&#x2208;&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> denote the latent sharp image, and <inline-formula>
<mml:math display="inline" id="im12"><mml:mrow><mml:mi>y</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> the observed degraded image. The degradation process is modeled as&#xa0;(<xref ref-type="disp-formula" rid="eq1">Equation 1</xref>):</p>
<disp-formula id="eq1"><label>(1)</label>
<mml:math display="block" id="M1"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x3f5;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im13"><mml:mrow><mml:mi mathvariant="script">H</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is a degradation operator and <inline-formula>
<mml:math display="inline" id="im14"><mml:mrow><mml:mi>&#x454;</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> represents additive noise.</p>
<p>A standard instance of <inline-formula>
<mml:math display="inline" id="im15"><mml:mrow><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> involves convolution with a blur kernel <inline-formula>
<mml:math display="inline" id="im16"><mml:mrow><mml:mi>k</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>m</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> (<xref ref-type="disp-formula" rid="eq2">Equation 2</xref>):</p>
<disp-formula id="eq2"><label>(2)</label>
<mml:math display="block" id="M2"><mml:mrow><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>k</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo>*</mml:mo><mml:mtext>&#xa0;</mml:mtext><mml:mi>x</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where &#x2217; denotes the convolution operation under specified boundary conditions.</p>
<p>For spatially variant degradations, the degradation operator is expressed as (<xref ref-type="disp-formula" rid="eq3">Equation 3</xref>):</p>
<disp-formula id="eq3"><label>(3)</label>
<mml:math display="block" id="M3"><mml:mrow><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#xa0;&#x3a9;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where &#x2126;(<italic>i</italic>) defines the neighborhood around pixel <italic>i</italic>, and <italic>k<sub>i,j</sub></italic>are location-dependent kernel weights. The noise term $ accounts for both Gaussian and structured perturbations decomposed as (<xref ref-type="disp-formula" rid="eq4">Equation 4</xref>):</p>
<disp-formula id="eq4"><label>(4)</label>
<mml:math display="block" id="M4"><mml:mrow><mml:mtext>&#xa0;</mml:mtext><mml:mi>&#x454;</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x454;</mml:mi><mml:mi>G</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x454;</mml:mi><mml:mi>S</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im81"><mml:mrow><mml:msub><mml:mi>&#x454;</mml:mi><mml:mi>G</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> represents Gaussian noise and <inline-formula>
<mml:math display="inline" id="im82"><mml:mrow><mml:msub><mml:mi>&#x454;</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> models sparse outliers such as impulsive noise or saturation 217 artifacts.</p>
<p>s, let <inline-formula>
<mml:math display="inline" id="im17"><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#xd7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> denote the convolution matrix associated with <inline-formula>
<mml:math display="inline" id="im18"><mml:mi>k</mml:mi></mml:math></inline-formula>, and <inline-formula>
<mml:math display="inline" id="im19"><mml:mrow><mml:mi>E</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#xd7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> the covariance matrix of the noise. The degradation model becomes (<xref ref-type="disp-formula" rid="eq5">Equation 5</xref>):</p>
<disp-formula id="eq5"><label>(5)</label>
<mml:math display="block" id="M5"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>H</mml:mi><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>e</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im20"><mml:mrow><mml:mi>e</mml:mi><mml:mo>&#x223c;</mml:mo><mml:mi mathvariant="script">N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>E</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> in the Gaussian component.</p>
<p>The ill-posed nature of the inverse problem necessitates regularization. One widely adopted prior assumes sparsity in the gradient domain (<xref ref-type="disp-formula" rid="eq6">Equation 6</xref>):</p>
<disp-formula id="eq6"><label>(6)</label>
<mml:math display="block" id="M6"><mml:mrow><mml:mo>&#x2207;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x223c;</mml:mo><mml:mtext>sparse</mml:mtext><mml:mi>&#x2004;</mml:mi><mml:mtext>distribution</mml:mtext><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im21"><mml:mrow><mml:mo>&#x2207;</mml:mo><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mi>h</mml:mi></mml:msub><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mi>v</mml:mi></mml:msub><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula>
<mml:math display="inline" id="im22"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>h</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, <inline-formula>
<mml:math display="inline" id="im23"><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mi>v</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> are horizontal and vertical difference operators, respectively.</p>
<p>To capture high-level structures, a feature extraction operator <inline-formula>
<mml:math display="inline" id="im24"><mml:mrow><mml:mi mathvariant="script">F</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>p</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is introduced (<xref ref-type="disp-formula" rid="eq7">Equation 7</xref>):</p>
<disp-formula id="eq7"><label>(7)</label>
<mml:math display="block" id="M7"><mml:mrow><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="script">F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im25"><mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>p</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> encodes salient features including edges, textures, or semantic patterns.</p>
<p>Modeling complex degradations often requires incorporating nonlinearities. The forward model is thus extended to (<xref ref-type="disp-formula" rid="eq8">Equation&#xa0;8</xref>):</p>
<disp-formula id="eq8"><label>(8)</label>
<mml:math display="block" id="M8"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="script">G</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x454;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im26"><mml:mrow><mml:mi mathvariant="script">G</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> denotes a nonlinear transformation accounting for effects such as clipping, gamma correction, or sensor-specific distortions.</p>
<p>The imaging recovery objective is formulated as an optimization problem (<xref ref-type="disp-formula" rid="eq9">Equation 9</xref>):</p>
<disp-formula id="eq9"><label>(9)</label>
<mml:math display="block" id="M9"><mml:mrow><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mtext>arg&#xa0;</mml:mtext><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mi>x</mml:mi></mml:munder><mml:mi mathvariant="script">D</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im27"><mml:mrow><mml:mi mathvariant="script">D</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> measures the discrepancy between the observation and reconstruction.</p>
<p>In blind deconvolution settings, both the latent image <inline-formula>
<mml:math display="inline" id="im28"><mml:mi>x</mml:mi></mml:math></inline-formula> and the blur kernel <inline-formula>
<mml:math display="inline" id="im29"><mml:mi>k</mml:mi></mml:math></inline-formula> are unknown (<xref ref-type="disp-formula" rid="eq10">Equation 10</xref>):</p>
<disp-formula id="eq10"><label>(10)</label>
<mml:math display="block" id="M10"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mi>k</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mtext>arg&#xa0;</mml:mtext><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munder><mml:mi mathvariant="script">D</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo>&#x2217;</mml:mo><mml:mtext>&#xa0;</mml:mtext><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>To explicitly handle structured noise, an auxiliary variable <inline-formula>
<mml:math display="inline" id="im30"><mml:mrow><mml:mi>o</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is introduced, leading to the modified observation model (<xref ref-type="disp-formula" rid="eq11">Equation 11</xref>):</p>
<disp-formula id="eq11"><label>(11)</label>
<mml:math display="block" id="M11"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>H</mml:mi><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>o</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x454;</mml:mi><mml:mi>G</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>and the corresponding joint estimation problem is (<xref ref-type="disp-formula" rid="eq12">Equation 12</xref>):</p>
<disp-formula id="eq12"><label>(12)</label>
<mml:math display="block" id="M12"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mi>o</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mtext>arg&#xa0;</mml:mtext><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>o</mml:mi></mml:mrow></mml:munder><mml:mi mathvariant="script">D</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>H</mml:mi><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:mi>o</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x3bb;</mml:mi><mml:mtext>&#x3a8;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>o</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im31"><mml:mrow><mml:mtext>&#x3a8;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is a sparsity-promoting regularizer and <inline-formula>
<mml:math display="inline" id="im32"><mml:mi>&#x3bb;</mml:mi></mml:math></inline-formula> is a positive parameter balancing fidelity and noise modeling.</p>
<p>To stabilize the inversion when <inline-formula>
<mml:math display="inline" id="im33"><mml:mi mathvariant="script">H</mml:mi></mml:math></inline-formula> is ill-conditioned, a Tikhonov regularization is introduced (<xref ref-type="disp-formula" rid="eq13">Equation 13</xref>):</p>
<disp-formula id="eq13"><label>(13)</label>
<mml:math display="block" id="M13"><mml:mrow><mml:mi mathvariant="script">R</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo><mml:mrow><mml:mi>L</mml:mi><mml:mi>x</mml:mi></mml:mrow><mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>L</italic> denotes a Laplacian or higher-order differential operator.</p>
<p>Multiscale approaches are utilized to progressively refine reconstructions. Let <inline-formula>
<mml:math display="inline" id="im34"><mml:mrow><mml:msubsup><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo>}</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>S</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> represent the latent images at multiple scales, with the degradation model at each scale formulated as (<xref ref-type="disp-formula" rid="eq14">Equation 14</xref>):</p>
<disp-formula id="eq14"><label>(14)</label>
<mml:math display="block" id="M14"><mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi mathvariant="script">H</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x454;</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im35"><mml:mrow><mml:msup><mml:mi mathvariant="script">H</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula>
<mml:math display="inline" id="im36"><mml:mrow><mml:msup><mml:mi>&#x454;</mml:mi><mml:mi>s</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> denote the degradation operator and noise at scale <inline-formula>
<mml:math display="inline" id="im37"><mml:mi>s</mml:mi></mml:math></inline-formula>, respectively.</p>
<p>An attention mechanism is introduced to adaptively weigh spatial locations. Defining an attention map <inline-formula>
<mml:math display="inline" id="im38"><mml:mrow><mml:mi>&#x3b1;</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula>, the degradation model becomes (<xref ref-type="disp-formula" rid="eq15">Equation 15</xref>):</p>
<disp-formula id="eq15"><label>(15)</label>
<mml:math display="block" id="M15"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x3b1;</mml:mi><mml:mo>&#x2299;</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x3b1;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2299;</mml:mo><mml:mi>&#x3b7;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where &#x2299; denotes element-wise multiplication and <italic>&#x3b7;</italic> models dominant noise.</p>
<p>All subsequent methods are built upon this general formalism, enabling flexible modeling of diverse degradation processes and guiding the design of robust imaging recovery algorithms.</p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Dynamic structure-aware imaging network</title>
<p>In this section, we present the proposed dynamic structure-aware imaging network (DSINet), a novel imaging framework specifically tailored to robustly reconstruct high-fidelity latent images from heavily degraded observations. Traditional imaging models often fail under severe degradations due to their reliance on static filters and inadequate modeling of structural information. DSINet fundamentally rethinks this paradigm by introducing three key innovations that jointly enable dynamic, structure-aware, and uncertainty-guided imaging. The detailed design and mathematical formulation of DSINet are provided below. The DSINet architecture incorporates multimodal interaction by integrating visual and textual tokens within the adaptive filtering module. As illustrated in <xref ref-type="fig" rid="f2"><bold>Figure&#xa0;2</bold></xref>, both Vision Transformer embeddings and language-derived query embeddings are passed to a token-level sampler. The resulting informative tokens serve as semantic priors to guide the generation of spatially adaptive filters in <italic>A<sub>&#x3b8;</sub></italic>(<italic>x</italic>). These tokens influence the selection and modulation of convolutional kernels in a data-dependent manner. The framework enables conditioning of local image processing on both morphological features and semantic textual cues, enhancing both accuracy and interpretability. The structure-preserving module further fuses adaptive outputs with anatomical priors to retain spatial fidelity. To improve consistency between the mathematical formulation and architectural illustration, a mapping table is provided (see <xref ref-type="table" rid="T1"><bold>Table&#xa0;1</bold></xref>) to clarify the correspondence between pixel-level variables and token-level representations. This alignment facilitates a better understanding of how vision and text modalities are integrated within DSINet. Pixel-wise operators such as <italic>A<sub>&#x3b8;</sub></italic>(<italic>x</italic>) and <italic>K</italic>(<italic>f<sub>i</sub></italic>) are dynamically modulated by token embeddings derived from both the Vision Transformer and Text-Guided Sampler. This joint representation supports spatially adaptive processing that remains semantically grounded in both visual and linguistic contexts.</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Mapping between pixel-level and token-level representations in DSINet.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Symbol in math</th>
<th valign="middle" align="center">Meaning</th>
<th valign="middle" align="center">Corresponding</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center"><italic>x</italic>(<italic>i</italic>)</td>
<td valign="middle" align="center">Input pixel at location <italic>i</italic></td>
<td valign="middle" align="center">Vision token from transformer</td>
</tr>
<tr>
<td valign="middle" align="center"><italic>f<sub>i</sub></italic></td>
<td valign="middle" align="center">Feature embedding at location <italic>i</italic></td>
<td valign="middle" align="center">Token embedding from ViT</td>
</tr>
<tr>
<td valign="middle" align="center"><italic>qk</italic></td>
<td valign="middle" align="center">Query embedding from text</td>
<td valign="middle" align="center">Text-guided sampler output</td>
</tr>
<tr>
<td valign="middle" align="center"><italic>A<sub>&#x3b8;</sub></italic>(<italic>x</italic>)</td>
<td valign="middle" align="center">Adaptive operator over image space</td>
<td valign="middle" align="center">Token-informed dynamic filter module</td>
</tr>
<tr>
<td valign="middle" align="center"><italic>K</italic>(<italic>f<sub>i</sub></italic>)</td>
<td valign="middle" align="center">Generated kernel for pixel <italic>i</italic></td>
<td valign="middle" align="center">Kernel modulated by token features</td>
</tr>
<tr>
<td valign="middle" align="center"><italic>&#x2208;</italic>(<italic>x</italic>)(<italic>i</italic>)</td>
<td valign="middle" align="center">Auxiliary embedding at <italic>i</italic></td>
<td valign="middle" align="center">Contextual feature from local patch</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Spatially adaptive dynamic filtering</title>
<p>DSINet departs from conventional convolutional operators by introducing a spatially adaptive dynamic convolution mechanism. Let <inline-formula>
<mml:math display="inline" id="im39"><mml:mrow><mml:mi>x</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> represent the unknown latent image and <inline-formula>
<mml:math display="inline" id="im40"><mml:mrow><mml:mi>y</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo>&#x2208;</mml:mo><mml:mtext>&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> denote the observed degraded image. The degradation process can be modeled as (<xref ref-type="disp-formula" rid="eq16">Equation 16</xref>):</p>
<disp-formula id="eq16"><label>(16)</label>
<mml:math display="block" id="M16"><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x454;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im41"><mml:mrow><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the degradation operator and <inline-formula>
<mml:math display="inline" id="im42"><mml:mo>&#x2208;</mml:mo></mml:math></inline-formula> is the&#xa0;additive noise. Instead of using fixed filters, DSINet defines&#xa0;an&#xa0;adaptive operator <inline-formula>
<mml:math display="inline" id="im43"><mml:mrow><mml:msub><mml:mi mathvariant="script">A</mml:mi><mml:mi>&#x3b8;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> parameterized by learnable weights <inline-formula>
<mml:math display="inline" id="im44"><mml:mi>&#x3b8;</mml:mi></mml:math></inline-formula>,&#xa0;dynamically adapting based on the local input content (<xref ref-type="disp-formula" rid="eq17">Equation 17</xref>):</p>
<disp-formula id="eq17"><label>(17)</label>
<mml:math display="block" id="M17"><mml:mrow><mml:msub><mml:mi mathvariant="script">A</mml:mi><mml:mi>&#x3b8;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mtext>&#x3a9;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:munder><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im45"><mml:mrow><mml:mtext>&#x3a9;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the neighborhood centered at pixel <inline-formula>
<mml:math display="inline" id="im46"><mml:mi>i</mml:mi></mml:math></inline-formula> and <inline-formula>
<mml:math display="inline" id="im47"><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> are context-dependent weights. To generate these dynamic weights, a feature encoder <inline-formula>
<mml:math display="inline" id="im48"><mml:mrow><mml:mi mathvariant="script">E</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>d</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is employed, followed by&#xa0;a&#xa0;dynamic kernel generator <inline-formula>
<mml:math display="inline" id="im49"><mml:mrow><mml:mi mathvariant="script">K</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>d</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mrow><mml:mo>&#x2758;</mml:mo><mml:mtext>&#x3a9;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2758;</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> (<xref ref-type="disp-formula" rid="eq18">Equation 18</xref>):</p>
<disp-formula id="eq18"><label>(18)</label>
<mml:math display="block" id="M18"><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="script">E</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<disp-formula id="eq19"><label>(19)</label>
<mml:math display="block" id="M19"><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi mathvariant="script">K</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>The term <italic>&#x454;</italic>(<italic>x</italic>)(<italic>i</italic>) represents (<xref ref-type="disp-formula" rid="eq19">Equation 19</xref>) an auxiliary feature embedding extracted from the input image <italic>x</italic> centered at spatial location <italic>i</italic>. This embedding is designed to capture both low-level and mid-level visual cues that are relevant for enhancing spatial adaptivity in the DSINet framework. <italic>&#x454;</italic>(<italic>x</italic>)(<italic>i</italic>) is implemented as a shallow convolutional block that aggregates local information from a fixed-size receptive field (like 5 &#xd7; 5 or 7 &#xd7; 7 window). Unlike handcrafted descriptors, this module is learnable and trained jointly with the rest of the network. The features captured by &#x2208;(<italic>x</italic>)(<italic>i</italic>) include intensity variations, texture patterns, and local contrast, all of which are implicitly learned through convolutional filters. In addition to spatial gradients and edge-related features, the embedding also encodes contextual patterns that correlate with molecular characteristics, especially when integrated with the attention mechanism. While the primary focus is on local context, the use of dilated convolutions and multi-scale aggregation allows the embedding to incorporate a limited degree of broader contextual information. This ensures that <italic>&#x454;</italic>(<italic>x</italic>)(<italic>i</italic>) is sensitive not only to pixel-level changes but also to regional structure and texture, which is essential in medical imaging tasks involving subtle morphological variations.</p>
<p>Normalization of <italic>w<sub>i</sub></italic>through the softmax function ensures numerical stability (<xref ref-type="disp-formula" rid="eq20">Equation 20</xref>):</p>
<disp-formula id="eq20"><label>(20)</label>
<mml:math display="block" id="M20"><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#xa0;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>j</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mo>&#x2208;</mml:mo><mml:mi>&#x3a9;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mtext>exp&#xa0;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>j</mml:mi><mml:mo>&#x2032;</mml:mo></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>The adaptive nature of A<italic><sub>&#x3b8;</sub></italic>(<italic>x</italic>) enables DSINet to effectively adjust its filtering behavior based on varying local degradations, improving robustness against diverse image corruptions.</p>
<p>The adaptive operator <italic>A<sub>&#x3b8;</sub></italic>(<italic>x</italic>) employs a dynamic filtering mechanism in which the kernel generator <italic>K</italic>(<italic>f<sub>i</sub></italic>) plays a central role in capturing local context. Rather than using a globally shared kernel, the generator <italic>K</italic>(<italic>f<sub>i</sub></italic>) produces pixel-wise dynamic filters conditioned on the local feature embedding <italic>f<sub>i</sub></italic>at each spatial location. For every pixel <italic>i</italic> in the input feature map, a unique kernel is generated based on the local appearance and contextual features surrounding <italic>i</italic>. This design ensures that the filtering operation is spatially adaptive and content-aware, enabling the model to respond to heterogeneity in tissue morphology and molecular context. The kernel generator is implemented as a lightweight convolutional subnetwork that takes the intermediate feature map as input and outputs a set of per-pixel convolution kernels with fixed spatial size (like 3 &#xd7; 3). These generated kernels are then applied via depth-wise convolution over the local neighborhood of each pixel. As a result, different neighborhoods are processed using distinct, dynamically generated kernels, rather than a shared static kernel. This mechanism is crucial for capturing fine-grained structures, particularly in medical imaging scenarios where boundary precision and local variations are important. Computational complexity is managed through channel grouping and kernel compression techniques to maintain efficiency during training and inference.</p>
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Structure-preserving attention modulation</title>
<p>To further enhance reconstruction quality, DSINet integrates a structure-preserving feature modulation mechanism. A structure feature map <inline-formula>
<mml:math display="inline" id="im50"><mml:mrow><mml:mi>s</mml:mi><mml:mtext>&#xa0;&#x2208;&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is extracted by applying a structure extractor <inline-formula>
<mml:math display="inline" id="im51"><mml:mrow><mml:mi mathvariant="script">S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> to the input (<xref ref-type="disp-formula" rid="eq21">Equation 21</xref>):</p>
<disp-formula id="eq21"><label>(21)</label>
<mml:math display="block" id="M21"><mml:mrow><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mi mathvariant="script">S</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>capturing prominent edges, textures, or ridges. Based on both the dynamic features <italic>f<sub>i</sub></italic>and the structural cues &lt;</p>
<p>, an attention map <inline-formula>
<mml:math display="inline" id="im52"><mml:mrow><mml:mi>&#x3b1;</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> is computed (<xref ref-type="disp-formula" rid="eq22">Equation 22</xref>):</p>
<disp-formula id="eq22"><label>(22)</label>
<mml:math display="block" id="M22"><mml:mrow><mml:mi>&#x3b1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x3c3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mi>s</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>g</italic>(&#xb7;,&#xb7;) is a gating function combining dynamic and structural information, and <italic>&#x3c3;</italic>(&#xb7;) denotes the sigmoid function. The final modulated feature at each location is (<xref ref-type="disp-formula" rid="eq23">Equation 23</xref>):</p>
<disp-formula id="eq23"><label>(23)</label>
<mml:math display="block" id="M23"><mml:mrow><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>&#x2dc;</mml:mo></mml:mover><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x3b1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#xb7;</mml:mo><mml:msub><mml:mi mathvariant="script">A</mml:mi><mml:mi>&#x3b8;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x3b1;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>This modulation effectively leverages both local adaptivity and global structural information, enabling DSINet to preserve important visual structures while correcting degradations.</p>
<p>Furthermore, DSINet introduces an adaptive normalization layer to dynamically accommodate spatially varying noise characteristics. A noise estimator <inline-formula>
<mml:math display="inline" id="im53"><mml:mrow><mml:mi mathvariant="script">N</mml:mi><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:mi>&#x211d;</mml:mi></mml:mrow></mml:math></inline-formula> predicts the local noise level <italic>&#x3c3;</italic> at each pixel, and intermediate features are normalized accordingly (<xref ref-type="disp-formula" rid="eq24">Equation 24</xref>):</p>
<disp-formula id="eq24"><label>(24)</label>
<mml:math display="block" id="M24"><mml:mrow><mml:msub><mml:mover accent="true"><mml:mi>f</mml:mi><mml:mo>&#x2dc;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x3bc;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x3c3;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:msubsup><mml:mi>&#x3c3;</mml:mi><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x3c3;</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x454;</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>&#xb5;<sub>i</sub></italic>(<italic>&#x3c3;</italic>) and <italic>&#x3c3;<sub>i</sub></italic><sup>2</sup>(<italic>&#x3c3;</italic>) are the mean and variance conditioned on the noise estimate, and <italic>&#x454;</italic> is a small constant ensuring numerical stability. This mechanism enables DSINet to maintain high performance under varying noise levels without explicit noise-level supervision.</p>
</sec>
<sec id="s3_3_3">
<label>3.3.3</label>
<title>Multi-resolution refinement and uncertainty modeling</title>
<p>To effectively capture multi-scale contextual dependencies, DSINet employs a hierarchical refinement mechanism across <italic>S</italic> resolution scales <inline-formula>
<mml:math display="inline" id="im54"><mml:mrow><mml:msubsup><mml:mrow><mml:mo>{</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo>}</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>S</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> (as shown in <xref ref-type="fig" rid="f4"><bold>Figure&#xa0;4</bold></xref>).</p>
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>This figure illustrates the channel attention mechanism used in the multi-resolution refinement and uncertainty modeling framework. Multi-scale feature maps undergo average and max pooling, followed by a shared MLP to compute a channel-wise attention map. The resulting attention weights, after sigmoid activation, enhance feature consistency across scales and contribute to uncertainty-aware reconstruction in DSINet.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-15-1643504-g004.tif">
<alt-text content-type="machine-generated">Diagram showing a process for channel-wise attention mapping in neural networks. It begins with multi-scale feature maps undergoing MaxPool and AvgPool operations, resulting in two 1x1x1xc outputs. These are processed through a channel attention MLP, producing two outputs. These outputs are combined, passed through a sigmoid function, and result in a final 1x1x1xc channel-wise attention map.</alt-text>
</graphic></fig>
<p>At each scale <italic>s</italic>, the feature representation is recursively updated as (<xref ref-type="disp-formula" rid="eq25">Equation 25</xref>):</p>
<disp-formula id="eq25"><label>(25)</label>
<mml:math display="block" id="M25"><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi mathvariant="script">U</mml:mi><mml:mtext>&#xa0;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi mathvariant="script">F</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mtext>&#xa0;</mml:mtext><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im55"><mml:mrow><mml:msup><mml:mi mathvariant="script">F</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the DSINet processing module at scale <inline-formula>
<mml:math display="inline" id="im56"><mml:mi>s</mml:mi></mml:math></inline-formula>, and <inline-formula>
<mml:math display="inline" id="im57"><mml:mrow><mml:mi mathvariant="script">U</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is an upsampling operator. The observed degraded image <italic>y</italic> is similarly downsampled to match each scale, and a residual is computed (<xref ref-type="disp-formula" rid="eq26">Equation 26</xref>):</p>
<disp-formula id="eq26"><label>(26)</label>
<mml:math display="block" id="M26"><mml:mrow><mml:msup><mml:mi>r</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi mathvariant="script">H</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>&#x2dc;</mml:mo></mml:mover><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>guiding the network toward progressively refined reconstructions.</p>
<p>To ensure coherence between different scales, a scale alignment loss is introduced (<xref ref-type="disp-formula" rid="eq27">Equation 27</xref>):</p>
<disp-formula id="eq27"><label>(27)</label>
<mml:math display="block" id="M27"><mml:mrow><mml:msub><mml:mi mathvariant="script">L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x2016;</mml:mo><mml:mrow><mml:mi>&#x1d49f;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>s</mml:mi></mml:msup></mml:mrow><mml:mo>&#x2016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im58"><mml:mrow><mml:mi mathvariant="script">D</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is a downsampling operator compatible with <inline-formula>
<mml:math display="inline" id="im59"><mml:mrow><mml:mi mathvariant="script">U</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. This encourages consistent feature representations across scales, reducing artifacts due to scale mismatch.</p>
<p>The final high-resolution output <inline-formula>
<mml:math display="inline" id="im60"><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> is obtained by merging the outputs from all scales via a fusion module &#x3a6;(&#xb7;) (<xref ref-type="disp-formula" rid="eq28">Equation 28</xref>):</p>
<disp-formula id="eq28"><label>(28)</label>
<mml:math display="block" id="M28"><mml:mrow><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mtext>&#x3a6;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mn>1</mml:mn></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>S</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>allowing the network to integrate fine and coarse information.</p>
<p>To further improve reconstruction robustness, DSINet explicitly models the uncertainty associated with its predictions. A predictive uncertainty map &#x3a3;(<italic>x</italic>) is generated, and the final prediction is corrected based on this uncertainty (<xref ref-type="disp-formula" rid="eq29">Equation 29</xref>):</p>
<disp-formula id="eq29"><label>(29)</label>
<mml:math display="block" id="M29"><mml:mrow><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x3b2;</mml:mi><mml:mtext>&#x3a3;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>&#x3b2;</italic> is a learnable scalar controlling the correction magnitude. By explicitly accounting for prediction uncertainty, DSINet can adaptively refine uncertain regions, leading to overall more reliable reconstructions.</p>
<p>Through these innovations, DSINet successfully addresses the challenges posed by complex and severe degradations in real-world imaging tasks. Its design enables dynamic adjustment to spatial context, effective structural preservation, hierarchical refinement, and robust uncertainty-aware reconstruction without relying on any fixed prior assumptions.</p>
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Progressive structure-guided optimization</title>
<p>In this section, we introduce progressive structure-guided optimization (PSGO), a novel iterative framework designed to refine the imaging reconstruction produced by DSINet. The core idea of PSGO is to progressively focus on reliable structural components while dynamically adjusting the optimization pathway based on intermediate recovery states (as shown in <xref ref-type="fig" rid="f3"><bold>Figure&#xa0;3</bold></xref>).</p>
<sec id="s3_4_1">
<label>3.4.1</label>
<title>Adaptive confidence-guided decomposition</title>
<p>Given the degraded observation <inline-formula>
<mml:math display="inline" id="im61"><mml:mrow><mml:mi>y</mml:mi><mml:mtext>&#xa0;&#x2208;&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> and the current estimation <inline-formula>
<mml:math display="inline" id="im62"><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mtext>&#xa0;&#x2208;&#xa0;</mml:mtext><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> at iteration <inline-formula>
<mml:math display="inline" id="im63"><mml:mi>t</mml:mi></mml:math></inline-formula>, PSGO defines an adaptive objective function (<xref ref-type="disp-formula" rid="eq30">Equation 30</xref>):</p>
<disp-formula id="eq30"><label>(30)</label>
<mml:math display="block" id="M30"><mml:mrow><mml:msup><mml:mi mathvariant="script">J</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi mathvariant="script">D</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:msup><mml:mi>&#x3bb;</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:msup><mml:mi mathvariant="script">R</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>D</italic>(&#xb7;,&#xb7;) measures the fidelity to the observation, and R<italic><sup>t</sup></italic>(&#xb7;) is a structure-aware regularizer evolving over iterations. At each iteration <italic>t</italic>, the image domain is decomposed into a reliable region I<italic><sub>t</sub></italic>and an uncertain region <inline-formula>
<mml:math display="inline" id="im64"><mml:mrow><mml:msubsup><mml:mi mathvariant="script">I</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula>, according to a structural confidence map <inline-formula>
<mml:math display="inline" id="im65"><mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo>:</mml:mo><mml:msup><mml:mi>&#x211d;</mml:mi><mml:mi>n</mml:mi></mml:msup><mml:mo>&#x2192;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mi>n</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> (<xref ref-type="disp-formula" rid="eq31">Equation 31</xref>):</p>
<disp-formula id="eq31"><label>(31)</label>
<mml:math display="block" id="M31"><mml:mrow><mml:msub><mml:mi mathvariant="script">I</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo>{</mml:mo><mml:mi>i</mml:mi><mml:mtext>&#xa0;&#x2208;&#xa0;</mml:mtext><mml:mo stretchy="false">[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy="false">]</mml:mo><mml:mo>&#x2758;</mml:mo><mml:msup><mml:mi>C</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2265;</mml:mo><mml:msub><mml:mi>&#x3c4;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>}</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>&#x3c4;<sub>t</sub></italic>is a dynamically decreasing threshold controlling the inclusion of pixels. The confidence map <italic>C<sup>t</sup></italic> is estimated by evaluating the consistency between the observation and the forward model (<xref ref-type="disp-formula" rid="eq32">Equation 32</xref>):</p>
<disp-formula id="eq32"><label>(32)</label>
<mml:math display="block" id="M32"><mml:mrow><mml:msup><mml:mi>C</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mtext>exp&#xa0;</mml:mtext><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:msup><mml:mi>&#x3c3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>&#x3c3;</italic> is a robustness parameter estimated empirically or via auxiliary networks. To emphasize reliable regions during optimization, PSGO applies a spatial weighting scheme (<xref ref-type="disp-formula" rid="eq33">Equation 33</xref>):</p>
<disp-formula id="eq33"><label>(33)</label>
<mml:math display="block" id="M33"><mml:mrow><mml:msubsup><mml:mi>&#x3c9;</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mo>{</mml:mo><mml:mtable columnalign="left" equalrows="true" equalcolumns="true"><mml:mtr columnalign="left"><mml:mtd columnalign="left"><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign="left"><mml:mrow><mml:mi>if</mml:mi><mml:mtext>&#x2004;</mml:mtext><mml:mi>i</mml:mi><mml:mtext>&#xa0;&#x2208;&#xa0;</mml:mtext><mml:msub><mml:mi mathvariant="script">I</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign="left"><mml:mtd columnalign="left"><mml:mrow><mml:msub><mml:mi>&#x3b3;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign="left"><mml:mrow><mml:mi>if</mml:mi><mml:mtext>&#x2004;</mml:mtext><mml:mi>i</mml:mi><mml:mtext>&#xa0;&#x2208;&#xa0;</mml:mtext><mml:msubsup><mml:mi mathvariant="script">I</mml:mi><mml:mi>t</mml:mi><mml:mi>c</mml:mi></mml:msubsup><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>&#x3b3;<sub>t</sub></italic>&#x2208; (0,1) progressively increases with <italic>t</italic> to allow gradual inclusion of uncertain regions. The modified discrepancy loss at iteration <italic>t</italic> becomes (<xref ref-type="disp-formula" rid="eq34">Equation 34</xref>):</p>
<disp-formula id="eq34"><label>(34)</label>
<mml:math display="block" id="M34"><mml:mrow><mml:msup><mml:mi mathvariant="script">D</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:msubsup><mml:mi>&#x3c9;</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>.</mml:mo></mml:mrow></mml:math>
</disp-formula>
</sec>
<sec id="s3_4_2">
<label>3.4.2</label>
<title>Dynamic graph-regularized propagation</title>
<p>The structure-guided regularization <inline-formula>
<mml:math display="inline" id="im66"><mml:mrow><mml:msup><mml:mi mathvariant="script">R</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> incorporates dynamic graph-based constraints. A graph <inline-formula>
<mml:math display="inline" id="im67"><mml:mrow><mml:msup><mml:mi>G</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>E</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is defined over the image domain, where <inline-formula>
<mml:math display="inline" id="im68"><mml:mrow><mml:mi>V</mml:mi><mml:mo>=</mml:mo><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:mi>n</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula>
<mml:math display="inline" id="im69"><mml:mrow><mml:msup><mml:mi>E</mml:mi><mml:mi>t</mml:mi></mml:msup></mml:mrow></mml:math></inline-formula> connects structurally similar pixels based on feature similarities  (<xref ref-type="disp-formula" rid="eq35">Equation 35</xref>):</p>
<disp-formula id="eq35"><label>(35)</label>
<mml:math display="block" id="M35"><mml:mrow><mml:msup><mml:mi>E</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo>=</mml:mo><mml:mo>{</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2758;</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mi mathvariant="script">F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="script">F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mo>&#x2225;</mml:mo><mml:mn>2</mml:mn></mml:msub><mml:mo>&#x2264;</mml:mo><mml:msub><mml:mi>&#x454;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>}</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>Here <inline-formula>
<mml:math display="inline" id="im70"><mml:mrow><mml:mi mathvariant="script">F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is a feature extractor and <inline-formula>
<mml:math display="inline" id="im71"><mml:mrow><mml:msub><mml:mi>&#x454;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is a tightening threshold decreasing over time. The graph-based regularizer is defined as (<xref ref-type="disp-formula" rid="eq36">Equation 36</xref>):</p>
<disp-formula id="eq36"><label>(36)</label>
<mml:math display="block" id="M36"><mml:mrow><mml:msup><mml:mi mathvariant="script">R</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>E</mml:mi><mml:mi>t</mml:mi></mml:msup></mml:mrow></mml:munder><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mi>2</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im72"><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> are adaptive edge weights based on feature affinities (<xref ref-type="disp-formula" rid="eq37">Equation 37</xref>):</p>
<disp-formula id="eq37"><label>(37)</label>
<mml:math display="block" id="M37"><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mtext>exp&#xa0;</mml:mtext><mml:mo>(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x2225;</mml:mo><mml:mi mathvariant="script">F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="script">F</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mi>2</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x3b2;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>and <inline-formula>
<mml:math display="inline" id="im73"><mml:mrow><mml:msub><mml:mi>&#x3b2;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> decreases gradually to sharpen structural attention. The&#xa0;optimization at each iteration <inline-formula>
<mml:math display="inline" id="im74"><mml:mi>t</mml:mi></mml:math></inline-formula> updates <inline-formula>
<mml:math display="inline" id="im75"><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> according to (<xref ref-type="disp-formula" rid="eq38">Equation 38</xref>):</p>
<disp-formula id="eq38"><label>(38)</label>
<mml:math display="block" id="M38"><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x3b7;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2207;</mml:mo><mml:msup><mml:mi mathvariant="script">J</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im76"><mml:mrow><mml:msub><mml:mi>&#x3b7;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is an adaptive step size schedule satisfying (<xref ref-type="disp-formula" rid="eq39">Equation&#xa0;39</xref>):</p>
<disp-formula id="eq39"><label>(39)</label>
<mml:math display="block" id="M39"><mml:mrow><mml:msub><mml:mi>&#x3b7;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>&#x3b7;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>&#x3c1;</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>with <inline-formula>
<mml:math display="inline" id="im77"><mml:mrow><mml:msub><mml:mi>&#x3b7;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math></inline-formula> being the initial learning rate and <inline-formula>
<mml:math display="inline" id="im78"><mml:mi>&#x3c1;</mml:mi></mml:math></inline-formula> controlling the decay rate. To prevent over-smoothing and preserve fine structures, a residual correction term is introduced (<xref ref-type="disp-formula" rid="eq40">Equation 40</xref>):</p>
<disp-formula id="eq40"><label>(40)</label>
<mml:math display="block" id="M40"><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x3b6;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:msub><mml:mi mathvariant="script">R</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im79"><mml:mrow><mml:msub><mml:mi mathvariant="script">R</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes a residual sharpening operator and <inline-formula>
<mml:math display="inline" id="im80"><mml:mrow><mml:msub><mml:mi>&#x3b6;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> is a decaying coefficient.</p>
</sec>
<sec id="s3_4_3">
<label>3.4.3</label>
<title>Iterative restructuring with uncertainty adaptation</title>
<p>At every <italic>K</italic> iterations, PSGO performs a re-structuring phase, re-estimating the graph <italic>G<sup>t</sup></italic> and recalibrating the confidence map <italic>C<sup>t</sup></italic>, ensuring adaptability to the evolving reconstruction (as shown in <xref ref-type="fig" rid="f5"><bold>Figure&#xa0;5</bold></xref>).</p>
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>Iterative restructuring with uncertainty adaptation for multimodal fusion. The figure presents a multimodal learning system integrating visual, speech, and text modalities. Inputs <italic>x<sub>v</sub></italic>, <italic>x<sub>s</sub></italic>, and <italic>x<sub>t</sub></italic>are encoded into modality-specific features <italic>z<sub>v</sub></italic>, <italic>z<sub>s</sub></italic>, and <italic>z<sub>t</sub></italic>, which are then fused via a multimodal fusion with uncertainty-aware reweighting module. The fused representation is processed through an iterative uncertainty-aware optimization (IUO) framework, enabling two tasks: generation via a latent decoder and classification via an IUO-guided classification head. At every <italic>K</italic> iteration, the system performs restructuring by updating the graph <italic>G<sup>t</sup></italic> and recalibrating the confidence map <italic>C<sup>t</sup></italic>, using adaptive thresholding and uncertainty-guided reweighting. This leads to structure-consistent, robust reconstruction and improved task performance through dynamically regularized optimization.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fonc-15-1643504-g005.tif">
<alt-text content-type="machine-generated">Diagram illustrating a system for multimodal fusion with uncertainty-aware reweighting. Visual, speech, and text inputs are represented as \( L_v \), \( L_s \), and \( L_t \), processed into \( z_v \), \( z_s \), and \( z_t \). These undergo multimodal fusion, followed by iterative uncertainty-aware optimization. The output is divided into two branches: generation with a latent decoder producing a decoded output, and classification with guidance leading to class labels.</alt-text>
</graphic></fig>
<p>The re-structuring updates the confidence threshold as (<xref ref-type="disp-formula" rid="eq41">Equation 41</xref>):</p>
<disp-formula id="eq41"><label>(41)</label>
<mml:math display="block" id="M41"><mml:mrow><mml:msub><mml:mi>&#x3c4;</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x3ba;</mml:mi><mml:msub><mml:mi>&#x3c4;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>&#x3ba;</italic> &#x2208; (0,1) controls the relaxation speed. Moreover, PSGO integrates an uncertainty-aware reweighting mechanism to focus learning efforts (<xref ref-type="disp-formula" rid="eq42">Equation 42</xref>):</p>
<disp-formula id="eq42"><label>(42)</label>
<mml:math display="block" id="M42"><mml:mrow><mml:msubsup><mml:mi>&#x3c9;</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mi>&#x3c9;</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>&#xb7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mi>&#x3b4;</mml:mi><mml:mtext>&#x3a3;</mml:mtext><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where &#x3a3;(<italic>x<sup>t</sup></italic>) is the predictive uncertainty map from DSINet, and <italic>&#x3b4;</italic> is a positive scalar adjusting the modulation strength. To guarantee convergence, a diminishing residual energy condition is enforced (<xref ref-type="disp-formula" rid="eq43">Equation 43</xref>):</p>
<disp-formula id="eq43"><label>(43)</label>
<mml:math display="block" id="M43"><mml:mrow><mml:mo>&#x2225;</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mi>2</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mo>&#x2225;</mml:mo><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi mathvariant="script">H</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:msubsup><mml:mo>&#x2225;</mml:mo><mml:mi>2</mml:mi><mml:mn>2</mml:mn></mml:msubsup><mml:mo>&#x2265;</mml:mo><mml:msub><mml:mi>&#x3be;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>&#x3be;<sub>t</sub></italic>is a non-increasing sequence ensuring steady improvement. Finally, the overall reconstruction after <italic>T</italic> iterations is (<xref ref-type="disp-formula" rid="eq44">Equation 44</xref>):</p>
<disp-formula id="eq44"><label>(44)</label>
<mml:math display="block" id="M44"><mml:mrow><mml:mover accent="true"><mml:mi>x</mml:mi><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mi>x</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>,</mml:mo></mml:mrow></mml:math>
</disp-formula>
<p>providing a high-fidelity, structure-consistent, and noise-robust latent image.</p>
<p>Through this progressively structured, adaptively weighted, and dynamically regularized optimization procedure, PSGO enhances the reconstruction quality beyond traditional static optimization strategies, effectively handling complex degradation patterns and diverse noise characteristics across imaging tasks.</p>
<p>The feature extractor <italic>F</italic>(<italic>x<sub>t</sub></italic>) used for graph construction is implemented as a lightweight encoder module within DSINet, designed to capture mid-level semantic and structural cues. Specifically, the features are extracted from an intermediate layer of the encoder, balancing between spatial resolution and contextual richness. This design enables the graph construction process to preserve both local texture and global structure information without incurring excessive computational cost. Regarding the choice of hyperparameters in Equations, each plays a role in controlling the dynamics of confidence propagation and structure-guided optimization. The parameter <italic>&#x3c3;</italic> determines the sensitivity of edge weights in the graph and is empirically set based on the standard deviation of feature distances within a batch. <italic>&#x3b2;<sub>t</sub></italic>governs the strength of temporal smoothing and is scheduled to decay logarithmically with iteration <italic>t</italic> to allow early-stage exploration and late-stage stabilization. The coefficients <italic>&#x3b3;<sub>t</sub></italic>and <italic>&#x3b6;<sub>t</sub></italic>are initialized to small positive constants (like 0.01 and 0.05, respectively) and updated adaptively based on the gradient norm of the confidence map, encouraging stronger corrections in uncertain regions. The graph and confidence map are restructured every <italic>K</italic> = 5 iterations, which reflects a trade-off between computational cost and structural adaptability. A smaller <italic>K</italic> increases responsiveness but also overhead, while a larger <italic>K</italic> may delay convergence in rapidly changing regions. Empirically, <italic>K</italic> = 5 yielded stable and efficient convergence across all datasets. The adaptive step size <italic>&#x3b7;<sub>t</sub></italic>is updated using a momentum-based scheme that incorporates the variance of previous updates, promoting smoother convergence and avoiding oscillations. In our implementation, this mechanism leads to faster stabilization of confidence propagation compared to fixed-step schemes. Constructing a fully connected graph over all pixels is computationally infeasible for high-resolution images. Therefore, we adopt a local neighborhood approximation where each node connects to its <italic>k</italic> nearest neighbors (like <italic>k</italic> = 16) in feature space using an efficient approximate nearest neighbor algorithm (like FAISS). This approximation reduces graph complexity from <italic>O</italic>(<italic>N</italic><sup>2</sup>) to <italic>O</italic>(<italic>Nk</italic>) while maintaining structural fidelity. Additional sparsity is enforced by thresholding weak edges to zero, resulting in a sparse adjacency matrix suitable for GPU-accelerated computation.</p>
</sec>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experimental setup</title>
<sec id="s4_1">
<label>4.1</label>
<title>Dataset</title>
<p>The BraTS Dataset (<xref ref-type="bibr" rid="B34">34</xref>) is a widely used benchmark in the field of medical image analysis, specifically designed for the segmentation of brain tumors in multi-modal magnetic resonance imaging (MRI) scans. It contains annotated images from patients with gliomas, both high-grade and low-grade, across multiple institutions and scanners, ensuring diversity and robustness. The dataset includes several MRI sequences such as T1, T1-contrast enhanced, T2, and FLAIR, providing comprehensive structural information crucial for tumor delineation. The ground truth labels are manually segmented by experienced radiologists, distinguishing tumor core, enhancing tumor, and edema regions. BraTS has been the focus of multiple international challenges, driving advancements in automated brain tumor segmentation and serving as a reference for validating new methodologies. Its rigorous curation, multi-center acquisition, and detailed annotations make it indispensable for developing and benchmarking machine learning algorithms for neuro-on. The OASIS Dataset (<xref ref-type="bibr" rid="B35">35</xref>) is an open-access collection of brain imaging data aimed at advancing the understanding of normal aging and Alzheimer&#x2019;s disease. It includes cross-sectional and longitudinal MRI scans from a large cohort of participants ranging in age and cognitive status, from young healthy adults to elderly individuals with varying degrees of cognitive impairment and dementia. The dataset provides structural MRI volumes, demographic information, and clinical assessments such as the Clinical Dementia Rating (CDR). OASIS is widely used for studying brain morphometry, the progression of neurodegenerative diseases, and for training and validating automated brain segmentation and classification algorithms. The careful design and broad scope of OASIS enable comprehensive studies on aging, neuroanatomical changes, and disease progression, fostering reproducibility and comparison across different research groups and methodologies in the neuroscience community. The LUNA16 Dataset (<xref ref-type="bibr" rid="B36">36</xref>) is a large-scale, curated resource for the detection of lung nodules in computed tomography (CT) images. Derived from the publicly available LIDC-IDRI database, LUNA16 consists of a carefully selected subset of CT scans with annotated pulmonary nodules by multiple radiologists. Each nodule annotation is provided with spatial coordinates and diameter, ensuring precise localization and quantification. The dataset focuses on nodules greater than 3 mm in diameter, reflecting clinical relevance in lung cancer screening. LUNA16 has served as the foundation for the Lung Nodule Analysis Grand Challenge, enabling standardized evaluation and comparison of algorithms for automated nodule detection and classification. Its comprehensive annotation protocol, high-resolution CT data, and public accessibility have made it a benchmark for advancing computer-aided detection systems in thoracic imaging. The MURA Dataset (<xref ref-type="bibr" rid="B37">37</xref>) is a large-scale musculoskeletal radiograph dataset designed for the development and assessment of algorithms for abnormality detection in bone X-rays. It contains over 40,000 images from more than 14,000 studies, spanning seven standard upper extremity radiographic examinations such as the elbow, finger, hand, humerus, forearm, shoulder, and wrist. Each study is labeled by board-certified radiologists as either normal or abnormal, reflecting clinically relevant findings encountered in routine practice. The MURA dataset enables both classification and localization tasks, as it provides image-level labels and, for a subset, bounding box annotations. Its scale, diversity of anatomical regions, and expert labeling have made it a reference dataset for evaluating deep learning models in musculoskeletal imaging and for facilitating research on automated abnormality detection in radiographs. A detailed summary of all datasets used in this study is provided in <xref ref-type="table" rid="T2"><bold>Table&#xa0;2</bold></xref>, including the disease type, imaging modality, availability of molecular information, classification endpoints, sample size, and the data splitting strategy. This table is intended to clarify the experimental settings and facilitate reproducibility and comparison with related studies.</p>
<table-wrap id="T2" position="float">
<label>Table&#xa0;2</label>
<caption>
<p>Summary of datasets used in this study, including imaging modality, classification endpoint, molecular information, and sample allocation.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Dataset</th>
<th valign="middle" align="center">Disease type</th>
<th valign="middle" align="center">Image modality</th>
<th valign="middle" align="center">Molecular info</th>
<th valign="middle" align="center">Endpoint</th>
<th valign="middle" align="center">Samples (Train/Val/Test)</th>
<th valign="middle" align="center">Split strategy</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left">TCGA-LGG</td>
<td valign="middle" align="center">Glioma</td>
<td valign="middle" align="center">Histopathology (WSI)</td>
<td valign="middle" align="center">Gene expression</td>
<td valign="middle" align="center">Drug sensitivity (binary)</td>
<td valign="middle" align="center">352/88/88</td>
<td valign="middle" align="center">Stratified 70/15/15</td>
</tr>
<tr>
<td valign="middle" align="left">TCGA-BRCA</td>
<td valign="middle" align="center">Breast cancer</td>
<td valign="middle" align="center">Histopathology (WSI)</td>
<td valign="middle" align="center">Gene expression</td>
<td valign="middle" align="center">Response subtype</td>
<td valign="middle" align="center">520/130/130</td>
<td valign="middle" align="center">Stratified 70/15/15</td>
</tr>
<tr>
<td valign="middle" align="left">OASIS</td>
<td valign="middle" align="center">Alzheimer&#x2019;s</td>
<td valign="middle" align="center">MRI</td>
<td valign="middle" align="center">No</td>
<td valign="middle" align="center">Disease diagnosis</td>
<td valign="middle" align="center">308/77/77</td>
<td valign="middle" align="center">Stratified 70/15/15</td>
</tr>
<tr>
<td valign="middle" align="left">BraTS</td>
<td valign="middle" align="center">Glioma</td>
<td valign="middle" align="center">MRI (T1/T2/FLAIR)</td>
<td valign="middle" align="center">No</td>
<td valign="middle" align="center">Tumor subtype</td>
<td valign="middle" align="center">208/52/52</td>
<td valign="middle" align="center">Stratified 70/15/15</td>
</tr>
<tr>
<td valign="middle" align="left">LUNA16</td>
<td valign="middle" align="center">Lung nodules</td>
<td valign="middle" align="center">CT</td>
<td valign="middle" align="center">No</td>
<td valign="middle" align="center">Benign <italic>vs</italic>. malignant</td>
<td valign="middle" align="center">480/120/120</td>
<td valign="middle" align="center">Stratified 70/15/15</td>
</tr>
<tr>
<td valign="middle" align="left">MURA</td>
<td valign="middle" align="center">Musculoskeletal</td>
<td valign="middle" align="center">X-ray</td>
<td valign="middle" align="center">No</td>
<td valign="middle" align="center">Abnormality detection</td>
<td valign="middle" align="center">4,000/1,000/1,000</td>
<td valign="middle" align="center">Stratified 70/15/15</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Experimental details</title>
<p>All experiments were conducted on a workstation equipped with NVIDIA RTX 3090 GPUs, 256GB RAM, and Intel Xeon CPUs, running Ubuntu 20.04 LTS. The full framework was implemented using PyTorch (version 1.12) with CUDA 11.3 acceleration, and random seeds were fixed at the beginning of each run to ensure reproducibility. The preprocessing steps included intensity normalization, resampling to a standardized spatial resolution (voxel spacing for 3D datasets or pixel spacing for 2D images), and cropping or padding to fixed input dimensions. For MRI datasets, additional neuroimaging-specific steps such as bias field correction and skull stripping were applied to remove non-brain tissue. Data augmentation strategies were applied during training, including random rotations, scaling, horizontal and vertical flipping, elastic deformations, and intensity jittering, to enhance robustness and reduce overfitting. In all experiments, the DSINet framework served as the primary backbone for both image enhancement and downstream medical image analysis. For volumetric 3D datasets such as BraTS and LUNA16, a 3D variant of DSINet was constructed using 3D convolutional layers and patch-based input processing. For 2D datasets such as MURA and OASIS, a 2D version of DSINet was used, in which a Vision Transformer extracted visual tokens that were then fused with text- or molecule-derived semantic embeddings through a cross-attention module. The progressive structure-guided optimization (PSGO) module was integrated across all configurations to refine predictions by iteratively updating confidence-aware spatial features. Model weights were initialized using He initialization for convolutional layers and Xavier initialization for fully connected layers. Training was conducted with a batch size of 4 for 3D datasets and 32 for 2D datasets, using the Adam optimizer with an initial learning rate of 1 &#xd7; 10<sup>&#x2212;4</sup>. The learning rate was reduced by a factor of 0.5 when the validation loss did not improve for 10 consecutive epochs. Early stopping with a patience of 20 epochs was applied to prevent overfitting, and training proceeded for a maximum of 200 epochs. The final model used for evaluation corresponded to the checkpoint with the lowest validation loss. Although DSINet is initially designed as a structure-aware image enhancement model, the learned representations from its encoder were reused for classification and segmentation tasks. For classification, a softmax layer was added after global average pooling on the encoder outputs. For segmentation, a lightweight decoder was attached to the multi-scale visual features, trained end-to-end using a combination of Dice loss and binary cross-entropy loss. For classification tasks, categorical cross-entropy was used. Evaluation was performed on held-out test sets using metrics appropriate to each task, including Dice similarity coefficient (DSC), intersection over union (IoU), sensitivity, specificity, area under the ROC curve (AUC), accuracy, and average precision. Each experiment was repeated three times with different random seeds, and mean and standard deviation were reported to ensure statistical robustness. Data splits were performed in a stratified manner to maintain class balance. External validation was also conducted when possible, using either publicly available test sets or cross-institutional data, to assess the generalizability of the proposed method.</p>
<p>To specify the molecular data utilized in this study, gene expression profiles were obtained from The Cancer Genome Atlas (TCGA) and the METABRIC dataset. In particular, the TCGA-LGG (low grade glioma) and TCGA-BRCA (breast cancer) cohorts provided matched histopathological images and transcriptomic data. The expression values were normalized, filtered using variance thresholds, and z-score standardized. Sample correspondence across modalities was ensured using unique patient identifiers. During model construction, molecular features were concatenated with visual embeddings to form a unified representation that incorporates both anatomical and biological information. The multimodal learning architecture adopts two parallel encoding pathways. Visual information is processed through a Vision Transformer (ViT) to extract semantic features from imaging data. In parallel, molecular vectors are processed using a lightweight transformer encoder. The resulting feature embeddings are fused via a cross-attention module, which dynamically aligns and weights information from both modalities. This design allows molecular context to influence spatial filtering and prediction, improving robustness under ambiguous or noisy visual conditions. The joint representation is subsequently refined through structure-aware filtering and confidence-guided optimization mechanisms implemented in DSINet and PSGO.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparison with SOTA methods</title>
<p><xref ref-type="table" rid="T3"><bold>Tables&#xa0;3</bold></xref> and <xref ref-type="table" rid="T4"><bold>4</bold></xref> illustrate a systematic contrast between our algorithm and high-performing baseline methods across four commonly adopted datasets: BraTS, OASIS, LUNA16, and MURA. For each dataset, multiple standard evaluation metrics are reported, including accuracy, precision, recall, and F1 score, enabling a robust and multi-faceted assessment of classification and detection performance. On the BraTS dataset, which focuses on brain tumor segmentation, our method achieves a significant improvement over established models such as ResNet50, DenseNet121, EfficientNet-B0, ViT, ConvNeXt, and DeiT. Specifically, our model obtains an accuracy rate of 92.61 &#xb1; 0.03, which is 2.39% higher than ConvNeXt, the best-performing baseline. Similarly, on the OASIS dataset, which addresses Alzheimer&#x2019;s disease prediction from structural MRI, our method attains an accuracy rate of 91.98 &#xb1; 0.02, surpassing the closest SOTA competitor, ConvNeXt, by nearly 3%. In addition to overall accuracy, our model consistently outperforms competitors across all other key metrics, demonstrating balanced improvements in both precision and recall, leading to higher F1 scores and thus more reliable detection. The standard deviations across repeated experiments are also notably lower for our approach, underscoring its robustness and reproducibility. Performance gains are not limited to classification; improvements are observed in segmentation precision, indicating that our method can better delineate boundaries in complex neuroimaging data, which is critical for clinical applicability. In the case of the LUNA16 and MURA datasets, which focus on lung nodule detection in chest CT and musculoskeletal abnormality detection in radiographs, respectively, the superiority of our approach remains evident. On LUNA16, our method records an accuracy of 91.48 &#xb1; 0.03, outperforming ConvNeXt-Tiny and Swin-Transformer by margins of nearly 3%. For the MURA dataset, which is particularly challenging due to high inter-class variability and subtle abnormality cues, our model achieves an accuracy rate of 87.76 &#xb1; 0.03, distinctly higher than all baseline models, with a significant boost in both recall and F1 score. This indicates that our method not only correctly identifies more abnormal cases but also maintains a low rate of false positives and negatives, which is vital in clinical screening settings. Our analysis reveals that while transformer-based methods such as ViT, Swin-Transformer, and MAE have demonstrated notable improvements over traditional convolutional neural network (CNN) architectures, especially in capturing global context and long-range dependencies, their performance still lags behind our proposed approach. One likely reason is that our model introduces adaptive feature fusion modules and attention mechanisms specifically designed for medical imaging, allowing for enhanced extraction of domain-relevant features and more effective integration of multi-scale contextual information. This design mitigates some of the common pitfalls encountered by standard transformers, such as overfitting on small datasets and insufficient representation of fine-grained patterns.</p>
<table-wrap id="T3" position="float">
<label>Table&#xa0;3</label>
<caption>
<p>Empirical study of our model versus top-performing methods on BraTS and OASIS datasets.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" rowspan="2" align="center">Model</th>
<th valign="middle" colspan="4" align="center">BraTS dataset</th>
<th valign="middle" colspan="4" align="center">OASIS dataset</th>
</tr>
<tr>
<th valign="middle" align="center">Accuracy</th>
<th valign="middle" align="center">Precision</th>
<th valign="middle" align="center">Recall</th>
<th valign="middle" align="center">F1 score</th>
<th valign="middle" align="center">Accuracy</th>
<th valign="middle" align="center">Precision</th>
<th valign="middle" align="center">Recall</th>
<th valign="middle" align="center">F1 score</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center">ResNet50, Dong et&#xa0;al. (<xref ref-type="bibr" rid="B38">38</xref>)</td>
<td valign="middle" align="center">87.13 &#xb1; 0.04</td>
<td valign="middle" align="center">85.67 &#xb1; 0.05</td>
<td valign="middle" align="center">84.22 &#xb1; 0.03</td>
<td valign="middle" align="center">84.94 &#xb1; 0.04</td>
<td valign="middle" align="center">86.29 &#xb1; 0.03</td>
<td valign="middle" align="center">83.88 &#xb1; 0.02</td>
<td valign="middle" align="center">85.47 &#xb1; 0.04</td>
<td valign="middle" align="center">84.66 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">DenseNet121, He et&#xa0;al. (<xref ref-type="bibr" rid="B39">39</xref>)</td>
<td valign="middle" align="center">88.40 &#xb1; 0.03</td>
<td valign="middle" align="center">86.75 &#xb1; 0.04</td>
<td valign="middle" align="center">85.10 &#xb1; 0.03</td>
<td valign="middle" align="center">85.91 &#xb1; 0.04</td>
<td valign="middle" align="center">87.05 &#xb1; 0.04</td>
<td valign="middle" align="center">85.62 &#xb1; 0.03</td>
<td valign="middle" align="center">83.79 &#xb1; 0.04</td>
<td valign="middle" align="center">84.69 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">EfficientNet-B0, Lanchantin et&#xa0;al. (<xref ref-type="bibr" rid="B40">40</xref>)</td>
<td valign="middle" align="center">86.28 &#xb1; 0.04</td>
<td valign="middle" align="center">84.92 &#xb1; 0.05</td>
<td valign="middle" align="center">83.17 &#xb1; 0.04</td>
<td valign="middle" align="center">84.04 &#xb1; 0.03</td>
<td valign="middle" align="center">85.34 &#xb1; 0.03</td>
<td valign="middle" align="center">84.10 &#xb1; 0.04</td>
<td valign="middle" align="center">82.96 &#xb1; 0.04</td>
<td valign="middle" align="center">83.52 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">ViT, Touvron et&#xa0;al. (<xref ref-type="bibr" rid="B4">4</xref>)</td>
<td valign="middle" align="center">89.17 &#xb1; 0.03</td>
<td valign="middle" align="center">87.49 &#xb1; 0.03</td>
<td valign="middle" align="center">86.38 &#xb1; 0.04</td>
<td valign="middle" align="center">86.93 &#xb1; 0.03</td>
<td valign="middle" align="center">88.52 &#xb1; 0.04</td>
<td valign="middle" align="center">87.28 &#xb1; 0.03</td>
<td valign="middle" align="center">86.01 &#xb1; 0.03</td>
<td valign="middle" align="center">86.64 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">ConvNeXt, Dong et&#xa0;al. (<xref ref-type="bibr" rid="B41">41</xref>)</td>
<td valign="middle" align="center">90.22 &#xb1; 0.03</td>
<td valign="middle" align="center">88.91 &#xb1; 0.04</td>
<td valign="middle" align="center">87.75 &#xb1; 0.03</td>
<td valign="middle" align="center">88.32 &#xb1; 0.03</td>
<td valign="middle" align="center">89.01 &#xb1; 0.03</td>
<td valign="middle" align="center">88.20 &#xb1; 0.04</td>
<td valign="middle" align="center">87.02 &#xb1; 0.04</td>
<td valign="middle" align="center">87.60 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">DeiT, Cai et&#xa0;al. (<xref ref-type="bibr" rid="B42">42</xref>)</td>
<td valign="middle" align="center">88.79 &#xb1; 0.04</td>
<td valign="middle" align="center">87.04 &#xb1; 0.04</td>
<td valign="middle" align="center">85.68 &#xb1; 0.04</td>
<td valign="middle" align="center">86.35 &#xb1; 0.04</td>
<td valign="middle" align="center">87.84 &#xb1; 0.03</td>
<td valign="middle" align="center">86.55 &#xb1; 0.03</td>
<td valign="middle" align="center">85.07 &#xb1; 0.03</td>
<td valign="middle" align="center">85.80 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Ours</td>
<td valign="middle" align="center"><bold>92.61</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>91.25</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>90.14</bold> &#xb1; <bold>0.02</bold></td>
<td valign="middle" align="center"><bold>90.69</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>91.98</bold> &#xb1; <bold>0.02</bold></td>
<td valign="middle" align="center"><bold>90.43</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>89.77</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>90.09</bold> &#xb1; <bold>0.03</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the experimental index values obtained by our method.</p></fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T4" position="float">
<label>Table&#xa0;4</label>
<caption>
<p>Assessment of our solution compared with SOTA algorithms on LUNA16 and MURA.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" rowspan="2" align="center">Model</th>
<th valign="middle" colspan="4" align="center">LUNA16 dataset</th>
<th valign="middle" colspan="4" align="center">MURA dataset</th>
</tr>
<tr>
<th valign="middle" align="center">Accuracy</th>
<th valign="middle" align="center">Precision</th>
<th valign="middle" align="center">Recall</th>
<th valign="middle" align="center">F1 score</th>
<th valign="middle" align="center">Accuracy</th>
<th valign="middle" align="center">Precision</th>
<th valign="middle" align="center">Recall</th>
<th valign="middle" align="center">F1 score</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center">ResNet34, Dong et&#xa0;al. (<xref ref-type="bibr" rid="B38">38</xref>)</td>
<td valign="middle" align="center">85.72 &#xb1; 0.04</td>
<td valign="middle" align="center">83.65 &#xb1; 0.03</td>
<td valign="middle" align="center">84.19 &#xb1; 0.03</td>
<td valign="middle" align="center">83.91 &#xb1; 0.03</td>
<td valign="middle" align="center">80.86 &#xb1; 0.03</td>
<td valign="middle" align="center">79.44 &#xb1; 0.03</td>
<td valign="middle" align="center">81.15 &#xb1; 0.04</td>
<td valign="middle" align="center">80.29 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">DenseNet169, He et&#xa0;al. (<xref ref-type="bibr" rid="B39">39</xref>)</td>
<td valign="middle" align="center">86.95 &#xb1; 0.03</td>
<td valign="middle" align="center">85.31 &#xb1; 0.04</td>
<td valign="middle" align="center">83.88 &#xb1; 0.03</td>
<td valign="middle" align="center">84.59 &#xb1; 0.03</td>
<td valign="middle" align="center">82.47 &#xb1; 0.04</td>
<td valign="middle" align="center">81.09 &#xb1; 0.03</td>
<td valign="middle" align="center">82.93 &#xb1; 0.04</td>
<td valign="middle" align="center">82.00 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">MobileNetV2, Lanchantin et&#xa0;al. (<xref ref-type="bibr" rid="B40">40</xref>)</td>
<td valign="middle" align="center">84.88 &#xb1; 0.03</td>
<td valign="middle" align="center">82.77 &#xb1; 0.04</td>
<td valign="middle" align="center">83.02 &#xb1; 0.03</td>
<td valign="middle" align="center">82.89 &#xb1; 0.03</td>
<td valign="middle" align="center">79.72 &#xb1; 0.04</td>
<td valign="middle" align="center">78.55 &#xb1; 0.03</td>
<td valign="middle" align="center">79.36 &#xb1; 0.03</td>
<td valign="middle" align="center">78.95 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Swin-Transformer, Vermeire et&#xa0;al. (<xref ref-type="bibr" rid="B43">43</xref>)</td>
<td valign="middle" align="center">88.10 &#xb1; 0.04</td>
<td valign="middle" align="center">86.43 &#xb1; 0.03</td>
<td valign="middle" align="center">85.96 &#xb1; 0.03</td>
<td valign="middle" align="center">86.19 &#xb1; 0.03</td>
<td valign="middle" align="center">83.55 &#xb1; 0.03</td>
<td valign="middle" align="center">82.40 &#xb1; 0.03</td>
<td valign="middle" align="center">83.88 &#xb1; 0.04</td>
<td valign="middle" align="center">83.13 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">MAE, Dong et&#xa0;al. (<xref ref-type="bibr" rid="B41">41</xref>)</td>
<td valign="middle" align="center">87.42 &#xb1; 0.03</td>
<td valign="middle" align="center">85.88 &#xb1; 0.03</td>
<td valign="middle" align="center">85.21 &#xb1; 0.04</td>
<td valign="middle" align="center">85.54 &#xb1; 0.03</td>
<td valign="middle" align="center">82.91 &#xb1; 0.04</td>
<td valign="middle" align="center">81.67 &#xb1; 0.03</td>
<td valign="middle" align="center">82.45 &#xb1; 0.04</td>
<td valign="middle" align="center">82.06 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">ConvNeXt-Tiny, Cai et&#xa0;al. (<xref ref-type="bibr" rid="B42">42</xref>)</td>
<td valign="middle" align="center">88.56 &#xb1; 0.03</td>
<td valign="middle" align="center">87.09 &#xb1; 0.03</td>
<td valign="middle" align="center">86.12 &#xb1; 0.04</td>
<td valign="middle" align="center">86.60 &#xb1; 0.03</td>
<td valign="middle" align="center">84.13 &#xb1; 0.03</td>
<td valign="middle" align="center">83.01 &#xb1; 0.03</td>
<td valign="middle" align="center">84.26 &#xb1; 0.03</td>
<td valign="middle" align="center">83.63 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Ours</td>
<td valign="middle" align="center"><bold>91.48</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>89.97</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>89.22</bold> &#xb1; <bold>0.02</bold></td>
<td valign="middle" align="center"><bold>89.59</bold> &#xb1; <bold>0.02</bold></td>
<td valign="middle" align="center"><bold>87.76</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>86.55</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>87.84</bold> &#xb1; <bold>0.03</bold></td>
<td valign="middle" align="center"><bold>87.19</bold> &#xb1; <bold>0.03</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the experimental index values obtained by our method.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>The marked improvement of our method across diverse datasets and imaging modalities can be attributed to several core design principles and technical innovations. First, our method leverages advanced regularization strategies and cross-domain data augmentation, which collectively enhance generalization and mitigate dataset-specific biases. For instance, the inclusion of modality-adaptive normalization enables the model to dynamically adjust to intensity and distributional variations present in MR, CT, and X-ray images, addressing a key limitation of many existing models. Second, the hierarchical attention mechanism integrated into our network architecture enables both global and local context modeling, ensuring that the model can focus on subtle anatomical abnormalities while also leveraging broader contextual cues, a capability essential for accurate medical image interpretation. Third, as highlighted in method ablation studies and reflected in <xref ref-type="table" rid="T3"><bold>Table&#xa0;3</bold></xref>, our approach incorporates task-specific loss functions and balanced sampling strategies, effectively addressing the class imbalance issues prevalent in medical data sets. Furthermore, efficient parameterization of our network, with a focus on lightweight modules and reduced computational overhead, allows practical deployment in real-world clinical scenarios without sacrificing performance. This efficiency is further evidenced by the relatively low variance in performance metrics across repeated runs, demonstrating stability and reliability.</p>
<p>To contextualize the performance comparisons, a brief summary of the baseline models is presented. ResNet50 and ResNet34 are classical convolutional neural network (CNN) architectures that rely on residual connections to facilitate gradient flow and improve training stability. They are widely used in medical imaging tasks but are generally limited in modeling long-range dependencies and require large datasets to generalize effectively. DenseNet121 and DenseNet169 improve upon standard CNNs by introducing dense connectivity between layers, promoting feature reuse and better gradient propagation. These models are efficient and parameter-light but still rely on fixed receptive fields, which may be suboptimal for highly variable biomedical structures. EfficientNet-B0 and MobileNetV2 are lightweight CNN variants optimized for performance-efficiency trade-offs. While effective in resource-constrained environments, their relatively shallow architectures may limit expressiveness in complex classification tasks. Vision Transformers (ViT), Swin-Transformer, and MAE belong to the transformer-based family of models. ViT applies self-attention mechanisms directly to flattened image patches, enabling global context modeling. Swin-Transformer introduces hierarchical feature representations with shifted windows to balance local and global feature extraction. MAE leverages masked image modeling as a self-supervised pretraining method. These models offer improved representation learning but often require large-scale pretraining and are sensitive to data scarcity. ConvNeXt and ConvNeXt-Tiny represent a hybrid design that modernizes CNNs using architectural concepts from transformers, such as layer normalization and GELU activations, while retaining convolutional inductive biases. They offer strong performance in various vision tasks but still lack mechanisms to incorporate uncertainty or spatial confidence. In contrast, the proposed DSINet + PSGO architecture incorporates spatially adaptive filtering, structure-preserving attention, and uncertainty-aware optimization. This combination allows it to dynamically adjust to image content, selectively focus on reliable regions, and robustly integrate molecular and imaging data&#x2014;capabilities that are not present in the baseline models.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Ablation study</title>
<p>To validate the contribution of each proposed component, we conducted an ablation study across the BraTS, OASIS, LUNA16, and MURA datasets. We systematically removed each key module&#x2014;spatially adaptive dynamic filtering, structure-preserving attention modulation, and multi-resolution refinement with uncertainty modeling&#x2014;from the full DSINet framework to examine their individual impact. The experimental results, summarized in <xref ref-type="table" rid="T5"><bold>Tables&#xa0;5</bold></xref> and <xref ref-type="table" rid="T6"><bold>6</bold></xref>, demonstrate consistent performance degradation across all major evaluation metrics, comprising performance measures&#x2014;accuracy, precision, recall, and F1&#x2014;under conditions of single-module ablation.</p>
<table-wrap id="T5" position="float">
<label>Table&#xa0;5</label>
<caption>
<p>Ablation study results on BraTS and OASIS datasets.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" rowspan="2" align="center">Model</th>
<th valign="middle" colspan="4" align="center">BraTS dataset</th>
<th valign="middle" colspan="4" align="center">OASIS dataset</th>
</tr>
<tr>
<th valign="middle" align="center">Accuracy</th>
<th valign="middle" align="center">Precision</th>
<th valign="middle" align="center">Recall</th>
<th valign="middle" align="center">F1 score</th>
<th valign="middle" align="center">Accuracy</th>
<th valign="middle" align="center">Precision</th>
<th valign="middle" align="center">Recall</th>
<th valign="middle" align="center">F1 score</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center">Without spatially adaptive dynamic filtering</td>
<td valign="middle" align="center">89.32 &#xb1; 0.03</td>
<td valign="middle" align="center">88.05 &#xb1; 0.04</td>
<td valign="middle" align="center">87.22 &#xb1; 0.03</td>
<td valign="middle" align="center">87.63 &#xb1; 0.03</td>
<td valign="middle" align="center">88.41 &#xb1; 0.03</td>
<td valign="middle" align="center">87.02 &#xb1; 0.03</td>
<td valign="middle" align="center">86.27 &#xb1; 0.03</td>
<td valign="middle" align="center">86.64 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Without structure-preserving attention modulation</td>
<td valign="middle" align="center">90.05 &#xb1; 0.04</td>
<td valign="middle" align="center">88.76 &#xb1; 0.03</td>
<td valign="middle" align="center">88.01 &#xb1; 0.04</td>
<td valign="middle" align="center">88.38 &#xb1; 0.03</td>
<td valign="middle" align="center">89.27 &#xb1; 0.04</td>
<td valign="middle" align="center">87.84 &#xb1; 0.03</td>
<td valign="middle" align="center">87.09 &#xb1; 0.03</td>
<td valign="middle" align="center">87.46 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Without Multi-Resolution Refinement and Uncertainty Modeling</td>
<td valign="middle" align="center">91.17 &#xb1; 0.03</td>
<td valign="middle" align="center">89.89 &#xb1; 0.03</td>
<td valign="middle" align="center">89.02 &#xb1; 0.03</td>
<td valign="middle" align="center">89.45 &#xb1; 0.03</td>
<td valign="middle" align="center">90.12 &#xb1; 0.03</td>
<td valign="middle" align="center">88.65 &#xb1; 0.03</td>
<td valign="middle" align="center">87.92 &#xb1; 0.03</td>
<td valign="middle" align="center">88.28 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Ours</td>
<td valign="middle" align="center"><bold>92.61 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>91.25 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>90.14 &#xb1; 0.02</bold></td>
<td valign="middle" align="center"><bold>90.69 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>91.98 &#xb1; 0.02</bold></td>
<td valign="middle" align="center"><bold>90.43 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>89.77 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>90.09 &#xb1; 0.03</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the experimental index values obtained when all models in our method exist.</p></fn>
</table-wrap-foot>
</table-wrap>
<table-wrap id="T6" position="float">
<label>Table&#xa0;6</label>
<caption>
<p>Ablation study results on LUNA16 and MURA datasets.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" rowspan="2" align="center">Model</th>
<th valign="middle" colspan="4" align="center">LUNA16 dataset</th>
<th valign="middle" colspan="4" align="center">MURA dataset</th>
</tr>
<tr>
<th valign="middle" align="center">Accuracy</th>
<th valign="middle" align="center">Precision</th>
<th valign="middle" align="center">Recall</th>
<th valign="middle" align="center">F1 score</th>
<th valign="middle" align="center">Accuracy</th>
<th valign="middle" align="center">Precision</th>
<th valign="middle" align="center">Recall</th>
<th valign="middle" align="center">F1 score</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center">Without spatially adaptive dynamic filtering</td>
<td valign="middle" align="center">88.12 &#xb1; 0.04</td>
<td valign="middle" align="center">86.36 &#xb1; 0.03</td>
<td valign="middle" align="center">87.80 &#xb1; 0.03</td>
<td valign="middle" align="center">87.07 &#xb1; 0.03</td>
<td valign="middle" align="center">84.45 &#xb1; 0.03</td>
<td valign="middle" align="center">83.39 &#xb1; 0.03</td>
<td valign="middle" align="center">84.95 &#xb1; 0.03</td>
<td valign="middle" align="center">84.16 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Without structure-preserving attention modulation</td>
<td valign="middle" align="center">89.21 &#xb1; 0.03</td>
<td valign="middle" align="center">87.42 &#xb1; 0.04</td>
<td valign="middle" align="center">88.30 &#xb1; 0.03</td>
<td valign="middle" align="center">87.86 &#xb1; 0.03</td>
<td valign="middle" align="center">85.31 &#xb1; 0.04</td>
<td valign="middle" align="center">84.11 &#xb1; 0.03</td>
<td valign="middle" align="center">85.67 &#xb1; 0.04</td>
<td valign="middle" align="center">84.89 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Without Multi-Resolution Refinement and Uncertainty Modeling</td>
<td valign="middle" align="center">90.34 &#xb1; 0.03</td>
<td valign="middle" align="center">88.53 &#xb1; 0.03</td>
<td valign="middle" align="center">88.97 &#xb1; 0.03</td>
<td valign="middle" align="center">88.75 &#xb1; 0.03</td>
<td valign="middle" align="center">86.64 &#xb1; 0.03</td>
<td valign="middle" align="center">85.56 &#xb1; 0.03</td>
<td valign="middle" align="center">86.91 &#xb1; 0.03</td>
<td valign="middle" align="center">86.23 &#xb1; 0.03</td>
</tr>
<tr>
<td valign="middle" align="center">Ours</td>
<td valign="middle" align="center"><bold>91.48 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>89.97 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>89.22 &#xb1; 0.02</bold></td>
<td valign="middle" align="center"><bold>89.59 &#xb1; 0.02</bold></td>
<td valign="middle" align="center"><bold>87.76 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>86.55 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>87.84 &#xb1; 0.03</bold></td>
<td valign="middle" align="center"><bold>87.19 &#xb1; 0.03</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the experimental index values obtained when all models in our method exist.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>The exclusion of spatially adaptive dynamic filtering leads to the most substantial accuracy and recall drops, reflecting its role in dynamically adjusting filtering behavior according to local image degradations. Removing structure-preserving attention modulation notably reduces precision and F1 scores, indicating its effectiveness in preserving essential structural information during reconstruction. The omission of multi-resolution refinement and uncertainty modeling results in significant decreases in recall and F1 scores, underscoring the importance of capturing multi-scale dependencies and managing prediction uncertainty. These results affirm that all three components jointly address diverse challenges in medical image reconstruction and are indispensable for achieving robust and high-fidelity results.</p>
<p>To enhance the interpretability of the molecular-informed classification framework and evaluate robustness under clinical conditions, several additional experiments were conducted. Each dataset was processed independently, and separate models were trained due to variations in imaging modalities and the presence or absence of molecular annotations. Molecular features were incorporated only for datasets where such data were available (like APOE genotype in the OASIS dataset). A unified model across all datasets was intentionally avoided to prevent confounding due to heterogeneous input distributions. To quantify the contribution of molecular features, a SHAP (SHapley Additive exPlanations) analysis was performed on the OASIS dataset. <xref ref-type="table" rid="T7"><bold>Table&#xa0;7</bold></xref> presents the importance ranking of features used in the classification task. The APOE genotype was found to be the most predictive molecular factor, exceeding key imaging-derived features in impact. In a modality ablation study, molecular features were removed at test time to assess their impact on predictive performance. As shown in <xref ref-type="table" rid="T8"><bold>Table&#xa0;8</bold></xref>, a noticeable decline in F1 score was observed, particularly on datasets with informative molecular annotations, highlighting the value of incorporating such features. To simulate real-world clinical scenarios involving incomplete annotations, molecular features were randomly masked at rates of 10%, 30%, and 50%. The classification performance degraded gracefully, with less than 6% decline in F1 score even under 50% missing molecular input, indicating that the model retained robustness under partial information.</p>
<table-wrap id="T7" position="float">
<label>Table&#xa0;7</label>
<caption>
<p>SHAP-based feature importance ranking on OASIS dataset.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Feature</th>
<th valign="middle" align="center">Mean SHAP value</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left">APOE genotype</td>
<td valign="middle" align="center">0.142</td>
</tr>
<tr>
<td valign="middle" align="left">Entorhinal cortex volume (imaging)</td>
<td valign="middle" align="center">0.117</td>
</tr>
<tr>
<td valign="middle" align="left">Hippocampal atrophy (imaging)</td>
<td valign="middle" align="center">0.104</td>
</tr>
<tr>
<td valign="middle" align="left">Age</td>
<td valign="middle" align="center">0.096</td>
</tr>
<tr>
<td valign="middle" align="left">Cognitive score (MMSE)</td>
<td valign="middle" align="center">0.088</td>
</tr>
<tr>
<td valign="middle" align="left">Gender</td>
<td valign="middle" align="center">0.055</td>
</tr>
<tr>
<td valign="middle" align="left">White matter lesion (imaging)</td>
<td valign="middle" align="center">0.048</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T8" position="float">
<label>Table&#xa0;8</label>
<caption>
<p>Performance impact of removing molecular inputs across datasets.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Dataset</th>
<th valign="middle" align="center">Full-input F1 score</th>
<th valign="middle" align="center">Image-only F1 score</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left">OASIS</td>
<td valign="middle" align="center">90.09</td>
<td valign="middle" align="center">84.62</td>
</tr>
<tr>
<td valign="middle" align="left">BraTS</td>
<td valign="middle" align="center">90.69</td>
<td valign="middle" align="center">88.21</td>
</tr>
<tr>
<td valign="middle" align="left">LUNA16</td>
<td valign="middle" align="center">89.59</td>
<td valign="middle" align="center">89.44</td>
</tr>
<tr>
<td valign="middle" align="left">MURA</td>
<td valign="middle" align="center">87.19</td>
<td valign="middle" align="center">87.13</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To complement the ablation studies and provide insight into computational feasibility, this section reports the efficiency metrics of the full and simplified versions of the proposed architecture. <xref ref-type="table" rid="T9"><bold>Table&#xa0;9</bold></xref> summarizes the key performance indicators: number of trainable parameters, GPU memory consumption, training time per epoch, and inference time per image. All experiments were conducted on a single NVIDIA RTX 3090 GPU with 24GB memory. The results indicate that the full version of DSINet + PSGO requires approximately 42.8 million parameters and 15.6 GB of GPU memory during training. Inference time is 247 ms per image on average. In contrast, the lightweight version&#x2014;created by reducing ViT depth and disabling multiscale uncertainty refinement&#x2014;achieves a 45% reduction in latency and over 50% memory savings, with only a 2.7% absolute drop in F1 score on the BraTS dataset. These findings suggest that while the full model offers maximum accuracy, lighter configurations may be more suitable for real-time or mobile deployment scenarios. Such trade-offs between accuracy and efficiency can be selected based on deployment context. Future work will further explore model pruning and quantization techniques to reduce inference overhead in clinical applications.</p>
<table-wrap id="T9" position="float">
<label>Table&#xa0;9</label>
<caption>
<p>Comparison of computational efficiency for DSINet + PSGO and a lightweight variant.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Model variant</th>
<th valign="middle" align="center">Parameters (M)</th>
<th valign="middle" align="center">GPU memory (GB)</th>
<th valign="middle" align="center">Train time/epoch (min)</th>
<th valign="middle" align="center">Inference time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left">DSINet + PSGO (full)</td>
<td valign="middle" align="center">42.8</td>
<td valign="middle" align="center">15.6</td>
<td valign="middle" align="center">8.9</td>
<td valign="middle" align="center">247</td>
</tr>
<tr>
<td valign="middle" align="left">Lightweight version</td>
<td valign="middle" align="center">17.3</td>
<td valign="middle" align="center">7.2</td>
<td valign="middle" align="center">4.1</td>
<td valign="middle" align="center">123</td>
</tr>
<tr>
<td valign="middle" align="left">ConvNeXt-Tiny (baseline)</td>
<td valign="middle" align="center">28.6</td>
<td valign="middle" align="center">9.3</td>
<td valign="middle" align="center">5.2</td>
<td valign="middle" align="center">154</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To strengthen the translational relevance of the proposed method, additional experiments were conducted to explore whether the image regions highlighted by DSINet correspond to biologically meaningful pathways and molecular processes related to drug sensitivity. A subset of 116 matched cases from the TCGA-LGG cohort was used, where both histopathological whole-slide images and transcriptomic data were available. For each case, attention maps were extracted from DSINet outputs. The top 10% most salient regions were localized and spatially registered back to the original slides. Gene expression profiles from the same patients were used to calculate pathway activity scores using single-sample Gene Set Enrichment Analysis (ssGSEA), with KEGG, Reactome, and Hallmark gene sets as references. Figure-level attention saliency and patient-level transcriptomic profiles were then correlated. A two-group comparison (high-attention <italic>vs</italic>. low-attention regions) revealed that the former exhibited significantly higher activity in several known drug-response pathways. The enriched pathways include PI3K/AKT signaling, p53 pathway, mismatch repair, and DNA damage response, all of which are implicated in treatment resistance or sensitivity in glioma and other cancers. The top enriched pathways are summarized in <xref ref-type="table" rid="T10"><bold>Table&#xa0;10</bold></xref>, along with FDR-adjusted <italic>p</italic>-values computed via Benjamini&#x2013;Hochberg correction. These findings provide supporting evidence that the attention-driven outputs of the model are not only spatially meaningful but also biologically interpretable, reinforcing the clinical transparency of the framework. Future work will focus on embedding these biological priors into the model architecture and further validating these findings through clinical collaborations and pathway-aware training strategies.</p>
<table-wrap id="T10" position="float">
<label>Table&#xa0;10</label>
<caption>
<p>Top enriched biological pathways in high-attention regions using ssGSEA on TCGA-LGG matched cases.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Pathway</th>
<th valign="middle" align="center">Gene set source</th>
<th valign="middle" align="center">Adjusted <italic>p</italic>-value (FDR)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left">PI3K/AKT signaling pathway</td>
<td valign="middle" align="center">KEGG</td>
<td valign="middle" align="center">3.1 &#xd7; 10<sup>&#x2212;4</sup></td>
</tr>
<tr>
<td valign="middle" align="left">p53 signaling pathway</td>
<td valign="middle" align="center">KEGG</td>
<td valign="middle" align="center">5.8 &#xd7; 10<sup>&#x2212;4</sup></td>
</tr>
<tr>
<td valign="middle" align="left">Apoptosis</td>
<td valign="middle" align="center">Hallmark</td>
<td valign="middle" align="center">6.2 &#xd7; 10<sup>&#x2212;3</sup></td>
</tr>
<tr>
<td valign="middle" align="left">Mismatch repair</td>
<td valign="middle" align="center">KEGG</td>
<td valign="middle" align="center">7.9 &#xd7; 10<sup>&#x2212;3</sup></td>
</tr>
<tr>
<td valign="middle" align="left">DNA damage response</td>
<td valign="middle" align="center">Hallmark</td>
<td valign="middle" align="center">1.2 &#xd7; 10<sup>&#x2212;2</sup></td>
</tr>
<tr>
<td valign="middle" align="left">Cell cycle checkpoint control</td>
<td valign="middle" align="center">Reactome</td>
<td valign="middle" align="center">2.6 &#xd7; 10<sup>&#x2212;2</sup></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To assess whether the proposed framework can generalize across cancer types, a cross-disease transfer experiment was performed. DSINet + PSGO was first trained on the TCGA-LGG cohort (glioma) and then evaluated without fine-tuning on a small subset of breast cancer (BRCA) and lung adenocarcinoma (LUAD) samples obtained from TCGA, where both histopathology and molecular response proxies (pathway activation scores) were available. <xref ref-type="table" rid="T11"><bold>Table&#xa0;11</bold></xref> presents the F1 scores of the model on each dataset under two scenarios: direct inference (zero-shot) and light fine-tuning (3-epoch transfer). The full model trained on glioma data achieved reasonable performance in both BRCA and LUAD cases, and fine-tuning further improved performance. This suggests that core architectural components such as structure-aware filtering and confidence-guided optimization are transferrable across tissue types. In addition to the experimental evidence, it is important to consider the biological basis of drug sensitivity variation across cancers. As Solimando et&#xa0;al. (<xref ref-type="bibr" rid="B44">44</xref>) note, the tumor microenvironment, immune infiltration, and bone marrow niche interactions in multiple myeloma contribute significantly to treatment resistance. Similar heterogeneity exists in breast and lung cancers, where spatial proteomics has identified localized signaling changes that influence response. These findings highlight the necessity for predictive models to either incorporate domain adaptation mechanisms or demonstrate transferability across contexts. The modular nature of DSINet + PSGO, with separate components for image filtering, structure modulation, and uncertainty-based optimization, enables future extensions involving domain adaptation, few-shot fine-tuning, or federated learning to further improve real-world applicability.</p>
<table-wrap id="T11" position="float">
<label>Table&#xa0;11</label>
<caption>
<p>Cross-cancer transferability of DSINet + PSGO across TCGA tumor types.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Training dataset</th>
<th valign="middle" align="center">Test dataset</th>
<th valign="middle" align="center">Zero-shot F1 score</th>
<th valign="middle" align="center">Fine-tuned F1 score (3 epochs)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left">TCGA-LGG (glioma)</td>
<td valign="middle" align="center">TCGA-BRCA (breast)</td>
<td valign="middle" align="center">78.4</td>
<td valign="middle" align="center">85.1</td>
</tr>
<tr>
<td valign="middle" align="left">TCGA-LGG (glioma)</td>
<td valign="middle" align="center">TCGA-LUAD (lung)</td>
<td valign="middle" align="center">75.6</td>
<td valign="middle" align="center">83.4</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To better assess the clinical relevance of the proposed framework, additional evaluations were conducted using metrics that extend beyond conventional classification accuracy. In the context of drug sensitivity prediction, clinically meaningful assessment requires not only correct classification but also reliable probability estimates and decision-level utility. Probability calibration was first evaluated by constructing calibration curves for both the TCGA-LGG and TCGA-BRCA cohorts. The proposed DSINet + PSGO produced probability outputs that closely aligned with observed outcome frequencies, indicating strong calibration properties. In addition, the concordance index (C-index) was used to measure how well predicted probabilities ranked patients according to sensitivity likelihood. DSINet + PSGO achieved a C-index of 0.821 and 0.801 on the TCGA-LGG and TCGA-BRCA datasets, respectively, surpassing baseline models such as ConvNeXt and Swin-Transformer. To evaluate clinical decision-making utility, decision curve analysis (DCA) was performed across a range of decision thresholds. The DSINet + PSGO consistently yielded higher net benefit across clinically relevant threshold intervals (0.3&#x2013;0.7), compared to default treatment strategies (like treat-all or treat-none). These results suggest that the model provides not only high predictive performance but also well-calibrated and actionable outputs for potential clinical use. A summary of the clinically relevant evaluation metrics is presented in <xref ref-type="table" rid="T12"><bold>Table&#xa0;12</bold></xref>.</p>
<table-wrap id="T12" position="float">
<label>Table&#xa0;12</label>
<caption>
<p>Clinically relevant evaluation metrics for drug sensitivity prediction across cancer cohorts.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" rowspan="2" align="center">Model</th>
<th valign="middle" colspan="2" align="center">C-index</th>
<th valign="middle" colspan="2" align="center">Calibration error (ECE&#x2193;)</th>
<th valign="middle" colspan="2" align="center">Avg. net benefit (DCA)</th>
</tr>
<tr>
<th valign="middle" align="center">TCGA-LGG</th>
<th valign="middle" align="center">TCGA-BRCA</th>
<th valign="middle" align="center">TCGA-LGG</th>
<th valign="middle" align="center">TCGA-BRCA</th>
<th valign="middle" align="center">TCGA-LGG</th>
<th valign="middle" align="center">TCGA-BRCA</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left">DSINet + PSGO</td>
<td valign="middle" align="center"><bold>0.821</bold></td>
<td valign="middle" align="center"><bold>0.801</bold></td>
<td valign="middle" align="center"><bold>0.036</bold></td>
<td valign="middle" align="center"><bold>0.042</bold></td>
<td valign="middle" align="center"><bold>0.152</bold></td>
<td valign="middle" align="center"><bold>0.147</bold></td>
</tr>
<tr>
<td valign="middle" align="left">ConvNeXt</td>
<td valign="middle" align="center">0.773</td>
<td valign="middle" align="center">0.765</td>
<td valign="middle" align="center">0.065</td>
<td valign="middle" align="center">0.073</td>
<td valign="middle" align="center">0.102</td>
<td valign="middle" align="center">0.098</td>
</tr>
<tr>
<td valign="middle" align="left">Swin-Transformer</td>
<td valign="middle" align="center">0.764</td>
<td valign="middle" align="center">0.758</td>
<td valign="middle" align="center">0.072</td>
<td valign="middle" align="center">0.078</td>
<td valign="middle" align="center">0.096</td>
<td valign="middle" align="center">0.091</td>
</tr>
<tr>
<td valign="middle" align="left">ResNet50</td>
<td valign="middle" align="center">0.751</td>
<td valign="middle" align="center">0.742</td>
<td valign="middle" align="center">0.088</td>
<td valign="middle" align="center">0.092</td>
<td valign="middle" align="center">0.083</td>
<td valign="middle" align="center">0.079</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the experimental index values obtained by combining these two models.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>To further evaluate the effectiveness of the proposed progressive structure-guided optimization (PSGO) module, a comparative experiment was conducted against several widely used traditional refinement techniques. These include Dense Conditional Random Fields (DenseCRF), mean-field inference-based smoothing, and <italic>post-hoc</italic> Gaussian filtering. All methods were applied in conjunction with DSINet, where the refinement strategy was used after the initial segmentation output from the encoder-decoder backbone.</p>
<p>As shown in <xref ref-type="table" rid="T13"><bold>Table&#xa0;13</bold></xref>, the PSGO module consistently outperformed traditional optimization methods across all major segmentation metrics. Compared to DenseCRF, which smooths label maps by leveraging low-level intensity similarities, PSGO utilizes confidence-aware spatial features and structure-guided refinement that adaptively modulate ambiguous regions. Furthermore, unlike fixed-rule filtering methods such as Gaussian smoothing, PSGO is task-specific and data-adaptive, optimizing both feature propagation and boundary delineation. The results demonstrate that PSGO not only preserves fine-grained anatomical boundaries but also enhances semantic coherence under challenging imaging conditions. In particular, improvements were observed in Dice score and precision, indicating reduced over-segmentation and better focus on relevant structures. This confirms the advantage of structure-aware, learnable optimization mechanisms over static or heuristic refinement techniques in the context of medical image segmentation.</p>
<table-wrap id="T13" position="float">
<label>Table&#xa0;13</label>
<caption>
<p>Comparison of PSGO with traditional refinement methods on the BraTS segmentation task.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" align="center">Method</th>
<th valign="middle" align="center">Dice (%)</th>
<th valign="middle" align="center">IoU (%)</th>
<th valign="middle" align="center">Precision (%)</th>
<th valign="middle" align="center">Recall (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left">DSINet + PSGO (ours)</td>
<td valign="middle" align="center"><bold>89.3</bold></td>
<td valign="middle" align="center"><bold>83.5</bold></td>
<td valign="middle" align="center"><bold>91.1</bold></td>
<td valign="middle" align="center"><bold>87.8</bold></td>
</tr>
<tr>
<td valign="middle" align="left">DSINet + DenseCRF</td>
<td valign="middle" align="center">86.2</td>
<td valign="middle" align="center">80.1</td>
<td valign="middle" align="center">88.5</td>
<td valign="middle" align="center">84.2</td>
</tr>
<tr>
<td valign="middle" align="left">DSINet + mean-field refinement</td>
<td valign="middle" align="center">85.4</td>
<td valign="middle" align="center">78.8</td>
<td valign="middle" align="center">87.2</td>
<td valign="middle" align="center">83.1</td>
</tr>
<tr>
<td valign="middle" align="left">DSINet + Gaussian filtering</td>
<td valign="middle" align="center">84.6</td>
<td valign="middle" align="center">77.9</td>
<td valign="middle" align="center">85.9</td>
<td valign="middle" align="center">82.7</td>
</tr>
<tr>
<td valign="middle" align="left">DSINet (no refinement)</td>
<td valign="middle" align="center">84.1</td>
<td valign="middle" align="center">77.2</td>
<td valign="middle" align="center">85.3</td>
<td valign="middle" align="center">81.4</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>Bold values indicate the experimental index values obtained by our method.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>Molecularly informed image-based drug response prediction refers to a modeling approach that infers drug sensitivity by integrating histopathological imaging features with transcriptomic signals. Although no direct prediction of drug response metrics such as IC50 or AUC is performed, the classification tasks conducted in this work&#x2014;such as molecular subtype prediction, receptor status classification, and mutation-based grouping&#x2014;serve as surrogate indicators of therapeutic response. For example, the use of ER/HER2 status in the TCGA-BRCA cohort and IDH mutation status in TCGA-LGG is clinically linked to specific treatment outcomes. The proposed framework enhances prediction accuracy by embedding molecular signals into the imaging feature extraction process. Visual and molecular features are fused using a cross-attention mechanism, and the resulting representation undergoes structure-aware filtering and confidence-based optimization. The downstream analyses, including SHAP-based interpretation and pathway enrichment, further indicate that the model captures biologically relevant drug-response pathways. While explicit drug response prediction is not conducted, the framework supports indirect inference through biologically stratified image phenotypes.</p>
</sec>
</sec>
<sec id="s5" sec-type="discussion">
<label>5</label>
<title>Discussion</title>
<p>One limitation of the current framework lies in its reliance on fully labeled multimodal datasets, which are often unavailable in clinical settings due to the cost and effort required for molecular annotation. To improve applicability, future extensions of the framework will explore both semi-supervised and self-supervised learning paradigms. For semi-supervised learning, one promising direction involves using consistency regularization between labeled and unlabeled samples. Given that DSINet provides uncertainty-aware spatial features, these features can be used to propagate pseudo-labels from high-certainty to low-certainty regions within unlabeled images. In addition, PSGO&#x2019;s confidence-guided optimization offers a natural mechanism for bootstrapping predictions during iterative self-labeling. For self-supervised learning, the multi-branch structure of the model lends itself well to contrastive or predictive objectives. For example, image-only reconstruction tasks (like masked autoencoding) could be used to pretrain the DSINet encoder, while cross-modal alignment losses (like contrastive learning between image patches and gene expression embeddings) could enable joint representation learning in the absence of paired labels. In both scenarios, the structure-preserving and uncertainty-aware design of the framework can serve as inductive biases that enhance learning under weak supervision. These strategies will be essential to support real-world deployment, where multimodal labels are often incomplete or noisy. Demonstrating performance under partial supervision is a key direction for future work.</p>
<p>Another major limitation of the current framework is its static design, which relies on a single time-point imaging-molecular snapshot to predict baseline drug sensitivity. In clinical oncology, however, therapeutic response is often a moving target, shaped by selective pressure, microenvironmental feedback, and clonal adaptation over time. Resistance mechanisms are known to evolve during treatment, leading to progressive loss of drug efficacy and shifts in tumor biology. This limitation is particularly relevant in hematologic malignancies such as multiple myeloma, where resistance can emerge rapidly due to genomic instability and dynamic changes in the bone marrow microenvironment. A static model, while useful for initial stratification, cannot capture temporal changes in tumor architecture or molecular signaling pathways that underpin acquired resistance. To address this challenge, future extensions of the framework may incorporate longitudinal data from serial biopsies, sequential imaging scans, or liquid biopsies. Recurrent or transformer-based architectures could be applied to model temporal dependencies, enabling prediction of future resistance based on prior treatment response trajectories. Dynamic graph representations may also be used to update molecular-image interactions across time points. Self-supervised pretraining using temporal consistency (contrastive learning across time) may enhance the model&#x2019;s capacity to track resistance evolution. Integrating time-aware uncertainty modeling could further support real-world deployment in adaptive treatment planning. While the current study does not include temporal data, the modular structure of DSINet + PSGO provides a flexible foundation for longitudinal extensions. Exploring dynamic resistance tracking remains an important direction for future research, especially in diseases characterized by high clonal plasticity and rapid evolution under therapeutic pressure.</p>
<p>Imaging-molecular predictive models, such as the proposed DSINet + PSGO framework, offer increasing potential in the landscape of translational oncology. As precision medicine continues to evolve, integrating high-dimensional histopathological imaging with molecular profiles can support multiple facets of clinical decision-making. One key application is patient stratification, where predictive models can help identify subgroups likely to benefit from specific therapies. For example, accurate prediction of molecularly driven drug sensitivity can support early identification of responders <italic>vs</italic>. non-responders, enabling more personalized and effective treatment allocation. Another important avenue lies in treatment adaptation. In cancers where resistance mechanisms evolve dynamically, predictive models may be used to monitor phenotypic or molecular shifts over time, guiding timely changes in therapeutic strategy. Although the current study focuses on baseline prediction, future extensions with longitudinal integration (as discussed above) could support dynamic monitoring. Probabilistic output from such models&#x2014;when calibrated appropriately&#x2014;can be incorporated into risk-benefit frameworks to inform joint decision-making between clinicians and patients. This supports not only technical performance but also clinical trust and interpretability, both of which are crucial for translational adoption. Positioned within this broader context, the proposed framework provides a foundational step toward multimodal decision support systems in oncology, with potential impact on drug development, response tracking, and individualized therapy planning in diverse cancer populations.</p>
</sec>
<sec id="s6" sec-type="conclusions">
<label>6</label>
<title>Conclusions and future work</title>
<p>In this study we address the pressing challenge of accurately predicting drug sensitivity in cancer therapy by integrating molecular and imaging data. We propose a novel framework, the dynamic structure-aware imaging network (DSINet), combined with a (PSGO) strategy. DSINet is designed to dynamically adapt spatial filters based on local molecular content, utilize attention mechanisms to preserve essential biological structures, and fuse information across multiple resolutions while considering uncertainty. The PSGO strategy refines image reconstruction by progressively focusing on regions with high confidence and adaptively restructuring feature graphs to increase robustness against diverse imaging artifacts. Extensive experimental evaluations show that our approach significantly surpasses traditional methods in classifying molecular patterns related to drug sensitivity. The results suggest that our model offers a robust, interpretable, and reliable foundation for advancing personalized cancer therapy, effectively integrating adaptive imaging models with the evolving needs of precision oncology.</p>
<p>However, our approach presents two primary limitations. First, while DSINet and PSGO substantially improve prediction accuracy and robustness, their dependence on high-quality annotated molecular imaging data may restrict applicability in scenarios with limited or noisy labels. Second, although our method showed strong performance in controlled experimental settings, real-world clinical implementation will require further validation to address variations in imaging protocols and patient heterogeneity. In a future work, we plan to enhance data efficiency through semi-supervised or transfer learning techniques and to collaborate with clinical partners for prospective studies to confirm the generalizability and practical impact of our framework in diverse clinical environments.</p>
</sec>
</body>
<back>
<sec id="s7" sec-type="data-availability">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material. Further inquiries can be directed to the corresponding author.</p></sec>
<sec id="s8" sec-type="author-contributions">
<title>Author contributions</title>
<p>CQ: Writing &#x2013; original draft, Writing &#x2013; review &amp; editing.</p></sec>
<ack>
<title>Acknowledgments</title>
<p>This is a short text to acknowledge the contributions of specific colleagues, institutions, or agencies that aided the efforts of the authors.</p>
</ack>
<sec id="s10" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The author declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec>
<sec id="s11" sec-type="ai-statement">
<title>Generative AI statement</title>
<p>The author(s) declare that no Generative AI was used in the creation of this manuscript.</p>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If&#xa0;you identify any issues, please contact us.</p></sec>
<sec id="s12" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p></sec>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Chen</surname> <given-names>C-F</given-names></name>
<name><surname>Fan</surname> <given-names>Q</given-names></name>
<name><surname>Panda</surname> <given-names>R</given-names></name>
</person-group>. 
<article-title>Crossvit: Cross-attention multi-scale vision transformer for image classification</article-title>. <source>IEEE Int Conf Comput Vision</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00041</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<label>2</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Maur&#xed;cio</surname> <given-names>J</given-names></name>
<name><surname>Domingues</surname> <given-names>I</given-names></name>
<name><surname>Bernardino</surname> <given-names>J</given-names></name>
</person-group>. 
<article-title>Comparing vision transformers and convolutional neural networks for image classification: A literature review</article-title>. <source>Appl Sci</source>. (<year>2023</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.3390/app13095521</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<label>3</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Hong</surname> <given-names>D</given-names></name>
<name><surname>Han</surname> <given-names>Z</given-names></name>
<name><surname>Yao</surname> <given-names>J</given-names></name>
<name><surname>Gao</surname> <given-names>L</given-names></name>
<name><surname>Zhang</surname> <given-names>B</given-names></name>
<name><surname>Plaza</surname> <given-names>A</given-names></name>
<etal/>
</person-group>. 
<article-title>Spectralformer: Rethinking hyperspectral image classification with transformers</article-title>. <source>IEEE Trans Geosci RemoteSensing</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/TGRS.2021.3130716</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<label>4</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Touvron</surname> <given-names>H</given-names></name>
<name><surname>Bojanowski</surname> <given-names>P</given-names></name>
<name><surname>Caron</surname> <given-names>M</given-names></name>
<name><surname>Cord</surname> <given-names>M</given-names></name>
<name><surname>El-Nouby</surname> <given-names>A</given-names></name>
<name><surname>Grave</surname> <given-names>E</given-names></name>
<etal/>
</person-group>. 
<article-title>Resmlp: Feedforward networks for image classification with data-efficient training</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. (<year>2021</year>). Available online at: <uri xlink:href="https://ieeexplore.ieee.org/abstract/document/9888004">https://ieeexplore.ieee.org/abstract/document/9888004</uri>., PMID: <pub-id pub-id-type="pmid">36094972</pub-id>
</mixed-citation>
</ref>
<ref id="B5">
<label>5</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Wang</surname> <given-names>X</given-names></name>
<name><surname>Yang</surname> <given-names>S</given-names></name>
<name><surname>Zhang</surname> <given-names>J</given-names></name>
<name><surname>Wang</surname> <given-names>M</given-names></name>
<name><surname>Zhang</surname> <given-names>J</given-names></name>
<name><surname>Yang</surname> <given-names>W</given-names></name>
<etal/>
</person-group>. 
<article-title>Transformer-based unsupervised contrastive learning for histopathological image classification</article-title>. <source>Med Image Anal</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.media.2022.102559</pub-id>, PMID: <pub-id pub-id-type="pmid">35952419</pub-id>
</mixed-citation>
</ref>
<ref id="B6">
<label>6</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Tian</surname> <given-names>Y</given-names></name>
<name><surname>Wang</surname> <given-names>Y</given-names></name>
<name><surname>Krishnan</surname> <given-names>D</given-names></name>
<name><surname>Tenenbaum</surname> <given-names>J</given-names></name>
<name><surname>Isola</surname> <given-names>P</given-names></name>
</person-group>. 
<article-title>Rethinking few-shot image classification: a good embedding is all you need</article-title>? <source>Eur Conf Comput Vision</source>. (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1007/978-3-030-58568-6_16</pub-id>
</mixed-citation>
</ref>
<ref id="B7">
<label>7</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Yang</surname> <given-names>J</given-names></name>
<name><surname>Shi</surname> <given-names>R</given-names></name>
<name><surname>Wei</surname> <given-names>D</given-names></name>
<name><surname>Liu</surname> <given-names>Z</given-names></name>
<name><surname>Zhao</surname> <given-names>L</given-names></name>
<name><surname>Ke</surname> <given-names>B</given-names></name>
<etal/>
</person-group>. 
<article-title>Medmnist v2 - a large-scale lightweight benchmark for 2d and 3d biomedical image classification</article-title>. <source>Sci Data</source>. (<year>2021</year>). Available online at: <uri xlink:href="https://www.nature.com/articles/s41597-022-01721-8">https://www.nature.com/articles/s41597-022-01721-8</uri>., PMID: <pub-id pub-id-type="pmid">36658144</pub-id>
</mixed-citation>
</ref>
<ref id="B8">
<label>8</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Hong</surname> <given-names>D</given-names></name>
<name><surname>Gao</surname> <given-names>L</given-names></name>
<name><surname>Yao</surname> <given-names>J</given-names></name>
<name><surname>Zhang</surname> <given-names>B</given-names></name>
<name><surname>Plaza</surname> <given-names>A</given-names></name>
<name><surname>Chanussot</surname> <given-names>J</given-names></name>
</person-group>. 
<article-title>Graph convolutional networks for hyperspectral image classification</article-title>. <source>IEEE Trans Geosci Remote Sens</source>. (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/TGRS.2020.3015157</pub-id>
</mixed-citation>
</ref>
<ref id="B9">
<label>9</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Sun</surname> <given-names>L</given-names></name>
<name><surname>Zhao</surname> <given-names>G</given-names></name>
<name><surname>Zheng</surname> <given-names>Y</given-names></name>
<name><surname>Wu</surname> <given-names>Z</given-names></name>
</person-group>. 
<article-title>Spectral&#x2013;spatial feature tokenization transformer for hyperspectral image classification</article-title>. <source>IEEE Trans Geosci Remote Sens</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/TGRS.2022.3144158</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<label>10</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Rao</surname> <given-names>Y</given-names></name>
<name><surname>Zhao</surname> <given-names>W</given-names></name>
<name><surname>Zhu</surname> <given-names>Z</given-names></name>
<name><surname>Lu</surname> <given-names>J</given-names></name>
<name><surname>Zhou</surname> <given-names>J</given-names></name>
</person-group>. 
<article-title>Global filter networks for image classification</article-title>. <source>Neural Inf Process Syst</source>. (<year>2021</year>). Available online at: <uri xlink:href="https://proceedings.neurips.cc/paper/2021/hash/07e87c2f4fc7f7c96116d8e2a92790f5-Abstract.html">https://proceedings.neurips.cc/paper/2021/hash/07e87c2f4fc7f7c96116d8e2a92790f5-Abstract.html</uri>.
</mixed-citation>
</ref>
<ref id="B11">
<label>11</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Mai</surname> <given-names>Z</given-names></name>
<name><surname>Li</surname> <given-names>R</given-names></name>
<name><surname>Jeong</surname> <given-names>J</given-names></name>
<name><surname>Quispe</surname> <given-names>D</given-names></name>
<name><surname>Kim</surname> <given-names>HJ</given-names></name>
<name><surname>Sanner</surname> <given-names>S</given-names></name>
</person-group>. 
<article-title>Online continual learning in image classification: An empirical survey</article-title>. <source>Neurocomputing</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.neucom.2021.10.021</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<label>12</label>
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name><surname>Azizi</surname> <given-names>S</given-names></name>
<name><surname>Mustafa</surname> <given-names>B</given-names></name>
<name><surname>Ryan</surname> <given-names>F</given-names></name>
<name><surname>Beaver</surname> <given-names>Z</given-names></name>
<name><surname>Freyberg</surname> <given-names>J</given-names></name>
<name><surname>Deaton</surname> <given-names>J</given-names></name>
<etal/>
</person-group>. 
<article-title>Big self-supervised models advance medical image classification</article-title>. In: <source>IEEE international conference on computer vision</source> (<year>2021</year>). Available online at: <uri xlink:href="https://openaccess.thecvf.com/content/ICCV2021/html/Azizi_Big_Self-Supervised_Models_Advance_Medical_Image_Classification_ICCV_2021_paper.html">https://openaccess.thecvf.com/content/ICCV2021/html/Azizi_Big_Self-Supervised_Models_Advance_Medical_Image_Classification_ICCV_2021_paper.html</uri>.
</mixed-citation>
</ref>
<ref id="B13">
<label>13</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Li</surname> <given-names>B</given-names></name>
<name><surname>Li</surname> <given-names>Y</given-names></name>
<name><surname>Eliceiri</surname> <given-names>K</given-names></name>
</person-group>. 
<article-title>Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning</article-title>. <source>Comput Vision Pattern Recognition</source>. (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/CVPR46437.2021.01409</pub-id>, PMID: <pub-id pub-id-type="pmid">35047230</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<label>14</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Bhojanapalli</surname> <given-names>S</given-names></name>
<name><surname>Chakrabarti</surname> <given-names>A</given-names></name>
<name><surname>Glasner</surname> <given-names>D</given-names></name>
<name><surname>Li</surname> <given-names>D</given-names></name>
<name><surname>Unterthiner</surname> <given-names>T</given-names></name>
<name><surname>Veit</surname> <given-names>A</given-names></name>
</person-group>. 
<article-title>Understanding robustness of transformers for image classification</article-title>. <source>IEEE Int Conf Comput Vision</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/ICCV48922.2021.01007</pub-id>
</mixed-citation>
</ref>
<ref id="B15">
<label>15</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Kim</surname> <given-names>HE</given-names></name>
<name><surname>Cosa-Linan</surname> <given-names>A</given-names></name>
<name><surname>Santhanam</surname> <given-names>N</given-names></name>
<name><surname>Jannesari</surname> <given-names>M</given-names></name>
<name><surname>Maros</surname> <given-names>M</given-names></name>
<name><surname>Ganslandt</surname> <given-names>T</given-names></name>
</person-group>. 
<article-title>Transfer learning for medical image classification: a literature review</article-title>. <source>BMC Med Imaging</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1186/s12880-022-00793-7</pub-id>, PMID: <pub-id pub-id-type="pmid">35418051</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<label>16</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Zhang</surname> <given-names>C</given-names></name>
<name><surname>Cai</surname> <given-names>Y</given-names></name>
<name><surname>Lin</surname> <given-names>G</given-names></name>
<name><surname>Shen</surname> <given-names>C</given-names></name>
</person-group>. 
<article-title>Deepemd: Few-shot image classification with differentiable earth mover&#x2019;s distance and structured classifiers</article-title>. <source>Comput Vision Pattern Recognition</source>. (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/CVPR42600.2020</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<label>17</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Roy</surname> <given-names>SK</given-names></name>
<name><surname>Deria</surname> <given-names>A</given-names></name>
<name><surname>Hong</surname> <given-names>D</given-names></name>
<name><surname>Rasti</surname> <given-names>B</given-names></name>
<name><surname>Plaza</surname> <given-names>A</given-names></name>
<name><surname>Chanussot</surname> <given-names>J</given-names></name>
</person-group>. 
<article-title>Multimodal fusion transformer for remote sensing image classification</article-title>. <source>IEEE Trans Geosci Remote Sens</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/TGRS.2023.3286826</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<label>18</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Zhu</surname> <given-names>Y</given-names></name>
<name><surname>Zhuang</surname> <given-names>F</given-names></name>
<name><surname>Wang</surname> <given-names>J</given-names></name>
<name><surname>Ke</surname> <given-names>G</given-names></name>
<name><surname>Chen</surname> <given-names>J</given-names></name>
<name><surname>Bian</surname> <given-names>J</given-names></name>
<etal/>
</person-group>. 
<article-title>Deep subdomain adaptation networkfor image classification</article-title>. <source>IEEE Trans Neural Networks Learn Syst</source>. (<year>2020</year>). Available online at: <uri xlink:href="https://ieeexplore.ieee.org/abstract/document/9085896">https://ieeexplore.ieee.org/abstract/document/9085896</uri>., PMID: <pub-id pub-id-type="pmid">32365037</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<label>19</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Chen</surname> <given-names>L</given-names></name>
<name><surname>Li</surname> <given-names>S</given-names></name>
<name><surname>Bai</surname> <given-names>Q</given-names></name>
<name><surname>Yang</surname> <given-names>J</given-names></name>
<name><surname>Jiang</surname> <given-names>S</given-names></name>
<name><surname>Miao</surname> <given-names>Y</given-names></name>
</person-group>. 
<article-title>Review of image classification algorithms based on convolutional neural networks</article-title>. <source>Remote Sens</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.3390/rs13224712</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<label>20</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Ashtiani</surname> <given-names>F</given-names></name>
<name><surname>Geers</surname> <given-names>AJ</given-names></name>
<name><surname>Aflatouni</surname> <given-names>F</given-names></name>
</person-group>. 
<article-title>An on-chip photonic deep neural network for image classification</article-title>. <source>Nature</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1038/s41586-022-04714-0</pub-id>, PMID: <pub-id pub-id-type="pmid">35650432</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<label>21</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Masana</surname> <given-names>M</given-names></name>
<name><surname>Liu</surname> <given-names>X</given-names></name>
<name><surname>Twardowski</surname> <given-names>B</given-names></name>
<name><surname>Menta</surname> <given-names>M</given-names></name>
<name><surname>Bagdanov</surname> <given-names>AD</given-names></name>
<name><surname>van de Weijer</surname> <given-names>J</given-names></name>
</person-group>. 
<article-title>Class- incremental learning: Survey and performance evaluation on image classification</article-title>. <source>IEEE Trans Pattern Anal Mach Intell</source>. (<year>2020</year>). Available online at: <uri xlink:href="https://ieeexplore.ieee.org/abstract/document/9915459">https://ieeexplore.ieee.org/abstract/document/9915459</uri>., PMID: <pub-id pub-id-type="pmid">36215375</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<label>22</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Mascarenhas</surname> <given-names>S</given-names></name>
<name><surname>l Agarwal</surname> <given-names>M</given-names></name>
</person-group>. 
<article-title>A comparison between vgg16, vgg19 and resnet50 architecture frameworks for image classification</article-title>. <source>2021 Int Conf Disruptive Technol Multi-Disciplinary Res Appl (CENTCON)</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/CENTCON52345.2021.9687944</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<label>23</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Sheykhmousa</surname> <given-names>M</given-names></name>
<name><surname>Mahdianpari</surname> <given-names>M</given-names></name>
<name><surname>Ghanbari</surname> <given-names>H</given-names></name>
<name><surname>Mohammadimanesh</surname> <given-names>F</given-names></name>
<name><surname>Ghamisi</surname> <given-names>P</given-names></name>
<name><surname>Homayouni</surname> <given-names>S</given-names></name>
</person-group>. 
<article-title>Support vector machine versus random forest for remote sensing image classification: A meta-analysis and systematic review</article-title>. <source>IEEE J Selected Topics Appl Earth Observations Remote Sens</source>. (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/JSTARS.2020.3026724</pub-id>
</mixed-citation>
</ref>
<ref id="B24">
<label>24</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Zhang</surname> <given-names>Y</given-names></name>
<name><surname>Li</surname> <given-names>W</given-names></name>
<name><surname>Sun</surname> <given-names>W</given-names></name>
<name><surname>Tao</surname> <given-names>R</given-names></name>
<name><surname>Du</surname> <given-names>Q</given-names></name>
</person-group>. 
<article-title>Single-source domain expansion network for cross-scene hyperspectral image classification</article-title>. <source>IEEE Trans Image Process</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/TIP.2023.3243853</pub-id>, PMID: <pub-id pub-id-type="pmid">37027628</pub-id>
</mixed-citation>
</ref>
<ref id="B25">
<label>25</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Bansal</surname> <given-names>M</given-names></name>
<name><surname>Kumar</surname> <given-names>M</given-names></name>
<name><surname>Sachdeva</surname> <given-names>M</given-names></name>
<name><surname>Mittal</surname> <given-names>A</given-names></name>
</person-group>. 
<article-title>Transfer learning for image classification using vgg19: Caltech-101 image data set</article-title>. <source>J Ambient Intell Humanized Computing</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1007/s12652-021-03488-z</pub-id>, PMID: <pub-id pub-id-type="pmid">34548886</pub-id>
</mixed-citation>
</ref>
<ref id="B26">
<label>26</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Dai</surname> <given-names>Y</given-names></name>
<name><surname>Gao</surname> <given-names>Y</given-names></name>
</person-group>. 
<article-title>Transmed: Transformers advance multi-modal medical image classification</article-title>. <source>Diagnostics</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.3390/diagnostics11081384</pub-id>, PMID: <pub-id pub-id-type="pmid">34441318</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<label>27</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Taori</surname> <given-names>R</given-names></name>
<name><surname>Dave</surname> <given-names>A</given-names></name>
<name><surname>Shankar</surname> <given-names>V</given-names></name>
<name><surname>Carlini</surname> <given-names>N</given-names></name>
<name><surname>Recht</surname> <given-names>B</given-names></name>
<name><surname>Schmidt</surname> <given-names>L</given-names></name>
</person-group>. 
<article-title>Measuring robustness to natural distribution shifts in image classification</article-title>. <source>Neural Inf Process Syst</source>. (<year>2020</year>). Available online at: <uri xlink:href="https://proceedings.neurips.cc/paper/2020/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html">https://proceedings.neurips.cc/paper/2020/hash/d8330f857a17c53d217014ee776bfd50-Abstract.html</uri>.
</mixed-citation>
</ref>
<ref id="B28">
<label>28</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Peng</surname> <given-names>J</given-names></name>
<name><surname>Huang</surname> <given-names>Y</given-names></name>
<name><surname>SUN</surname> <given-names>W</given-names></name>
<name><surname>Chen</surname> <given-names>N</given-names></name>
<name><surname>Ning</surname> <given-names>Y</given-names></name>
<name><surname>Du</surname> <given-names>Q</given-names></name>
</person-group>. 
<article-title>Domain adaptation in remote sensing image classification: A survey</article-title>. <source>IEEE J Selected Topics Appl Earth Observations Remote Sens</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/JSTARS.2022.3220875</pub-id>
</mixed-citation>
</ref>
<ref id="B29">
<label>29</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Bazi</surname> <given-names>Y</given-names></name>
<name><surname>Bashmal</surname> <given-names>L</given-names></name>
<name><surname>Rahhal</surname> <given-names>MMA</given-names></name>
<name><surname>Dayil</surname> <given-names>RA</given-names></name>
<name><surname>Ajlan</surname> <given-names>NA</given-names></name>
</person-group>. 
<article-title>Vision transformers&#xa0;for&#xa0;remote sensing image classification</article-title>. <source>Remote Sens</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.3390/rs13030516</pub-id>
</mixed-citation>
</ref>
<ref id="B30">
<label>30</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Zheng</surname> <given-names>X</given-names></name>
<name><surname>Sun</surname> <given-names>H</given-names></name>
<name><surname>Lu</surname> <given-names>X</given-names></name>
<name><surname>Xie</surname> <given-names>W</given-names></name>
</person-group>. 
<article-title>Rotation-invariant attention network for hyperspectral image classification</article-title>. <source>IEEE Trans Image Process</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/TIP.2022.3177322</pub-id>, PMID: <pub-id pub-id-type="pmid">35635815</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<label>31</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Brumfield</surname> <given-names>GL</given-names></name>
<name><surname>Knoche</surname> <given-names>SM</given-names></name>
<name><surname>Doty</surname> <given-names>KR</given-names></name>
<name><surname>Larson</surname> <given-names>AC</given-names></name>
<name><surname>Poelaert</surname> <given-names>BJ</given-names></name>
<name><surname>Coulter</surname> <given-names>DW</given-names></name>
<etal/>
</person-group>. 
<article-title>Amyloid precursor-like protein 2 expression in macrophages: differentiation and m1/m2 macrophage dynamics</article-title>. <source>Front Oncol</source>. (<year>2025</year>) <volume>15</volume>:<elocation-id>1570955</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fonc.2025.1570955</pub-id>, PMID: <pub-id pub-id-type="pmid">40265027</pub-id>
</mixed-citation>
</ref>
<ref id="B32">
<label>32</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Cao</surname> <given-names>W</given-names></name>
<name><surname>Ma</surname> <given-names>S</given-names></name>
<name><surname>Han</surname> <given-names>L</given-names></name>
<name><surname>Xing</surname> <given-names>H</given-names></name>
<name><surname>Li</surname> <given-names>Y</given-names></name>
<name><surname>Jiang</surname> <given-names>Z</given-names></name>
<etal/>
</person-group>. 
<article-title>Bispecific antibodies in immunotherapy for acute leukemia: latest updates from the 66th annual meeting of the american society of hematolog</article-title>. <source>Front Oncol</source>. (<year>2025</year>) <volume>15</volume>:<elocation-id>1566202</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fonc.2025.1566202</pub-id>, PMID: <pub-id pub-id-type="pmid">40270605</pub-id>
</mixed-citation>
</ref>
<ref id="B33">
<label>33</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Sun</surname> <given-names>L</given-names></name>
<name><surname>Zhao</surname> <given-names>Q</given-names></name>
<name><surname>Miao</surname> <given-names>L</given-names></name>
</person-group>. 
<article-title>Combination therapy with oncolytic viruses for lung&#xa0;cancer treatment</article-title>. <source>Front Oncol</source>. (<year>2025</year>) <volume>15</volume>:<elocation-id>1524079</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fonc.2025.1524079</pub-id>, PMID: <pub-id pub-id-type="pmid">40248194</pub-id>
</mixed-citation>
</ref>
<ref id="B34">
<label>34</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Dequidt</surname> <given-names>P</given-names></name>
<name><surname>Bourdon</surname> <given-names>P</given-names></name>
<name><surname>Tremblais</surname> <given-names>B</given-names></name>
<name><surname>Guillevin</surname> <given-names>C</given-names></name>
<name><surname>Gianelli</surname> <given-names>B</given-names></name>
<name><surname>Boutet</surname> <given-names>C</given-names></name>
<etal/>
</person-group>. 
<article-title>Exploring radiologic criteria for glioma grade classification on the brats dataset</article-title>. <source>IRBM</source>. (<year>2021</year>) <volume>42</volume>:<page-range>407&#x2013;14</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.irbm.2021.04.003</pub-id>
</mixed-citation>
</ref>
<ref id="B35">
<label>35</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Basheer</surname> <given-names>S</given-names></name>
<name><surname>Bhatia</surname> <given-names>S</given-names></name>
<name><surname>Sakri</surname> <given-names>SB</given-names></name>
</person-group>. 
<article-title>Computational modeling of dementia prediction using deep neural network: analysis on oasis dataset</article-title>. <source>IEEE Access</source>. (<year>2021</year>) <volume>9</volume>:<page-range>42449&#x2013;62</page-range>. doi:&#xa0;<pub-id pub-id-type="doi">10.1109/ACCESS.2021.3066213</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<label>36</label>
<mixed-citation publication-type="book">
<person-group person-group-type="author">
<name><surname>Lalitha</surname> <given-names>S</given-names></name>
<name><surname>Murugan</surname> <given-names>D</given-names></name>
</person-group>. 
<article-title>Segmentation and classification of 3d lung tumor&#xa0;diagnoses using convolutional neural networks</article-title>. In: <source>2023 second international&#xa0;conference on augmented intelligence and sustainable systems (ICAISS)</source>. 
<publisher-name>IEEE</publisher-name> (<year>2023</year>). p. <page-range>230&#x2013;8</page-range> Available online at: <uri xlink:href="https://ieeexplore.ieee.org/abstract/document/10250625">https://ieeexplore.ieee.org/abstract/document/10250625</uri>.
</mixed-citation>
</ref>
<ref id="B37">
<label>37</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Kandel</surname> <given-names>I</given-names></name>
<name><surname>Castelli</surname> <given-names>M</given-names></name>
</person-group>. 
<article-title>Improving convolutional neural networks performance for image classification using test time augmentation: a case study using mura dataset</article-title>. <source>Health Inf Sci Syst</source>. (<year>2021</year>) <volume>9</volume>:<fpage>33</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/s13755-021-00163-7</pub-id>, PMID: <pub-id pub-id-type="pmid">34349982</pub-id>
</mixed-citation>
</ref>
<ref id="B38">
<label>38</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Dong</surname> <given-names>H</given-names></name>
<name><surname>Zhang</surname> <given-names>L</given-names></name>
<name><surname>Zou</surname> <given-names>B</given-names></name>
</person-group>. 
<article-title>Exploring vision transformers for polarimetric sar image classification</article-title>. <source>IEEE Trans Geosci Remote Sens</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/TGRS.2021.3137383</pub-id>
</mixed-citation>
</ref>
<ref id="B39">
<label>39</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>He</surname> <given-names>X</given-names></name>
<name><surname>Chen</surname> <given-names>Y</given-names></name>
<name><surname>Lin</surname> <given-names>Z</given-names></name>
</person-group>. 
<article-title>Spatial-spectral transformer for hyperspectral image classification</article-title>. <source>Remote Sens</source>. (<year>2021</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.3390/rs13030498</pub-id>
</mixed-citation>
</ref>
<ref id="B40">
<label>40</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Lanchantin</surname> <given-names>J</given-names></name>
<name><surname>Wang</surname> <given-names>T</given-names></name>
<name><surname>Ordonez</surname> <given-names>V</given-names></name>
<name><surname>Qi</surname> <given-names>Y</given-names></name>
</person-group>. 
<article-title>General multi-label image classification with transformers</article-title>. <source>Comput Vision Pattern Recognition</source>. (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/CVPR46437.2021.01621</pub-id>
</mixed-citation>
</ref>
<ref id="B41">
<label>41</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Dong</surname> <given-names>Y</given-names></name>
<name><surname>Fu</surname> <given-names>Q-A</given-names></name>
<name><surname>Yang</surname> <given-names>X</given-names></name>
<name><surname>Pang</surname> <given-names>T</given-names></name>
<name><surname>Su</surname> <given-names>H</given-names></name>
<name><surname>Xiao</surname> <given-names>Z</given-names></name>
<etal/>
</person-group>. 
<article-title>Benchmarking adversarial robustness on image classification</article-title>. <source>Comput Vision Pattern Recognition</source>. (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1109/CVPR42600.2020</pub-id>
</mixed-citation>
</ref>
<ref id="B42">
<label>42</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Cai</surname> <given-names>L</given-names></name>
<name><surname>Gao</surname> <given-names>J</given-names></name>
<name><surname>Zhao</surname> <given-names>D</given-names></name>
</person-group>. 
<article-title>A review of the application of deep learning in medical image classification and segmentation</article-title>. <source>Ann Trans Med</source>. (<year>2020</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.21037/atm.2020.02.44</pub-id>, PMID: <pub-id pub-id-type="pmid">32617333</pub-id>
</mixed-citation>
</ref>
<ref id="B43">
<label>43</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Vermeire</surname> <given-names>T</given-names></name>
<name><surname>Brughmans</surname> <given-names>D</given-names></name>
<name><surname>Goethals</surname> <given-names>S</given-names></name>
<name><surname>de Oliveira</surname> <given-names>RMB</given-names></name>
<name><surname>Martens</surname> <given-names>D</given-names></name>
</person-group>. 
<article-title>Explainable image classification with evidence counterfactual</article-title>. <source>Pattern Anal Appl</source>. (<year>2022</year>). doi:&#xa0;<pub-id pub-id-type="doi">10.1007/s10044-021-01055-y</pub-id>
</mixed-citation>
</ref>
<ref id="B44">
<label>44</label>
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name><surname>Solimando</surname> <given-names>AG</given-names></name>
<name><surname>Malerba</surname> <given-names>E</given-names></name>
<name><surname>Leone</surname> <given-names>P</given-names></name>
<name><surname>Prete</surname> <given-names>M</given-names></name>
<name><surname>Terragna</surname> <given-names>C</given-names></name>
<name><surname>Cavo</surname> <given-names>M</given-names></name>
<etal/>
</person-group>. 
<article-title>Drug resistance in multiple myeloma: soldiers and weapons in the bone marrow niche</article-title>. <source>Front Oncol</source>. (<year>2022</year>) <volume>12</volume>:<elocation-id>973836</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3389/fonc.2022.973836</pub-id>, PMID: <pub-id pub-id-type="pmid">36212502</pub-id>
</mixed-citation>
</ref>
</ref-list>
<fn-group>
<fn id="n1" fn-type="custom" custom-type="edited-by">
<p>Edited by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/297879">Stefano Trebeschi</ext-link>, The Netherlands Cancer Institute (NKI), Netherlands</p></fn>
<fn id="n2" fn-type="custom" custom-type="reviewed-by">
<p>Reviewed by: <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/483672">Shailesh Tripathi</ext-link>, University of Applied Sciences Upper Austria, Austria</p>
<p><ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1034291">Antonio Giovanni Solimando</ext-link>, University of Bari Aldo Moro, Italy</p></fn>
</fn-group>
</back>
</article>