<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Med.</journal-id>
<journal-title>Frontiers in Medicine</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Med.</abbrev-journal-title>
<issn pub-type="epub">2296-858X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmed.2025.1661984</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Medicine</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Multi-interactive feature embedding learning for medical image segmentation</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Huang</surname> <given-names>Yijia</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/3123537/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Luo</surname> <given-names>Yue</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/3204911/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/funding-acquisition/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Public Health, Chengdu University of Traditional Chinese Medicine, Chengdu</institution>, <addr-line>Sichuan</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu</institution>, <addr-line>Sichuan</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Linfeng Li, Capital Medical University, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Zhixiong Huang, Dalian Nationalities University, China</p>
<p>Zhaojin Fu, Beijing Information Science and Technology University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Yue Luo <email>luoyue&#x00040;cdutcm.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>09</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>12</volume>
<elocation-id>1661984</elocation-id>
<history>
<date date-type="received">
<day>08</day>
<month>07</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>02</day>
<month>09</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2025 Huang and Luo.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Huang and Luo</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Medical image segmentation provides semantic information about lesion objects but tends to ignore edge and texture details in the lesion region. Conversely, medical image reconstruction supplies detailed object information through self-supervised learning, which can facilitate semantic segmentation. The two tasks are therefore complementary. Based on this observation, we propose a multi-interactive feature embedding learning framework for medical image segmentation. In the medical image reconstruction task, we aim to generate detailed feature representations containing rich textures, edges, and structures, thus recovering the low-level details lost from segmentation features. In particular, we propose an adaptive feature modulation module that efficiently aggregates foreground and background features to obtain a comprehensive feature representation. In the medical image segmentation task, we propose a bi-directional fusion module that fuses the complementary information between the two tasks. In addition, we introduce a multi-branch vision Mamba to capture structural information at different scales, thus enhancing model adaptation to different lesion regions. Extensive experiments on four datasets demonstrate the effectiveness of our framework.</p></abstract>
<kwd-group>
<kwd>medical image segmentation</kwd>
<kwd>self-supervised learning</kwd>
<kwd>adaptive feature modulation module</kwd>
<kwd>bi-directional fusion module</kwd>
<kwd>multi-branch vision mamba</kwd>
</kwd-group>
<counts>
<fig-count count="14"/>
<table-count count="6"/>
<equation-count count="17"/>
<ref-count count="60"/>
<page-count count="16"/>
<word-count count="9400"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Dermatology</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Medical image segmentation tasks (<xref ref-type="bibr" rid="B1">1</xref>&#x02013;<xref ref-type="bibr" rid="B5">5</xref>) focus on extracting lesion regions from complex medical images, thereby assisting doctors in subsequent disease diagnosis, treatment planning, and efficacy assessment. In particular, skin lesion segmentation and cell boundary detection enable precise localization of key tissues or lesions, which supports early diagnosis and clinical decision support by visualizing lesion results (<xref ref-type="bibr" rid="B6">6</xref>). In public health management, deep learning-based medical image segmentation methods can therefore improve the efficiency of lesion detection across patient populations. These methods can help public health departments better monitor and predict disease spread, thereby promoting disease prevention and treatment.</p>
<p>Existing medical segmentation methods (<xref ref-type="bibr" rid="B7">7</xref>, <xref ref-type="bibr" rid="B8">8</xref>) construct complex network structures to improve performance, but ignore texture and boundary details of lesion regions in medical images. U-Net (<xref ref-type="bibr" rid="B9">9</xref>) introduces an encoder-decoder structure and designs skip connections to combine semantic information from different levels. UNet&#x0002B;&#x0002B; (<xref ref-type="bibr" rid="B10">10</xref>) adds dense skip pathways and nested decoders to enhance multi-scale feature learning. MFSNet (<xref ref-type="bibr" rid="B11">11</xref>) combines multi-scale feature extraction and attention mechanisms, which further improves segmentation performance. However, the medical image segmentation task emphasizes extracting high-level semantic features, resulting in the loss of pixel-level detail information. In contrast, the medical image reconstruction task can provide pixel-level detail information (e.g., texture and boundaries) to the segmentation task through a self-supervised learning strategy, thus yielding more accurate segmentation results.</p>
<p>Moreover, since convolutional neural network (CNN)-based segmentation methods (<xref ref-type="bibr" rid="B12">12</xref>&#x02013;<xref ref-type="bibr" rid="B14">14</xref>) rely on local receptive fields and convolutional structures, they struggle to capture the non-local relations and structurally ambiguous features present in the lesion region. Therefore, Transformer-based segmentation methods (<xref ref-type="bibr" rid="B15">15</xref>&#x02013;<xref ref-type="bibr" rid="B18">18</xref>) aim to improve global context modeling, thus enhancing semantic consistency and regional integrity. For example, TransUNet (<xref ref-type="bibr" rid="B19">19</xref>) combines the local feature learning of CNNs with the global context learning of Transformers. TransFuse (<xref ref-type="bibr" rid="B20">20</xref>) designs a two-branch network to capture local and global features, and then fuses them using a fusion module in the decoding stage. This architectural design enhances the model&#x00027;s capability to capture fine-grained boundaries and structural information, thereby improving segmentation accuracy. Although Transformer-based methods can help to recognize organ contours, lesion shapes, and spatial layouts by capturing long-range dependencies in medical images through a self-attention mechanism, they incur high computational and memory costs. Compared with Transformer-based architectures, Mamba (<xref ref-type="bibr" rid="B21">21</xref>, <xref ref-type="bibr" rid="B22">22</xref>) offers lower computational overhead while maintaining strong long-sequence modeling and structural awareness. This is especially valuable in medical image segmentation, where accurate delineation of anatomical structures requires modeling long-range dependencies and preserving fine-grained spatial details.
By efficiently extracting spatially hierarchical features, Mamba supports real-time and resource-constrained applications while maintaining precise boundary segmentation.</p>
<p>In this paper, we propose a multi-interactive feature embedding learning (MFEL) framework for medical image segmentation. Specifically, MFEL consists of feature interaction-driven image reconstruction (FIIR) and feature-embedded representation image segmentation (FRIS). On the one hand, FIIR reconstructs the foreground image, background image, and medical image through self-supervised learning, thus extracting complete pixel-level features. In particular, an adaptive feature modulation module enhances the foreground and background feature representations via learned modulation parameters, thereby obtaining more comprehensive and fine-grained pixel-level feature information. On the other hand, FRIS fuses the features of different levels from the reconstruction and segmentation tasks, thereby improving segmentation performance. In particular, a bi-directional fusion module is designed to fuse the feature representations from the two tasks, which enhances their information interaction. Moreover, a multi-branch vision Mamba uses a parallel branching structure and linear state space modeling to improve the model&#x00027;s semantic understanding of different lesion regions.</p>
<p>Our contributions can be summarized as follows:</p>
<list list-type="bullet">
<list-item><p>We explore an MFEL framework that couples the medical image reconstruction task with the medical image segmentation task and achieves superior segmentation performance.</p></list-item>
<list-item><p>An adaptive feature modulation module is proposed to construct modulation parameters from foreground and background features, thus obtaining a comprehensive pixel-level feature representation.</p></list-item>
<list-item><p>A bi-directional fusion module is introduced to establish complementary relationships between structural details and deep semantics, thus enabling feature information interaction between the two tasks at different levels.</p></list-item>
<list-item><p>A multi-branch vision Mamba is designed to combine state-space modeling with a multi-branch parallel mechanism, efficiently modeling the multi-scale structural information of lesion regions.</p></list-item>
</list></sec>
<sec id="s2">
<title>2 Related work</title>
<sec>
<title>2.1 Medical image segmentation methods</title>
<p>Convolutional neural networks (CNNs) have achieved remarkable success in medical image segmentation by leveraging hierarchical representations and strong inductive biases (<xref ref-type="bibr" rid="B23">23</xref>&#x02013;<xref ref-type="bibr" rid="B26">26</xref>). Recent methods enhance segmentation performance by integrating boundary cues and multi-scale features. DCSAU-Net (<xref ref-type="bibr" rid="B27">27</xref>) introduces a split attention mechanism with semantic retention, while U-Net v2 (<xref ref-type="bibr" rid="B28">28</xref>) incorporates boundary information to refine local detail representations. Transformer-based architectures address CNNs&#x00027; limitations in modeling long-range dependencies. These models exhibit strong global context awareness and have demonstrated competitive performance in medical image segmentation (<xref ref-type="bibr" rid="B19">19</xref>, <xref ref-type="bibr" rid="B29">29</xref>&#x02013;<xref ref-type="bibr" rid="B35">35</xref>). CASF-Net (<xref ref-type="bibr" rid="B36">36</xref>) employs dual-branch modeling to combine global semantics and fine-grained features. CSWin-UNet (<xref ref-type="bibr" rid="B18">18</xref>) utilizes cross-shaped window attention to improve spatial interactions with low computational cost. Hybrid designs, such as TBConvL-Net (<xref ref-type="bibr" rid="B37">37</xref>) and MobileUNETR (<xref ref-type="bibr" rid="B38">38</xref>), further balance local detail extraction and global reasoning. Because medical image segmentation, as a high-level vision task, focuses on extracting semantic structural information, pixel-level details tend to be ignored. In contrast, we introduce a medical image reconstruction task to learn fine-grained feature representations through self-supervised learning, thus compensating for this shortcoming of the semantic segmentation task.</p></sec>
<sec>
<title>2.2 Self-supervised learning methods</title>
<p>Self-supervised learning methods have been widely applied in tasks such as image reconstruction (<xref ref-type="bibr" rid="B39">39</xref>, <xref ref-type="bibr" rid="B40">40</xref>), inpainting (<xref ref-type="bibr" rid="B41">41</xref>&#x02013;<xref ref-type="bibr" rid="B43">43</xref>), and enhancement (<xref ref-type="bibr" rid="B44">44</xref>, <xref ref-type="bibr" rid="B45">45</xref>). For example, Self-path (<xref ref-type="bibr" rid="B46">46</xref>) introduces a region-aware contrastive learning framework, which enforces consistency between local and global representations. This strategy effectively enhances feature discrimination and contextual modeling for downstream segmentation tasks. DSFormer (<xref ref-type="bibr" rid="B47">47</xref>), designed for multi-contrast MRI reconstruction, proposes a dual-domain self-supervised Transformer architecture. It performs joint reconstruction and context restoration in both k-space and image space, achieving collaborative modeling of structural information and significantly improving reconstruction quality and generalization. MiM (<xref ref-type="bibr" rid="B48">48</xref>) targets 3D medical image analysis by proposing a hierarchical Mask-in-Mask masking mechanism. Through a coarse-to-fine masking strategy combined with residual reconstruction, it guides the model to learn rich semantic structures and fine spatial details, thereby improving its adaptability to downstream tasks such as segmentation and classification. In contrast to these methods, our medical image reconstruction task guides the model to focus on pixel-level content (e.g., texture, structure, edges), thus compensating for the loss of important details in the cell and skin lesion segmentation tasks.</p></sec>
<sec>
<title>2.3 Vision mamba</title>
<p>Mamba (<xref ref-type="bibr" rid="B21">21</xref>) is a novel sequence modeling architecture built upon State Space Models. It enables efficient inference while modeling long-range dependencies. Unlike traditional self-attention mechanisms, Mamba uses input-dependent (selective) state space parameters and processes the sequence with a recurrent scan. This design supports global modeling while significantly reducing computational complexity, with time and memory costs that scale linearly with sequence length. VMamba (<xref ref-type="bibr" rid="B22">22</xref>) extends Mamba to vision tasks by introducing a 2D Selective Scan mechanism, which aggregates spatial context from multiple directions with linear complexity, achieving superior accuracy and efficiency over Vision Transformers. Compared with CNNs, Mamba is not limited by local receptive fields and can capture global sequential and contextual information. Compared with Transformers, Mamba avoids the high computational overhead of self-attention in long sequences, achieving better efficiency and performance. These advantages make Mamba particularly suitable for high-resolution or 3D medical image tasks. In medical image segmentation (<xref ref-type="bibr" rid="B49">49</xref>&#x02013;<xref ref-type="bibr" rid="B51">51</xref>), accurately capturing the spatial structure and contextual relationships of lesions is critical for performance. Mamba&#x00027;s strength in long-range modeling and computational efficiency provides strong support for this task. Recently, Mamba has been increasingly applied in medical scenarios. U-VM-UNet (<xref ref-type="bibr" rid="B52">52</xref>) integrates sparse gating and low-rank decomposition to design an efficient visual selective scan module, achieving strong segmentation results across datasets. Mamba-Sea (<xref ref-type="bibr" rid="B53">53</xref>) proposes a global-to-local sequence augmentation mechanism and builds a pure SSM-based framework, improving generalization in cross-domain segmentation tasks. 
VM-UNetV2 (<xref ref-type="bibr" rid="B54">54</xref>) combines Vision Mamba with the UNet v2 (<xref ref-type="bibr" rid="B28">28</xref>) architecture and introduces a semantic and detail injection module, showing better performance than conventional models in skin and polyp segmentation. SMM-UNet (<xref ref-type="bibr" rid="B55">55</xref>) constructs selective and multi-scale fusion Mamba modules to enhance feature representation at different scales while keeping the network compact. CAMS (<xref ref-type="bibr" rid="B56">56</xref>) completely removes convolution and attention mechanisms, adopting a pure Mamba encoder and dual decoder structure to balance global modeling and fine-grained detail recovery in cardiac image segmentation. Therefore, we adopt a multi-branch Mamba structure to establish long-range dependencies, capturing global relationships and effectively aggregating contextual information, thus enhancing the global semantic representation.</p></sec></sec>
<sec sec-type="methods" id="s3">
<title>3 Methods</title>
<p>The medical image segmentation task aims to extract semantic information about the lesion object, but ignores pixel-level detail information. In contrast, the medical image reconstruction task focuses on mining low-level content information. Therefore, we combine the medical image reconstruction task and the medical image segmentation task and optimize them jointly to improve segmentation performance. Our MFEL framework is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>; it includes feature interaction-driven image reconstruction (FIIR) and feature-embedded representation image segmentation (FRIS). The specific details are as follows.</p>
<fig position="float" id="F1">
<label>Figure 1</label>
<caption><p>The overall framework illustration of the proposed MFEL. MFEL consists of FIIR and FRIS. FIIR aims to extract pixel-level features through self-supervised learning, thus helping the segmentation task to obtain finer-grained information. FRIS fuses semantic segmentation features and fine-grained reconstruction features to generate a more comprehensive feature representation.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0001.tif">
<alt-text>Diagram of a multi-stage image segmentation neural network with modules for foreground processing, background processing, and segmentation. The network features three encoding paths: foreground (blue), reconstruction (yellow), and background (green), leading to segmentation (purple). Each stage includes specific encoding and decoding units. Insets display input images and segmentation results. Annotations define symbols and modules like adaptive feature modulation and fusion modules.</alt-text>
</graphic>
</fig>
<sec>
<title>3.1 Feature interaction-driven image reconstruction</title>
<p>FIIR employs self-supervised learning to obtain fine-grained feature representations, thereby enhancing the segmentation feature representations. It consists of three components: foreground image reconstruction (FIR), background image reconstruction (BIR), and medical image reconstruction (MIR). Specifically, FIR generates the foreground feature, BIR provides the background feature, and MIR obtains the fine pixel-level feature. The foreground feature contains the key object information (e.g., edges, textures, structures), and the background feature contains the lesion-irrelevant surrounding information. In this way, the two features can enhance the pixel-level fine-grained feature representation during medical image reconstruction.</p>
<sec>
<title>3.1.1 Foreground and background feature extraction</title>
<p>The medical image <italic>I</italic><sub><italic>s</italic></sub> is first divided into a foreground image <italic>I</italic><sub><italic>f</italic></sub> and a background image <italic>I</italic><sub><italic>b</italic></sub> by the segmentation mask <italic>I</italic><sub><italic>m</italic></sub>, which labels foreground pixels as 1 and background pixels as 0. Therefore, <italic>I</italic><sub><italic>f</italic></sub> and <italic>I</italic><sub><italic>b</italic></sub> can be formulated as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02297;</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02297;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
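<p>As a minimal illustration of Equation 1, the sketch below splits a toy single-channel image with a binary mask. It assumes that &#x02297; denotes element-wise multiplication; the function name <italic>split_by_mask</italic> and the nested-list image layout are hypothetical choices for this example, not part of the proposed implementation.</p>
<preformat>
```python
def split_by_mask(image, mask):
    """Return (foreground, background) following Eq. 1.

    image: H x W nested list of pixel intensities (I_s).
    mask:  H x W nested list with 1 for foreground and 0 for background (I_m).
    The element-wise product is an assumed reading of the operator in Eq. 1.
    """
    foreground = [[p * m for p, m in zip(img_row, mask_row)]
                  for img_row, mask_row in zip(image, mask)]
    background = [[p * (1 - m) for p, m in zip(img_row, mask_row)]
                  for img_row, mask_row in zip(image, mask)]
    return foreground, background


# Toy 2 x 2 example: the two outputs partition the input pixels.
image = [[0.2, 0.8], [0.5, 0.1]]
mask = [[0, 1], [1, 0]]
fg, bg = split_by_mask(image, mask)
```
</preformat>
<p>Because the mask is binary, every pixel appears in exactly one of the two outputs, so adding them element-wise recovers the input image.</p>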
<p>Then, <italic>I</italic><sub><italic>f</italic></sub> and <italic>I</italic><sub><italic>b</italic></sub> are respectively fed into the foreground encoder <inline-formula><mml:math id="M2"><mml:msubsup><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the background encoder <inline-formula><mml:math id="M3"><mml:msubsup><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> to extract the foreground feature <inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the background feature <inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, where <italic>i</italic> &#x0003D; 1, 2, 3 denotes the layer index. 
Finally, <inline-formula><mml:math id="M6"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M7"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> are input to the foreground decoder <italic>D</italic><sub><italic>f</italic></sub> and the background decoder <italic>D</italic><sub><italic>b</italic></sub> to reconstruct the foreground image <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the background image <inline-formula><mml:math id="M9"><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. The self-supervised foreground reconstruction loss <italic>L</italic><sub><italic>f</italic></sub> and self-supervised background reconstruction loss <italic>L</italic><sub><italic>b</italic></sub> focus on extracting foreground and background information of the medical segmentation image, which can be formulated as:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where ||&#x000B7;||<sub>1</sub> represents the <italic>l</italic><sub>1</sub> norm.</p></sec>
<sec>
<title>3.1.2 Pixel-level fine-grained feature generation</title>
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, <inline-formula><mml:math id="M11"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M12"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> from each layer are fed into the adaptive feature modulation module <inline-formula><mml:math id="M13"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, thereby helping MIR to obtain more salient foreground and background features. Specifically, <inline-formula><mml:math id="M14"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula><mml:math id="M15"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, and the initial pixel-level feature <inline-formula><mml:math id="M16"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> are first concatenated along the channel dimension to generate the fusion feature <italic>F</italic><sub><italic>u</italic></sub>, and global semantic features are then extracted from <italic>F</italic><sub><italic>u</italic></sub> by global average pooling. 
Further, we utilize dual-stream convolutional blocks to generate calibration parameters &#x003B1; and &#x003B2; to guide <inline-formula><mml:math id="M17"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. This calibration process can be represented as:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
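<p>The calibration in Equation 3 can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the authors&#x00027; implementation: each of the two convolutional streams is replaced by a single linear map on the pooled vector, a sigmoid is applied to &#x003B1; only, and the names <italic>modulate</italic>, <italic>w_alpha</italic>, and <italic>w_beta</italic> are hypothetical.</p>
<preformat>
```python
import math


def global_average_pool(feature):
    """Average each channel's H x W map to a single value."""
    return [sum(map(sum, channel)) / (len(channel) * len(channel[0]))
            for channel in feature]


def linear(weights, vector):
    """Matrix-vector product; an assumed stand-in for one convolutional stream."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def modulate(f_r, f_f, f_b, w_alpha, w_beta):
    """Calibrate f_r as (1 + alpha) * f_r + beta, following Eq. 3.

    Each feature is a C x H x W nested list. w_alpha and w_beta are
    C x 3C weight matrices (hypothetical parameters of the two streams).
    """
    fused = f_r + f_f + f_b                 # channel-wise concatenation, giving F_u
    pooled = global_average_pool(fused)     # global semantic descriptor, length 3C
    # Assumption: sigmoid on alpha only; the text does not fix where the activation sits.
    alpha = [1 / (1 + math.exp(-a)) for a in linear(w_alpha, pooled)]
    beta = linear(w_beta, pooled)
    # One (alpha, beta) pair per channel, applied to every pixel of that channel.
    return [[[(1 + a) * x + b for x in row] for row in channel]
            for channel, a, b in zip(f_r, alpha, beta)]


# Toy example: one channel, 1 x 2 maps, all-zero stream weights.
f_r = [[[2.0, 4.0]]]
f_f = [[[1.0, 1.0]]]
f_b = [[[0.0, 2.0]]]
zero_weights = [[0.0, 0.0, 0.0]]
calibrated = modulate(f_r, f_f, f_b, zero_weights, zero_weights)
```
</preformat>
<p>With all-zero weights the sigmoid gives &#x003B1; = 0.5 and &#x003B2; = 0, so the toy feature is simply scaled by 1.5. The residual form (1 &#x0002B; &#x003B1;) keeps the original feature as the baseline and lets the learned parameters adjust it.</p>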
<fig position="float" id="F2">
<label>Figure 2</label>
<caption><p>Architecture of adaptive feature modulation module.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0002.tif">
<alt-text>Flowchart of a neural network architecture. Features F_r^i, F_f^i, and F_b^i are concatenated, followed by global average pooling (GAP) and convolution layers. Outputs are combined using weighted sums with parameters &#x003B1; and &#x003B2; and activation function &#x003C3;, resulting in the calibrated feature F_r^i.</alt-text>
</graphic>
</fig>
<p>Next, the calibrated pixel-level fine-grained feature <inline-formula><mml:math id="M19"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is fed into the reconstruction decoder to reconstruct the medical image. Finally, the medical image reconstruction loss <inline-formula><mml:math id="M20"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> ensures that the pixel-level fine-grained features can reconstruct the complete medical image, which can be expressed as follows:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>||</mml:mo><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>I</italic><sub><italic>s</italic></sub> denotes a medical image, and <inline-formula><mml:math id="M22"><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> represents a reconstructed medical image.</p>
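<p>As a small worked example of Equation 4, the sketch below evaluates the L1 norm between a toy reconstruction and its target. The images are hypothetical 2 &#x000D7; 2 arrays stored as nested lists, and the function name is an illustrative assumption. The code sums the absolute differences exactly as the norm is written; deep learning frameworks often average over pixels instead, which only rescales the loss.</p>
<preformat>
```python
def l1_reconstruction_loss(reconstructed, target):
    """Eq. 4: sum of absolute pixel differences (the L1 norm)."""
    return sum(
        abs(r - t)
        for rec_row, tgt_row in zip(reconstructed, target)
        for r, t in zip(rec_row, tgt_row)
    )


# Hypothetical 2 x 2 images with intensities in [0, 1].
target = [[0.0, 0.5], [1.0, 0.25]]
reconstructed = [[0.0, 0.25], [0.5, 0.25]]

loss_s = l1_reconstruction_loss(reconstructed, target)
print(loss_s)  # 0.75
```
</preformat>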
</sec></sec>
<sec>
<title>3.2 Feature representation reinforcement learning</title>
<sec>
<title>3.2.1 Bi-directional fusion module via hierarchical guidance</title>
<p>In Section 3.1, we obtain pixel-level fine-grained feature <inline-formula><mml:math id="M23"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> from FIIR. Specifically, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>, <italic>I</italic><sub><italic>s</italic></sub> is first fed into the segmentation encoder <inline-formula><mml:math id="M24"><mml:msubsup><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> to extract the segmentation semantic feature <inline-formula><mml:math id="M25"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. Then, <inline-formula><mml:math id="M26"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M27"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> are input to the bi-directional fusion module <inline-formula><mml:math id="M28"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> to obtain a complete feature representation with strong semantics and rich details. 
Concretely, we compute the cross-attention weights between <inline-formula><mml:math id="M29"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M30"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> in both directions, thereby jointly modeling the complementary relationship between the reconstruction branch and the semantic branch. In this process, <inline-formula><mml:math id="M31"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> uses semantic clues to guide <inline-formula><mml:math id="M32"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> to enhance structural perception, while <inline-formula><mml:math id="M33"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> employs textural details to enhance the spatial resolution of <inline-formula><mml:math id="M34"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. 
First, <inline-formula><mml:math id="M35"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> generates the query vector <italic>Q</italic><sub><italic>s</italic></sub> and <inline-formula><mml:math id="M36"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> generates the key-value pair (<italic>K</italic><sub><italic>r</italic></sub>, <italic>V</italic><sub><italic>r</italic></sub>). Similarly, <italic>Q</italic><sub><italic>r</italic></sub> is obtained from <inline-formula><mml:math id="M37"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, and (<italic>K</italic><sub><italic>s</italic></sub>, <italic>V</italic><sub><italic>s</italic></sub>) is generated from <inline-formula><mml:math id="M38"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. The two-stream cross-modal attention is computed as follows:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M39"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mo class="qopname">Attn</mml:mo></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo class="qopname">Softmax</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mo class="qopname">Attn</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo class="qopname">Softmax</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<fig position="float" id="F3">
<label>Figure 3</label>
<caption><p>Architecture of bi-directional fusion module.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0003.tif">
<alt-text>Flowchart diagram illustrating a multi-head cross-attention (MHCA) mechanism. Two input blocks, labeled Flr and Fls, pass through normalization (Norm) before being split into query (Q), key (K), and value (V) components. These components are processed by the MHCA module. The outputs, depicted as tildeFlr and tildeFls, are shown after merging. Each block and arrow is color-coded, indicating different processing stages.</alt-text>
</graphic>
</fig>
<p>where <inline-formula><mml:math id="M40"><mml:mi>d</mml:mi></mml:math></inline-formula> is the channel dimension of each attention head and <italic>Softmax</italic> denotes the softmax function. Attn<sub><italic>r</italic></sub> denotes the segmentation-guided attention map, and Attn<sub><italic>s</italic></sub> represents the reconstruction-guided attention map. Subsequently, we adopt feature aggregation and residual connections to generate two enhanced features <inline-formula><mml:math id="M41"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M42"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, which can be formulated as:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M43"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo class="qopname">Attn</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo class="qopname">&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo class="qopname">Attn</mml:mo></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M44"><mml:msubsup><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes a convolutional layer with a 1 &#x000D7; 1 kernel. Finally, we fuse <inline-formula><mml:math id="M45"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M46"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> through concatenation and convolution operations to generate the refined segmentation feature <inline-formula><mml:math id="M47"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, which strengthens the feature representation.</p>
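<p>The bi-directional attention of Equations 5 and 6 can be sketched in plain Python as follows. This is a single-head toy example under several assumptions that are not taken from the paper: identity matrices stand in for the learned query, key, and value projections and for the 1 &#x000D7; 1 convolution, both features are flattened to three tokens with two channels, and all helper names are hypothetical. The sketch pairs Attn<sub><italic>s</italic></sub> with <italic>V</italic><sub><italic>r</italic></sub> and Attn<sub><italic>r</italic></sub> with <italic>V</italic><sub><italic>s</italic></sub>, which is our reading of Equation 6 and requires both branches to have the same number of tokens.</p>
<preformat>
```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_map(q, k):
    """Eq. 5: Softmax(Q K^T / sqrt(d)), with d the per-head channel dimension."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy features: 3 tokens (flattened pixels) with d = 2 channels each.
f_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # segmentation semantic feature
f_r = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]   # calibrated fine-grained feature

# Assumption: identity weights replace the learned Q/K/V projections and the
# 1 x 1 convolution, so Q = K = V = the feature itself.
identity = [[1.0, 0.0], [0.0, 1.0]]
q_s = k_s = v_s = matmul(f_s, identity)
q_r = k_r = v_r = matmul(f_r, identity)

attn_r = attention_map(q_s, k_r)   # segmentation-guided attention map
attn_s = attention_map(q_r, k_s)   # reconstruction-guided attention map

# Eq. 6 as read here: aggregate the values, apply the (identity) 1 x 1
# convolution, then add the residual input of the same branch.
f_r_tilde = add(matmul(matmul(attn_s, v_r), identity), f_r)
f_s_tilde = add(matmul(matmul(attn_r, v_s), identity), f_s)

print(attn_r[0], f_r_tilde[0])
```
</preformat>
<p>Each row of an attention map sums to one, so the aggregated term is a convex combination of value tokens and the residual keeps the original feature intact when the attention contributes little.</p>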
</sec>
<sec>
<title>3.2.2 High-level semantic feature mining via multi-branch vision mamba</title>
<p>Multi-branch vision mamba is constructed to force the model to mine high-level semantic information, thus improving the feature representation. Specifically, as shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, <inline-formula><mml:math id="M48"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is first fed into <italic>E</italic><sub><italic>M</italic></sub>, which flattens and normalizes it to generate the segmentation sequence feature <italic>N</italic><sub><italic>s</italic></sub>. Then, we divide <italic>N</italic><sub><italic>s</italic></sub> into four groups to learn the important representations of different sub-regions, which can be represented as:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M49"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>p</mml:mi><mml:mi>l</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<fig position="float" id="F4">
<label>Figure 4</label>
<caption><p>Architecture of multi-branch vision mamba.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0004.tif">
<alt-text>Flowchart of a neural network architecture with four &#x0201C;Visual Mamba blocks&#x0201D; processing input N_s. Outputs from the blocks are concatenated and passed through a &#x0201C;LN &#x00026; Linear&#x0201D; step. An additional process involving calculations &#x003B3;,MNsj, and Nsj occurs in a separate module, contributing to the final output N.</alt-text>
</graphic>
</fig>
<p>Next, each subsequence is fed into the weight-sharing Mamba module for state modeling, and its representation is then refined through a scaled residual connection:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M50"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="script">M</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>4</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B3; is a scaling factor and <inline-formula><mml:math id="M51"><mml:mrow><mml:mi mathvariant="script">M</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the Mamba function. Then, the updated subsequences <inline-formula><mml:math><mml:msubsup><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> are concatenated along the channel dimension, thus generating the enhanced sequence representation:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M52"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>Concat</italic>(&#x000B7;) denotes the feature concatenation operation. Subsequently, <inline-formula><mml:math id="M53"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is normalized and linearly projected back to the original feature dimension, which can be expressed as:</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M54"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>LN</italic>(&#x000B7;) represents layer normalization, and <italic>Proj</italic>(&#x000B7;) indicates linear projection.</p>
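<p>The pipeline of Equations 7 to 10 can be sketched as below. This is a toy illustration under stated assumptions, not the actual module: the function toy_state_space is a fixed linear recurrence that merely stands in for the Mamba block, the split is taken along the channel axis because Equation 9 concatenates along that axis, the layer normalization has no learnable affine parameters, and the projection uses identity weights. All function names are hypothetical.</p>
<preformat>
```python
import math


def split_channels(seq, groups=4):
    """Eq. 7: split each token's channels into equal groups (assumed axis)."""
    width = len(seq[0]) // groups
    return [[tok[g * width:(g + 1) * width] for tok in seq] for g in range(groups)]


def toy_state_space(seq, decay=0.5):
    """Placeholder for the Mamba block: a fixed linear recurrence
    h_t = decay * h_(t-1) + (1 - decay) * x_t scanned along the tokens."""
    state = [0.0] * len(seq[0])
    out = []
    for tok in seq:
        state = [decay * h + (1.0 - decay) * x for h, x in zip(state, tok)]
        out.append(state)
    return out


def layer_norm(seq, eps=1e-5):
    """Normalize every token to zero mean and unit variance over channels."""
    out = []
    for tok in seq:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / math.sqrt(var + eps) for v in tok])
    return out


def linear(seq, weight):
    """Token-wise linear projection; weight rows index the input channels."""
    return [[sum(v * w for v, w in zip(tok, col)) for col in zip(*weight)]
            for tok in seq]


def multi_branch(seq, gamma=1.0):
    branches = []
    for part in split_channels(seq):                       # Eq. 7
        mixed = toy_state_space(part)                      # same block for all branches
        branches.append([[m + gamma * x for m, x in zip(mt, xt)]
                         for mt, xt in zip(mixed, part)])  # Eq. 8
    merged = [sum((b[t] for b in branches), []) for t in range(len(seq))]  # Eq. 9
    channels = len(seq[0])
    identity = [[float(i == j) for j in range(channels)] for i in range(channels)]
    return linear(layer_norm(merged), identity)            # Eq. 10


# Toy sequence: 3 tokens with 8 channels (two channels per branch).
n_s = [[float(t + c) for c in range(8)] for t in range(3)]
n_out = multi_branch(n_s)
print(len(n_out), len(n_out[0]))
```
</preformat>
<p>Because every branch calls the same function, the four groups share one set of parameters, and each branch only scans a quarter of the channels, which is what keeps the multi-branch design lightweight in this sketch.</p>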
<p>In summary, we utilize the state-space mechanism of Mamba to capture long-distance contextual information, and the multi-branch decomposition enhances the feature representation across different sub-regions. In this way, multi-branch vision mamba establishes the dependency between global semantics and local details, thus helping the model to improve the segmentation accuracy of key objects.</p></sec></sec>
<sec>
<title>3.3 Model training</title>
<sec>
<title>3.3.1 Image reconstruction head</title>
<p>To constrain the pixel-level difference between the reconstructed image and the segmentation image, the image reconstruction heads <italic>D</italic><sub><italic>f</italic></sub>, <italic>D</italic><sub><italic>r</italic></sub>, and <italic>D</italic><sub><italic>b</italic></sub> adopt the reconstruction loss <inline-formula><mml:math id="M55"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, which is defined as follows:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M56"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M57"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula><mml:math id="M58"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M59"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denote the medical image reconstruction loss, the foreground reconstruction loss, and the background reconstruction loss, respectively.</p></sec>
<sec>
<title>3.3.2 Semantic segmentation head</title>
<p>The BCE loss <inline-formula><mml:math id="M60"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> penalizes per-pixel classification errors. The Dice loss <inline-formula><mml:math id="M61"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> measures the overall overlap between the prediction mask <inline-formula><mml:math id="M62"><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the ground truth <italic>I</italic><sub><italic>gt</italic></sub>. Thus, we jointly use <inline-formula><mml:math id="M63"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M64"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> to constrain the segmentation head <italic>S</italic><sub><italic>h</italic></sub>, which can be expressed as:</p>
<disp-formula id="E12"><label>(12)</label><mml:math id="M65"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>i</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
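<p>As a minimal sketch of Equation 12, the snippet below combines a binary cross-entropy term and a soft Dice term on a flattened toy mask. The exact Dice formulation, the smoothing constant, and the clamping value are assumptions, since they are not specified here; the function names are hypothetical, and predictions are assumed to be probabilities in [0, 1].</p>
<preformat>
```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over pixels; pred holds probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # clamp to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + s) / (sum(pred) + sum(target) + s)."""
    overlap = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * overlap + smooth) / (sum(pred) + sum(target) + smooth)


def mask_loss(pred, target):
    """Eq. 12: unweighted sum of the BCE and Dice terms."""
    return bce_loss(pred, target) + dice_loss(pred, target)


# Flattened toy mask with four pixels (hypothetical values).
ground_truth = [1.0, 1.0, 0.0, 0.0]
good_pred = [0.9, 0.8, 0.1, 0.2]
poor_pred = [0.4, 0.3, 0.6, 0.7]

print(mask_loss(good_pred, ground_truth), mask_loss(poor_pred, ground_truth))
```
</preformat>
<p>The two terms are complementary: the BCE term scores every pixel independently, whereas the Dice term depends only on region overlap and is therefore less sensitive to the imbalance between small lesions and large backgrounds.</p>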
<p>Finally, the total training loss can be expressed as:</p>
<disp-formula id="E13"><label>(13)</label><mml:math id="M66"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>s</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></sec></sec></sec>
<sec id="s4">
<title>4 Experiments</title>
<p>In this section, we present a comprehensive overview of our experiments. We begin by introducing the datasets used in the study, followed by detailed descriptions of the experimental settings and implementation details. We then report the results of comparison experiments against state-of-the-art methods. In addition, we perform ablation studies to assess the impact of each key component. These experiments are designed to validate the effectiveness of the proposed method and to provide insights into the contribution of different modules to the overall performance.</p>
<sec>
<title>4.1 Experimental settings</title>
<sec>
<title>4.1.1 Datasets</title>
<p><bold>GLAS</bold> (<xref ref-type="bibr" rid="B57">57</xref>) dataset consists of 165 microscopy images of colorectal adenocarcinoma tissue sections at stage T3 or T4, stained with H&#x00026;E. Each image has a resolution of 128 &#x000D7; 128 pixels and is collected from a different patient. Due to variations in cancer progression, the lesions exhibit significant differences in shape and distribution. Meanwhile, since all samples originate from the same type of tissue, the surrounding environments are relatively consistent. Additionally, some cells are damaged or ruptured during the sampling process, resulting in large inter-cell variability. These factors make the dataset highly challenging. According to the official split, the training set contains 85 images and the test set contains 80 images. This dataset is mainly used to assess the model&#x00027;s capability in segmenting dense lesion regions and small targets.</p>
<p><bold>ISIC2016</bold> <bold>(</bold><xref ref-type="bibr" rid="B58"><bold>58</bold></xref><bold>)</bold> and <bold>ISIC2017</bold> <bold>(</bold><xref ref-type="bibr" rid="B59"><bold>59</bold></xref><bold>)</bold> datasets were released by the International Skin Imaging Collaboration (ISIC) in 2016 and 2017, respectively. They were used as official datasets for the skin lesion analysis challenges held in those years. The goal of these datasets is to raise global awareness of skin disease diagnosis and to improve the detection of melanoma and other benign or malignant lesions. Both datasets contain a large number of samples and include various types of skin lesions. Due to the diversity of lesion types and the wide range of patient backgrounds, the samples show high variability in texture, color, and structure. In addition, some mild lesions look very similar to normal skin, which makes it hard to identify lesion boundaries. This increases the difficulty of the segmentation task. In this study, we evaluate the segmentation performance of our model using the ISIC2016 and ISIC2017 datasets. Both datasets follow the official training and testing splits: ISIC2016 includes 900 training images and 379 testing images, while ISIC2017 consists of 2,000 training images and 600 testing images. All images are resized to 256 &#x000D7; 256 pixels to ensure consistency during the experiments.</p>
<p><bold>PH2</bold> <bold>(</bold><xref ref-type="bibr" rid="B60"><bold>60</bold></xref><bold>)</bold> is a public dataset designed for dermoscopic image segmentation and classification. It aims to support research on computer-aided diagnosis of melanocytic lesions. The images were collected at the dermatology department of Pedro Hispano Hospital in Portugal. All images were captured under the same conditions using the Tuebinger mole analyzer system with 20 &#x000D7; magnification. The dataset contains 200 dermoscopic images of melanocytic lesions, including 80 common nevi, 80 atypical nevi, and 40 melanomas. PH2 serves as a reliable benchmark for evaluating lesion detection, segmentation, and classification algorithms. In our experiments, all images were resized to 256 &#x000D7; 256 pixels. It is worth noting that we used PH2 as an external validation dataset. We tested it directly using the model trained on ISIC2016 to assess the effectiveness of our method and its potential for future clinical applications.</p></sec>
<sec>
<title>4.1.2 Metrics</title>
<p>In the quantitative analysis, we adopt widely used evaluation metrics in the field of medical image segmentation. Specifically, we use Precision, Recall, F1, and Intersection over Union (IoU) to assess the performance of the proposed model. Here, TP denotes true positives, FP denotes false positives, TN denotes true negatives, and FN denotes false negatives. These metrics jointly provide a comprehensive evaluation of the model&#x00027;s accuracy and completeness from multiple perspectives.</p>
<p>Precision measures the proportion of true positives among all regions predicted as positive (e.g., lesion areas). It reflects the model&#x00027;s ability to control FP. In medical image segmentation, a high Precision means the model can avoid wrongly identifying normal areas as lesions, which helps reduce the risk of misdiagnosis. When Precision is high, most of the predicted lesion regions are actually correct, and the FP rate is low. This is especially important in cases with small lesions or strong background noise. In such situations, Precision is a key metric to evaluate how well the model limits over-segmentation. The formula is given as:</p>
<disp-formula id="E14"><label>(14)</label><mml:math id="M67"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Precision</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Recall measures the model&#x00027;s ability to detect all positive targets. It shows how many of the actual positive pixels are correctly identified. In medical image segmentation, a high Recall means the model can successfully detect most lesion areas, which helps reduce missed detections and is important for clinical diagnosis support. The formula is given as:</p>
<disp-formula id="E15"><label>(15)</label><mml:math id="M68"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">Recall</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>F1 is the harmonic mean of Precision and Recall and therefore summarizes both the accuracy and the completeness of a prediction in a single value. Because the harmonic mean is dominated by the smaller of the two values, F1 is high only when Precision and Recall are both high, which makes it informative when the two differ widely. In segmentation tasks, F1 is especially useful under class imbalance, such as small lesions against large background regions, because it does not depend on TN. The formula is given as:</p>
<disp-formula id="E16"><label>(16)</label><mml:math id="M69"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">F1</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x000B7;</mml:mo><mml:mtext class="textrm" mathvariant="normal">Precision</mml:mtext><mml:mo>&#x000B7;</mml:mo><mml:mtext class="textrm" mathvariant="normal">Recall</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">Precision</mml:mtext><mml:mo>&#x0002B;</mml:mo><mml:mtext class="textrm" mathvariant="normal">Recall</mml:mtext></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
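<p>As a minimal sketch of how Equations 14&#x02013;16 can be evaluated in practice, the following Python snippet counts TP, FP, FN, and TN over a pair of binary masks and derives Precision, Recall, and F1 from these counts. The function names, the toy masks, and the convention of returning 0 when a denominator is zero are illustrative assumptions and are not part of the proposed method.</p>
<preformat><![CDATA[
```python
def confusion_counts(pred, gt):
    """Count TP, FP, FN, TN over two same-sized binary masks (nested lists of 0/1)."""
    tp = fp = fn = tn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1      # lesion predicted, lesion in GT
            elif p:
                fp += 1      # lesion predicted, background in GT
            elif g:
                fn += 1      # background predicted, lesion in GT
            else:
                tn += 1      # background in both
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Equations 14-16; a zero denominator yields 0.0 (a convention assumed here)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Toy 3x4 masks (hypothetical): 1 marks lesion pixels, 0 marks background.
gt = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

tp, fp, fn, tn = confusion_counts(pred, gt)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(precision, recall, f1)
```
]]></preformat>
<p>In this toy case, one lesion pixel is missed and one background pixel is wrongly marked, so Precision, Recall, and F1 all equal 0.75. TN is counted but does not enter any of the three scores.</p>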
<p>IoU is one of the most widely used metrics in image segmentation. It measures the overlap between the predicted region and the ground truth (GT) and is defined as the ratio of their intersection area to their union area. A higher IoU means the predicted region aligns more closely with the GT, indicating better segmentation accuracy. For a single prediction, IoU and F1 are monotonically related (IoU = F1/(2 &#x02212; F1)), so IoU never exceeds F1 and penalizes the same errors more heavily, which makes it a stricter indicator of how well the predicted region is localized. The formula is given as:</p>
<disp-formula id="E17"><label>(17)</label><mml:math id="M70"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">IoU</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></sec>
<sec>
<title>4.1.3 Implementations</title>
<p>We train and run inference on an NVIDIA GeForce RTX 4090 GPU, and the model is implemented in PyTorch. Training runs for 150 epochs with a batch size of 4. The optimizer is Adam, whose momentum-based updates keep the parameter updates stable. We combine a warm-up phase with cosine annealing so that training starts slowly in the early epochs and converges finely in the later ones. The initial learning rate is 1e-3 and gradually decays to 1e-5.</p></sec></sec>
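<p>The learning-rate schedule described above can be sketched as a plain Python function. The 150 epochs and the learning rates of 1e-3 and 1e-5 come from the text; the warm-up length of 5 epochs, the linear warm-up shape, and the function name are assumptions made only for illustration, since these details are not specified here.</p>
<preformat><![CDATA[
```python
import math


def learning_rate(epoch, total_epochs=150, warmup_epochs=5,
                  base_lr=1e-3, min_lr=1e-5):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay.

    total_epochs, base_lr and min_lr follow the text; warmup_epochs=5 and the
    linear warm-up shape are assumptions.
    """
    if epoch < warmup_epochs:
        # Ramp linearly so that base_lr is reached on the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up phase completed, from 0.0 to 1.0.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    # Cosine annealing from base_lr (progress 0) down to min_lr (progress 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(epoch) for epoch in range(150)]
print(schedule[0], schedule[4], schedule[5], schedule[-1])
```
]]></preformat>
<p>Under these assumptions, the rate rises to 1e-3 during the first five epochs and then follows a half cosine down to 1e-5 at the final epoch. In PyTorch, dividing the returned value by the base rate gives a multiplicative factor of the kind that <monospace>torch.optim.lr_scheduler.LambdaLR</monospace> accepts, although other scheduler combinations could realize the same behaviour.</p>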
<sec>
<title>4.2 Comparison with SOTA methods</title>
<p>To ensure a more comprehensive and reliable evaluation of model performance, we compare our method against state-of-the-art (SOTA) approaches from the past four years across different network architectures. These comparisons highlight the advantages of our model. Specifically, we select representative CNN-based methods including MsRED (<xref ref-type="bibr" rid="B25">25</xref>), MFSNet (<xref ref-type="bibr" rid="B11">11</xref>), DCSAU (<xref ref-type="bibr" rid="B27">27</xref>), and U-Net V2 (<xref ref-type="bibr" rid="B28">28</xref>); Transformer-based methods including BAT (<xref ref-type="bibr" rid="B30">30</xref>), FAT-Net (<xref ref-type="bibr" rid="B31">31</xref>), SSFormer (<xref ref-type="bibr" rid="B32">32</xref>), and CASF-Net (<xref ref-type="bibr" rid="B36">36</xref>); and a recent Mamba-based method, U-vm-unet (<xref ref-type="bibr" rid="B52">52</xref>). Extensive comparisons are conducted on four public datasets.</p>
<sec>
<title>4.2.1 GLAS</title>
<sec>
<title>4.2.1.1 Quantitative comparisons</title>
<p>As shown in <xref ref-type="table" rid="T1">Table 1</xref>, we compare several representative methods from recent years on the GLAS dataset. Ours achieves the highest F1, IoU, and Precision and ranks second in Recall, only 0.02 points behind MFSNet (89.18 vs. 89.20). Ours reaches 82.07 in IoU, 1.21 points above the second-best U-Net V2 (80.86), which indicates that the regions predicted by Ours overlap better with the GT and are localized more accurately. The Precision reaches 91.01, 2.78 points above the second-best FAT-Net (88.23), suggesting that Ours effectively reduces false positives, which is important in scenarios where over-segmentation should be avoided. Because the GLAS dataset contains complex gland structures, a high proportion of small targets, and blurry boundaries, IoU and Precision are key indicators of actual segmentation quality. A high Recall can coincide with a lower IoU and Precision, which may be caused by over-segmentation; MFSNet, for example, has the highest Recall (89.20) but a Precision of only 81.70. In contrast, Ours maintains a high Recall together with a high Precision, showing strong boundary modeling ability and overall robustness.</p>
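<p>The link between over-segmentation and the four metrics can be illustrated with a small worked example. The pixel counts below are hypothetical and are not taken from any method in Table 1; they only show, under Equations 14&#x02013;17, how enlarging a predicted region can raise Recall while lowering Precision, F1, and IoU.</p>
<preformat><![CDATA[
```python
def scores(tp, fp, fn):
    """Precision, Recall, F1 and IoU (Equations 14-17) from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou


# Hypothetical gland covering 1000 ground-truth pixels (tp + fn = 1000).
tight = scores(tp=900, fp=60, fn=100)    # conservative prediction
loose = scores(tp=960, fp=300, fn=40)    # enlarged (over-segmented) prediction

for name, result in (("tight", tight), ("loose", loose)):
    print(name, [round(100 * value, 2) for value in result])
```
]]></preformat>
<p>Here the enlarged prediction recovers more of the target (Recall of 96% instead of 90%), but the additional false positives reduce Precision, F1, and IoU at the same time.</p>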
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Quantitative comparison results on the GLAS dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>F1(%)</bold></th>
<th valign="top" align="center"><bold>IoU(%)</bold></th>
<th valign="top" align="center"><bold>Precision(%)</bold></th>
<th valign="top" align="center"><bold>Recall(%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">BAT (<xref ref-type="bibr" rid="B30">30</xref>)</td>
<td valign="top" align="center">83.93</td>
<td valign="top" align="center">73.47</td>
<td valign="top" align="center">84.85</td>
<td valign="top" align="center">84.43</td>
</tr> <tr>
<td valign="top" align="left">FAT-Net (<xref ref-type="bibr" rid="B31">31</xref>)</td>
<td valign="top" align="center">86.45</td>
<td valign="top" align="center">76.14</td>
<td valign="top" align="center"><inline-formula><mml:math id="M71"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>88.23</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">84.75</td>
</tr> <tr>
<td valign="top" align="left">MsRED (<xref ref-type="bibr" rid="B25">25</xref>)</td>
<td valign="top" align="center">85.92</td>
<td valign="top" align="center">75.32</td>
<td valign="top" align="center">87.20</td>
<td valign="top" align="center">84.69</td>
</tr> <tr>
<td valign="top" align="left">MFSNet (<xref ref-type="bibr" rid="B11">11</xref>)</td>
<td valign="top" align="center">86.33</td>
<td valign="top" align="center">75.95</td>
<td valign="top" align="center">81.70</td>
<td valign="top" align="center"><inline-formula><mml:math id="M72"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>89.20</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">SSFormer (<xref ref-type="bibr" rid="B32">32</xref>)</td>
<td valign="top" align="center">71.60</td>
<td valign="top" align="center">59.13</td>
<td valign="top" align="center">74.17</td>
<td valign="top" align="center">74.00</td>
</tr> <tr>
<td valign="top" align="left">CASF-Net (<xref ref-type="bibr" rid="B36">36</xref>)</td>
<td valign="top" align="center">85.83</td>
<td valign="top" align="center">76.08</td>
<td valign="top" align="center">88.05</td>
<td valign="top" align="center">75.20</td>
</tr> <tr>
<td valign="top" align="left">DCSAU (<xref ref-type="bibr" rid="B27">27</xref>)</td>
<td valign="top" align="center">88.28</td>
<td valign="top" align="center">79.03</td>
<td valign="top" align="center">87.67</td>
<td valign="top" align="center">88.32</td>
</tr> <tr>
<td valign="top" align="left">U-vm-unet (<xref ref-type="bibr" rid="B52">52</xref>)</td>
<td valign="top" align="center">82.07</td>
<td valign="top" align="center">69.60</td>
<td valign="top" align="center">74.39</td>
<td valign="top" align="center">86.60</td>
</tr> <tr>
<td valign="top" align="left">U-Net V2 (<xref ref-type="bibr" rid="B28">28</xref>)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M73"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>88.90</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M74"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>80.86</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">87.31</td>
<td valign="top" align="center">89.16</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center"><inline-formula><mml:math id="M75"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>89.73</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M76"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>82.07</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M77"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>91.01</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M78"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>89.18</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Red indicates the best performance, and blue indicates the second best.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>4.2.1.2 Qualitative comparisons</title>
<p><xref ref-type="fig" rid="F5">Figure 5</xref> shows the visual comparison results on the GLAS dataset. In sample (a), the lesion cell has clear boundaries and appears hollow due to structural damage. SSFormer and U-vm-unet make obvious errors in this case, leading to inaccurate boundary prediction and incorrect segmentation of the cell structure. In samples (c) and (d), the lesion boundaries are blurry. Most methods fail to extract the target contours correctly and show severe missegmentation. In contrast, although Ours also shows some boundary errors, it preserves the overall shape of the target more completely.</p>
<fig position="float" id="F5">
<label>Figure 5</label>
<caption><p>Qualitative comparison results on the GLAS dataset. Green regions indicate GT areas missed by the prediction (false negatives), while red regions indicate areas incorrectly predicted as target (false positives).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0005.tif">
<alt-text>Microscopic images of biological tissues compared across different segmentation models. Each row (a-f) shows the original tissue image followed by results from models: BAT, FAT, MsRED, MFSNet, SSFormer, CASF-Net, DCSAU, U-vm-unet, U-Net V2, Ours, and ground truth (GT). The segmentation results highlight distinct regions in black, white, red, and green, illustrating varying model accuracies and segmentations against the ground truth.</alt-text>
</graphic>
</fig>
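<p>The colour coding used in these visual comparisons can be reproduced from a predicted mask and its GT. The snippet below is a sketch under stated assumptions: green and red follow the figure caption, whereas rendering correctly predicted target pixels in white and background in black is assumed, and single characters stand in for colours so that the example stays self-contained.</p>
<preformat><![CDATA[
```python
WHITE, RED, GREEN, BLACK = "W", "R", "G", "."


def error_map(pred, gt):
    """Label each pixel: W = correct target (TP), R = false positive,
    G = missed target (FN), . = correct background (TN)."""
    rows = []
    for pred_row, gt_row in zip(pred, gt):
        row = []
        for p, g in zip(pred_row, gt_row):
            if p and g:
                row.append(WHITE)
            elif p:
                row.append(RED)
            elif g:
                row.append(GREEN)
            else:
                row.append(BLACK)
        rows.append("".join(row))
    return rows


# Toy 3x4 masks (hypothetical): 1 marks target pixels, 0 marks background.
gt = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

overlay = error_map(pred, gt)
print("\n".join(overlay))

# IoU (Equation 17) follows from the class counts: W / (W + R + G).
flat = "".join(overlay)
iou = flat.count(WHITE) / (flat.count(WHITE) + flat.count(RED) + flat.count(GREEN))
print(iou)
```
]]></preformat>
<p>Read this way, the red and green areas in each panel are exactly the FP and FN terms of Equation 17, so a panel with less red and green corresponds to a higher IoU for that sample.</p>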
<p>In samples (e) and (f), the white regions represent the internal cell structure and the external background, respectively. These two samples come from different experimental conditions. For sample (e), methods like U-Net V2 miss part of the structure on the left side and mistakenly classify it as background. In sample (f), these methods show incomplete cell boundaries. In comparison, Ours gives results that are closer to the ground truth in both samples, showing better generalization. However, it is worth noting that Ours still makes a mistake in identifying the cell on the right side of sample (e), which suggests that there is still room to improve robustness across different environments.</p>
<p>To evaluate model performance on small targets and in noisy environments, we conducted local zoom-in comparisons on representative samples, as shown in <xref ref-type="fig" rid="F6">Figure 6</xref>. DCSAU, U-vm-unet, and U-Net V2 often misidentify background textures as lesion regions, especially when boundaries are blurred or targets are irregular. This suggests limited robustness to noise and weak discrimination in challenging cases. In contrast, Ours better distinguishes true lesions from noisy backgrounds and successfully detects small, low-contrast targets. Despite minor boundary errors, it shows stronger resistance to noise and improved sensitivity to fine-grained lesion structures.</p>
<fig position="float" id="F6">
<label>Figure 6</label>
<caption><p>Detail comparison on the GLAS dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0006.tif">
<alt-text>Comparison of medical imaging results using different methods. The left column shows the original tissue images in pink and purple hues. Subsequent columns labeled DCSAU, U-vm-unet, U-Net V2, and Ours display segmented results, highlighting areas in white, red, and green on a black background. Regions in blue boxes show zoomed-in areas for detailed analysis.</alt-text>
</graphic>
</fig>
</sec></sec>
<sec>
<title>4.2.2 ISIC2016</title>
<sec>
<title>4.2.2.1 Quantitative comparisons</title>
<p>On the ISIC2016 dataset, we compare our method with several representative approaches from recent years and evaluate segmentation performance from multiple aspects. As shown in <xref ref-type="table" rid="T2">Table 2</xref>, Ours achieves the best F1 and IoU and ranks second in both Precision and Recall, demonstrating strong overall performance. Specifically, the F1 reaches 94.14 and the IoU reaches 89.40, which are 1.12 and 2.44 points above the second-best U-Net V2 (93.02 and 86.96). This indicates that our model provides a better balance between segmentation accuracy and region coverage and recovers lesion shapes more precisely. The Precision reaches 95.48, showing stable control over false positives and helping reduce the misclassification of normal skin areas. The Recall reaches 93.45, indicating a high detection rate for lesion regions, which is important in clinical settings where missed detections must be minimized. The ISIC2016 dataset contains many benign and malignant skin lesions with blurry boundaries and similar textures, making segmentation more challenging. U-Net V2 achieves the highest Precision (96.83) with a Recall (93.14) close to that of Ours, yet its F1 and IoU are lower. DCSAU achieves the highest Recall (94.05), but its Precision (91.42) is lower, which may indicate over-segmentation. In contrast, Ours maintains a better balance across all four metrics, indicating stronger segmentation ability under challenges such as background similarity, boundary ambiguity, and class imbalance commonly found in dermoscopic images.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Quantitative comparison results on the ISIC2016 dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>F1(%)</bold></th>
<th valign="top" align="center"><bold>IoU(%)</bold></th>
<th valign="top" align="center"><bold>Precision(%)</bold></th>
<th valign="top" align="center"><bold>Recall(%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">BAT (<xref ref-type="bibr" rid="B30">30</xref>)</td>
<td valign="top" align="center">91.22</td>
<td valign="top" align="center">84.99</td>
<td valign="top" align="center">93.36</td>
<td valign="top" align="center">91.32</td>
</tr> <tr>
<td valign="top" align="left">FAT-Net (<xref ref-type="bibr" rid="B31">31</xref>)</td>
<td valign="top" align="center">91.58</td>
<td valign="top" align="center">85.42</td>
<td valign="top" align="center">92.36</td>
<td valign="top" align="center">92.79</td>
</tr> <tr>
<td valign="top" align="left">MsRED (<xref ref-type="bibr" rid="B25">25</xref>)</td>
<td valign="top" align="center">91.61</td>
<td valign="top" align="center">85.51</td>
<td valign="top" align="center">93.37</td>
<td valign="top" align="center">91.90</td>
</tr> <tr>
<td valign="top" align="left">MFSNet (<xref ref-type="bibr" rid="B11">11</xref>)</td>
<td valign="top" align="center">92.57</td>
<td valign="top" align="center">86.17</td>
<td valign="top" align="center">93.85</td>
<td valign="top" align="center">91.33</td>
</tr> <tr>
<td valign="top" align="left">SSFormer (<xref ref-type="bibr" rid="B32">32</xref>)</td>
<td valign="top" align="center">91.37</td>
<td valign="top" align="center">85.63</td>
<td valign="top" align="center">90.18</td>
<td valign="top" align="center">93.22</td>
</tr> <tr>
<td valign="top" align="left">CASF-Net (<xref ref-type="bibr" rid="B36">36</xref>)</td>
<td valign="top" align="center">91.46</td>
<td valign="top" align="center">85.50</td>
<td valign="top" align="center">92.26</td>
<td valign="top" align="center">88.22</td>
</tr> <tr>
<td valign="top" align="left">DCSAU (<xref ref-type="bibr" rid="B27">27</xref>)</td>
<td valign="top" align="center">92.72</td>
<td valign="top" align="center">86.42</td>
<td valign="top" align="center">91.42</td>
<td valign="top" align="center"><inline-formula><mml:math id="M79"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>94.05</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">U-vm-unet (<xref ref-type="bibr" rid="B52">52</xref>)</td>
<td valign="top" align="center">92.79</td>
<td valign="top" align="center">86.54</td>
<td valign="top" align="center">93.92</td>
<td valign="top" align="center">91.68</td>
</tr> <tr>
<td valign="top" align="left">U-Net V2 (<xref ref-type="bibr" rid="B28">28</xref>)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M80"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>93.02</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M81"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>86.96</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M82"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>96.83</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">93.14</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center"><inline-formula><mml:math id="M83"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>94.14</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M84"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>89.40</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M85"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>95.48</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M86"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>93.45</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Red indicates the best performance, and blue indicates the second best.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>4.2.2.2 Qualitative comparisons</title>
<p><xref ref-type="fig" rid="F7">Figure 7</xref> presents the visual comparison results on the ISIC2016 dataset. In clinical diagnosis, accurate boundary identification of lesions is important for evaluating the development stage and malignancy of the disease. However, in samples (a)&#x02013;(d), many baseline methods show varying degrees of boundary errors, such as incomplete contours or blurred edges. In contrast, Ours performs more consistently in boundary modeling and produces results that are closer to the ground truth, which is of higher clinical value.</p>
<fig position="float" id="F7">
<label>Figure 7</label>
<caption><p>Qualitative comparison results on the ISIC2016 dataset. Green regions indicate GT areas missed by the prediction (false negatives), while red regions indicate areas incorrectly predicted as lesion (false positives).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0007.tif">
<alt-text>Comparison grid of skin lesion images segmented by various models: BAT, FAT, MsRED, MFSNet, SSFormer, CASF-Net, DCSAU, U-vm-unet, U-Net V2, and Ours, with ground truth (GT) provided. Each row (a-f) shows a skin lesion image followed by the segmented output from each model, highlighting differences in segmentation accuracy.</alt-text>
</graphic>
</fig>
<p>In sample (e), there is a dark skin area in the lower left region with texture similar to the lesion. FAT-Net, U-vm-unet, and U-Net V2 all misclassify this area as a lesion, producing obvious false positives. MFSNet successfully captures the main region but misses parts near the boundary, which affects the overall contour quality.</p>
<p>Sample (f) contains a lesion with complex boundaries and fine internal structure. The lesion is located near the image edge, and the background interference is strong. These factors make boundary detection more difficult. Most baseline methods show shifted or broken contours in this case. Although Ours also makes some errors, its prediction is still the closest to the ground truth and better preserves both the overall shape and boundary continuity.</p>
<p>To further evaluate the model&#x00027;s ability to handle blurry boundaries, we selected a group of representative samples and performed local zoom-in comparisons, as shown in <xref ref-type="fig" rid="F8">Figure 8</xref>. The results show that CASF-Net and U-Net V2 produce relatively coarse boundary predictions. Their outputs often show broken or expanded contours, which do not match the ground truth accurately. In contrast, Ours shows better alignment with the ground truth boundaries and performs more stably in preserving fine structural details. These results further demonstrate that our method has stronger fine-grained boundary perception and achieves higher localization accuracy for targets with unclear edges.</p>
<fig position="float" id="F8">
<label>Figure 8</label>
<caption><p>Detail comparison on the ISIC2016 dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0008.tif">
<alt-text>Comparison of skin lesion segmentation methods. The top row shows the original lesion image followed by results from CASF-Net, U-vm-unet, U-Net V2, and another method labeled &#x0201C;Ours.&#x0201D; The bottom row features segmented outputs highlighting contours in red against a black background. Each method is framed for focus on segmentation accuracy.</alt-text>
</graphic>
</fig>
</sec></sec>
<sec>
<title>4.2.3 PH2</title>
<sec>
<title>4.2.3.1 Quantitative comparisons</title>
<p><xref ref-type="table" rid="T3">Table 3</xref> shows the test results on the PH2 dataset, which is used as an external validation set. The model is trained on the ISIC2016 dataset. As shown, Ours achieves the highest scores in the two key metrics, F1 and IoU, with values of 94.44 and 89.70, respectively. These exceed the second-best MFSNet (91.42 and 84.19) by 3.02 and 5.51 points, demonstrating strong overall segmentation ability and good generalization across datasets.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Quantitative comparison results on the PH2 dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>F1(%)</bold></th>
<th valign="top" align="center"><bold>IoU(%)</bold></th>
<th valign="top" align="center"><bold>Precision(%)</bold></th>
<th valign="top" align="center"><bold>Recall(%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">BAT (<xref ref-type="bibr" rid="B30">30</xref>)</td>
<td valign="top" align="center">89.24</td>
<td valign="top" align="center">81.62</td>
<td valign="top" align="center"><inline-formula><mml:math id="M87"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>96.33</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">84.99</td>
</tr> <tr>
<td valign="top" align="left">FAT-Net (<xref ref-type="bibr" rid="B31">31</xref>)</td>
<td valign="top" align="center">90.66</td>
<td valign="top" align="center">83.54</td>
<td valign="top" align="center">86.13</td>
<td valign="top" align="center"><inline-formula><mml:math id="M88"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>97.14</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">MsRED (<xref ref-type="bibr" rid="B25">25</xref>)</td>
<td valign="top" align="center">88.61</td>
<td valign="top" align="center">80.65</td>
<td valign="top" align="center">84.35</td>
<td valign="top" align="center">95.97</td>
</tr> <tr>
<td valign="top" align="left">MFSNet (<xref ref-type="bibr" rid="B11">11</xref>)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M89"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>91.42</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M90"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>84.19</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">89.12</td>
<td valign="top" align="center">93.84</td>
</tr> <tr>
<td valign="top" align="left">SSFormer (<xref ref-type="bibr" rid="B32">32</xref>)</td>
<td valign="top" align="center">90.77</td>
<td valign="top" align="center">83.98</td>
<td valign="top" align="center">89.04</td>
<td valign="top" align="center">94.65</td>
</tr> <tr>
<td valign="top" align="left">CASF-Net (<xref ref-type="bibr" rid="B36">36</xref>)</td>
<td valign="top" align="center">90.85</td>
<td valign="top" align="center">83.86</td>
<td valign="top" align="center">86.92</td>
<td valign="top" align="center"><inline-formula><mml:math id="M91"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>96.60</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">DCSAU (<xref ref-type="bibr" rid="B27">27</xref>)</td>
<td valign="top" align="center">87.33</td>
<td valign="top" align="center">77.51</td>
<td valign="top" align="center">90.27</td>
<td valign="top" align="center">84.58</td>
</tr> <tr>
<td valign="top" align="left">U-vm-unet (<xref ref-type="bibr" rid="B52">52</xref>)</td>
<td valign="top" align="center">86.87</td>
<td valign="top" align="center">76.79</td>
<td valign="top" align="center">86.43</td>
<td valign="top" align="center">87.32</td>
</tr> <tr>
<td valign="top" align="left">U-Net V2 (<xref ref-type="bibr" rid="B28">28</xref>)</td>
<td valign="top" align="center">90.70</td>
<td valign="top" align="center">82.98</td>
<td valign="top" align="center">92.88</td>
<td valign="top" align="center">95.28</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center"><inline-formula><mml:math id="M92"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>94.44</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M93"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>89.70</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M94"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>93.73</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">95.37</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Red indicates the best performance, and blue indicates the second best.</p>
</table-wrap-foot>
</table-wrap>
<p>Although the Recall of Ours (95.37) is not the highest among all methods, it remains at a high level. It is lower than that of FAT-Net (97.14), CASF-Net (96.60), and MsRED (95.97). One possible explanation is that lesions in the PH2 dataset are more regular in shape and have relatively clear boundaries, so methods that tend to enlarge the predicted regions, such as FAT-Net and CASF-Net, gain Recall. However, this tendency lowers Precision (86.13 and 86.92, compared with 93.73 for Ours) and reduces both IoU and F1. In contrast, Ours keeps a good balance: it maintains a high Recall while avoiding over-segmentation, which helps improve boundary accuracy and overall model stability.</p></sec>
<sec>
<title>4.2.3.2 Qualitative comparisons</title>
<p><xref ref-type="fig" rid="F9">Figure 9</xref> shows the segmentation results of several samples from the PH2 dataset. Overall, most methods can outline the general shape of the lesion, but there are still clear differences in boundary details and the handling of interference regions. In samples (a) and (b), where the lesion boundaries are relatively clear, U-Net V2 and CASF-Net produce coarser edges. In contrast, Ours generates contours that better align with the ground truth, with smoother and more complete boundaries, especially in the transition areas around the lesion. In sample (f), the lesion is large and structurally complex. Methods such as BAT and U-Net V2 show varying degrees of over-segmentation, with a large number of false positive areas (in red). Although Ours also has some prediction errors, its boundaries are more compact and the over-segmentation is significantly reduced.</p>
<fig position="float" id="F9">
<label>Figure 9</label>
<caption><p>Qualitative comparison results on the PH2 dataset. Green regions indicate areas missed with respect to the GT (false negatives), while red regions indicate areas incorrectly predicted as lesion (false positives).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0009.tif">
<alt-text>Comparison of skin lesion segmentation techniques is displayed. Column headers list methods: BAT, FAT, MsRED, MFSNet, SSFormer, CASF-Net, DCSAU, U-vm-unet, U-Net V2, Ours, GT. Each row (a-f) shows the original image followed by segmented outputs from each technique, highlighting differences in border detection and shape accuracy.</alt-text>
</graphic>
</fig>
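<p>The color coding used in the comparison figure can be reproduced with a short sketch, given a binary prediction and its ground truth. The function name error_overlay is hypothetical, and rendering correctly predicted pixels in white on a black background is an assumption made for illustration.</p>
<preformat><![CDATA[
```python
import numpy as np


def error_overlay(pred, gt):
    """Build an RGB map: red = false positive, green = false negative."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    overlay = np.zeros(gt.shape + (3,), dtype=np.uint8)  # black background
    overlay[pred & gt] = (255, 255, 255)   # correct lesion pixels (assumed white)
    overlay[pred & ~gt] = (255, 0, 0)      # predicted as lesion, but not in GT
    overlay[~pred & gt] = (0, 255, 0)      # lesion in GT, but missed
    return overlay


# Toy example (hypothetical data): the prediction is shifted one pixel right.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 2:4] = True

overlay = error_overlay(pred, gt)
print(overlay.shape)
```
]]></preformat>
<p>A mask that over-segments produces mostly red pixels around the lesion, whereas a mask that under-segments produces mostly green pixels inside it.</p>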
<p>We also select a group of samples for local zoom-in comparison, as shown in <xref ref-type="fig" rid="F10">Figure 10</xref>. In these samples, the lesion regions are located within a liquid environment, and bubbles above the lesions introduce interference. This causes CASF-Net, U-vm-unet, and U-Net V2 to produce severe misclassifications. Although Ours also shows some boundary inaccuracies due to the blurred edges, its prediction remains the closest to the ground truth.</p>
<fig position="float" id="F10">
<label>Figure 10</label>
<caption><p>Detail comparison on the PH2 dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0010.tif">
<alt-text>Comparison of image segmentation results using different methods: the original image is shown with the area of interest highlighted. Results from CASF-Net, U-vm-unet, U-Net V2, and the proposed method are displayed, each showing segmentation in red and white, highlighting varying accuracy and precision levels.</alt-text>
</graphic>
</fig>
</sec></sec>
<sec>
<title>4.2.4 ISIC2017</title>
<sec>
<title>4.2.4.1 Quantitative comparisons</title>
<p><xref ref-type="table" rid="T4">Table 4</xref> shows the evaluation results on the ISIC2017 dataset. Ours ranks first in three key metrics, F1, IoU, and Recall, with scores of 88.10, 80.06, and 94.84, respectively. These results show that our model achieves strong overall segmentation quality and high lesion detection sensitivity. In particular, its Recall is 6.09 points higher than the second-best result (BAT, 88.75), indicating that our model is more sensitive to lesion regions and can reduce missed detections. This is useful for clinical applications that require high recall. Compared to methods such as U-Net V2 and MFSNet, Ours maintains a high Recall while achieving a better balance in IoU and F1, showing better boundary modeling ability and practical value.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Quantitative comparison results on the ISIC2017 dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>F1(%)</bold></th>
<th valign="top" align="center"><bold>IoU(%)</bold></th>
<th valign="top" align="center"><bold>Precision(%)</bold></th>
<th valign="top" align="center"><bold>Recall(%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">BAT (<xref ref-type="bibr" rid="B30">30</xref>)</td>
<td valign="top" align="center">84.85</td>
<td valign="top" align="center"><inline-formula><mml:math id="M95"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>76.23</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">86.64</td>
<td valign="top" align="center"><inline-formula><mml:math id="M96"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>88.75</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">FAT-Net (<xref ref-type="bibr" rid="B31">31</xref>)</td>
<td valign="top" align="center">84.79</td>
<td valign="top" align="center">76.06</td>
<td valign="top" align="center">89.08</td>
<td valign="top" align="center">85.93</td>
</tr> <tr>
<td valign="top" align="left">MsRED (<xref ref-type="bibr" rid="B25">25</xref>)</td>
<td valign="top" align="center">84.43</td>
<td valign="top" align="center">75.79</td>
<td valign="top" align="center">91.21</td>
<td valign="top" align="center">83.61</td>
</tr> <tr>
<td valign="top" align="left">MFSNet (<xref ref-type="bibr" rid="B11">11</xref>)</td>
<td valign="top" align="center">85.42</td>
<td valign="top" align="center">74.55</td>
<td valign="top" align="center"><inline-formula><mml:math id="M97"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>91.91</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">79.79</td>
</tr> <tr>
<td valign="top" align="left">SSFormer (<xref ref-type="bibr" rid="B32">32</xref>)</td>
<td valign="top" align="center">83.43</td>
<td valign="top" align="center">71.30</td>
<td valign="top" align="center">81.51</td>
<td valign="top" align="center">85.54</td>
</tr> <tr>
<td valign="top" align="left">CASF-Net (<xref ref-type="bibr" rid="B36">36</xref>)</td>
<td valign="top" align="center">84.20</td>
<td valign="top" align="center">72.71</td>
<td valign="top" align="center">85.14</td>
<td valign="top" align="center">84.51</td>
</tr> <tr>
<td valign="top" align="left">DCSAU (<xref ref-type="bibr" rid="B27">27</xref>)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M98"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>85.92</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">75.32</td>
<td valign="top" align="center">83.93</td>
<td valign="top" align="center">88.01</td>
</tr> <tr>
<td valign="top" align="left">U-vm-unet (<xref ref-type="bibr" rid="B52">52</xref>)</td>
<td valign="top" align="center">85.26</td>
<td valign="top" align="center">74.93</td>
<td valign="top" align="center">89.51</td>
<td valign="top" align="center">81.39</td>
</tr> <tr>
<td valign="top" align="left">U-Net V2 (<xref ref-type="bibr" rid="B28">28</xref>)</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">73.90</td>
<td valign="top" align="center"><inline-formula><mml:math id="M99"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>96.26</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">82.86</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center"><inline-formula><mml:math id="M100"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>88.10</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M101"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>80.06</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">83.13</td>
<td valign="top" align="center"><inline-formula><mml:math id="M102"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>94.84</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Red indicates the best performance, and blue indicates the second best.</p>
</table-wrap-foot>
</table-wrap>
<p>However, in terms of Precision, Ours scores lower, at 83.13, clearly below U-Net V2&#x00027;s 96.26 and MFSNet&#x00027;s 91.91. The ISIC2017 dataset contains more complex lesions with blurry boundaries and irregular shapes. While trying to capture lesion regions more completely, the model may also include neutral areas near the lesion boundary or non-lesion areas with similar appearance. This increases the number of false positives and leads to a lower Precision score.</p></sec>
<sec>
<title>4.2.4.2 Qualitative comparisons</title>
<p>As shown in <xref ref-type="fig" rid="F11">Figure 11</xref>, the samples in the ISIC2017 dataset often have more blurred boundary information. This leads to boundary prediction errors across all compared methods. In sample (b), the lesion boundaries are highly similar to the surrounding skin texture, causing all models to misidentify the boundary. In sample (d), although the lesion boundary is relatively clear, the surrounding skin is more complex. As a result, U-vm-unet mistakenly includes the ruler at the bottom as part of the lesion. In sample (e), the lesion gradually darkens from left to right. Most methods can accurately detect the boundary on the right where the contrast is high, but fail to identify the blurry boundary on the left. In contrast, Ours achieves a result that is closest to the ground truth.</p>
<fig position="float" id="F11">
<label>Figure 11</label>
<caption><p>Qualitative comparison results on the ISIC2017 dataset. Green regions indicate areas missed with respect to the GT (false negatives), while red regions indicate areas incorrectly predicted as lesion (false positives).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0011.tif">
<alt-text>Various skin lesion images are displayed in the first column. Subsequent columns show segmentation results using different methods, labeled from BAT to GT, across six rows (a to f). Each method contrasts lesion boundaries against a black background with red, green, or white overlays, illustrating differences in segmentation accuracy and technique.</alt-text>
</graphic>
</fig>
<p>In <xref ref-type="fig" rid="F12">Figure 12</xref>, we present a local zoom-in comparison. The blurred and small-sized lesion increases the difficulty of segmentation. Compared with DCSAU and two other methods, Ours shows better performance in boundary prediction. However, Ours is still affected by the surrounding environment and mistakenly identifies hair in the lower-left area as part of the boundary. This indicates that there is still room for improvement in handling fine-grained features.</p>
<fig position="float" id="F12">
<label>Figure 12</label>
<caption><p>Detail comparison on the ISIC2017 dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0012.tif">
<alt-text>Comparison of skin lesion segmentation methods. Two rows show different lesion images. Columns include original images, and results from DCSAU, U-vm-unet, U-Net V2, and Ours. Segmented areas are marked in white, green, and red with blue bounding boxes.</alt-text>
</graphic>
</fig></sec></sec></sec>
<sec>
<title>4.3 Ablation studies</title>
<sec>
<title>4.3.1 GLAS</title>
<p>To evaluate the contribution of each module in the model, we conducted a systematic ablation study on the GLAS dataset. The quantitative results obtained after removing different components are presented in <xref ref-type="table" rid="T5">Table 5</xref>, and the corresponding visual segmentation outputs are shown in <xref ref-type="fig" rid="F13">Figure 13</xref>, offering a clear view of how each module affects the final performance.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Ablation study results on the GLAS dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>F1(%)</bold></th>
<th valign="top" align="center"><bold>IoU(%)</bold></th>
<th valign="top" align="center"><bold>Precision(%)</bold></th>
<th valign="top" align="center"><bold>Recall(%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">(1) w/o FIIR</td>
<td valign="top" align="center">79.68</td>
<td valign="top" align="center">67.90</td>
<td valign="top" align="center">81.94</td>
<td valign="top" align="center">80.65</td>
</tr> <tr>
<td valign="top" align="left">(2) w/o <inline-formula><mml:math id="M103"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> summary</td>
<td valign="top" align="center">84.58</td>
<td valign="top" align="center">75.42</td>
<td valign="top" align="center">78.18</td>
<td valign="top" align="center"><inline-formula><mml:math id="M104"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>96.10</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">(3) w/o <inline-formula><mml:math id="M105"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> concat</td>
<td valign="top" align="center">76.27</td>
<td valign="top" align="center">64.35</td>
<td valign="top" align="center">66.37</td>
<td valign="top" align="center"><inline-formula><mml:math id="M106"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>96.31</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">(4) w/o <inline-formula><mml:math id="M107"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> summary</td>
<td valign="top" align="center">88.64</td>
<td valign="top" align="center">80.31</td>
<td valign="top" align="center">89.77</td>
<td valign="top" align="center">87.79</td>
</tr> <tr>
<td valign="top" align="left">(5) w/o <inline-formula><mml:math id="M108"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> concat</td>
<td valign="top" align="center"><inline-formula><mml:math id="M109"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>89.60</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M110"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>81.91</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M111"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>89.87</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">88.72</td>
</tr> <tr>
<td valign="top" align="left">(6) w/o <italic>E</italic><sub><italic>M</italic></sub></td>
<td valign="top" align="center">85.01</td>
<td valign="top" align="center">75.05</td>
<td valign="top" align="center">84.18</td>
<td valign="top" align="center">87.77</td>
</tr> <tr>
<td valign="top" align="left">(7) Ours</td>
<td valign="top" align="center"><inline-formula><mml:math id="M112"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>89.73</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M113"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>82.07</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M114"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>91.01</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">89.18</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Red indicates the best performance, and blue indicates the second best.</p>
</table-wrap-foot>
</table-wrap>
<fig position="float" id="F13">
<label>Figure 13</label>
<caption><p>Qualitative comparison of the ablation experiments on the GLAS dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0013.tif">
<alt-text>Histology image comparison showing two primary images on the left. The top row contains six segmentation results labeled (1) to (7), displaying variations with different red and green highlight patterns. The final image in each row is labeled &#x0201C;GT&#x0201D; and shows a black and white segmentation.</alt-text>
</graphic>
</fig>
<p>As can be seen from the visual results, w/o FIIR (1) leads to evident deficiencies along the object boundaries and causes incomplete structural predictions. This demonstrates that FIIR plays an important role in enhancing pixel-level detail and supporting the extraction of informative segmentation features. When the adaptive feature modulation module <inline-formula><mml:math id="M115"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is replaced with feature summary (2) or concat (3), the model tends to over-segment, as reflected in the increased number of false positives in the predicted maps. Although the recall remains high, reaching 96.10 and 96.31, respectively, the precision drops to 78.18 and 66.37, suggesting that the model becomes less capable of regulating foreground and background responses.</p>
<p>Moreover, we adopt summary (4) and concat (5) operations to replace the bi-directional fusion module <inline-formula><mml:math id="M116"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. The predicted structures remain mostly intact, but the boundaries are less precise, indicating that this module still contributes to enhancing local detail and structural consistency. Further, the removal of the multi-branch vision mamba module <italic>E</italic><sub><italic>M</italic></sub> (6) results in a decrease in both IoU and F1, and the predicted boundaries become less distinct. This shows that <italic>E</italic><sub><italic>M</italic></sub> plays a critical role in aggregating hierarchical features and is particularly helpful in capturing complex object shapes.</p>
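<p>The two baseline fusion operations used in settings (2) to (5) can be sketched as follows, under the assumption that "summary" denotes element-wise summation and "concat" denotes channel-wise concatenation. The 1 x 1 projection that restores the channel count, the feature shapes, and all function names are hypothetical and may differ from the actual implementation.</p>
<preformat><![CDATA[
```python
import numpy as np


def fuse_sum(feat_a, feat_b):
    """Element-wise summation of two (C, H, W) feature maps."""
    return feat_a + feat_b


def fuse_concat(feat_a, feat_b, weight):
    """Channel concatenation followed by a 1 x 1 projection back to C channels.

    weight has shape (C, 2C); the projection is an assumption of this sketch.
    """
    stacked = np.concatenate([feat_a, feat_b], axis=0)   # (2C, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked)     # (C, H, W)


rng = np.random.default_rng(0)
channels, height, width = 4, 8, 8
feat_a = rng.standard_normal((channels, height, width))
feat_b = rng.standard_normal((channels, height, width))

# With weight = [I | I], the concat baseline reduces to plain summation.
weight = np.hstack([np.eye(channels), np.eye(channels)])
fused = fuse_concat(feat_a, feat_b, weight)
print(np.allclose(fused, fuse_sum(feat_a, feat_b)))
```
]]></preformat>
<p>Both baselines treat every spatial position identically, so neither can reweight foreground and background responses in a content-dependent way.</p>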
<p>Among all the configurations, the complete model (7) achieves the best overall performance. It obtains an F1 of 89.73, an IoU of 82.07, a precision of 91.01, and a recall of 89.18. Its visual results are also the most aligned with the ground truth annotations. These observations confirm that the synergy between the proposed modules leads to significant improvements in both segmentation accuracy and visual quality.</p></sec>
<sec>
<title>4.3.2 ISIC2016</title>
<p>We further validate the effect of each model component by conducting ablation experiments on the ISIC2016 dataset. <xref ref-type="table" rid="T6">Table 6</xref> reports the numerical performance under different ablation settings, while <xref ref-type="fig" rid="F14">Figure 14</xref> illustrates the corresponding segmentation outputs for visual comparison. w/o FIIR (1) leads to a noticeable decline in IoU and F1, which drop to 84.33 and 90.81, respectively. Although recall and precision are reasonably balanced, the visual outputs exhibit weaker boundary fidelity, particularly in low-contrast areas, where the predicted masks tend to deviate from the lesion margins. Interestingly, w/o <inline-formula><mml:math id="M117"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> concat (3) produces the highest recall at 98.48, suggesting that the model becomes more permissive in capturing lesion pixels. However, this comes at the cost of increased false positives, as reflected in the lower precision (89.19) and the presence of redundant red areas in the predicted masks. w/o <inline-formula><mml:math id="M118"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> summary (2) causes precision to decrease to 86.82, reinforcing that the absence of the modulation structure compromises the foreground-background balancing mechanism.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Ablation study results on the ISIC2016 dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>F1(%)</bold></th>
<th valign="top" align="center"><bold>IoU(%)</bold></th>
<th valign="top" align="center"><bold>Precision(%)</bold></th>
<th valign="top" align="center"><bold>Recall(%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">(1) w/o FIIR</td>
<td valign="top" align="center">90.81</td>
<td valign="top" align="center">84.33</td>
<td valign="top" align="center">91.76</td>
<td valign="top" align="center">91.87</td>
</tr> <tr>
<td valign="top" align="left">(2) w/o <inline-formula><mml:math id="M119"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> summary</td>
<td valign="top" align="center">91.64</td>
<td valign="top" align="center">85.45</td>
<td valign="top" align="center">86.82</td>
<td valign="top" align="center"><inline-formula><mml:math id="M120"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>97.09</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">(3) w/o <inline-formula><mml:math id="M121"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> concat</td>
<td valign="top" align="center"><inline-formula><mml:math id="M123"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>93.75</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M124"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>88.86</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">89.19</td>
<td valign="top" align="center"><inline-formula><mml:math id="M125"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>98.48</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left">(4) w/o <inline-formula><mml:math id="M126"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> summary</td>
<td valign="top" align="center">91.51</td>
<td valign="top" align="center">85.14</td>
<td valign="top" align="center"><inline-formula><mml:math id="M127"><mml:mrow><mml:mstyle mathcolor="#0070c0"><mml:mtext>95.48</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">88.16</td>
</tr> <tr>
<td valign="top" align="left">(5) w/o <inline-formula><mml:math id="M128"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> concat</td>
<td valign="top" align="center">88.77</td>
<td valign="top" align="center">81.19</td>
<td valign="top" align="center">89.25</td>
<td valign="top" align="center">91.17</td>
</tr> <tr>
<td valign="top" align="left">(6) w/o <italic>E</italic><sub><italic>M</italic></sub></td>
<td valign="top" align="center">90.79</td>
<td valign="top" align="center">84.04</td>
<td valign="top" align="center">95.23</td>
<td valign="top" align="center">87.37</td>
</tr> <tr>
<td valign="top" align="left">(7) Ours</td>
<td valign="top" align="center"><inline-formula><mml:math id="M129"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>94.14</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M130"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>89.40</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M131"><mml:mrow><mml:mstyle mathcolor="#ff0000"><mml:mtext>95.48</mml:mtext></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">93.45</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Red indicates the best performance, and blue indicates the second best.</p>
</table-wrap-foot>
</table-wrap>
<fig position="float" id="F14">
<label>Figure 14</label>
<caption><p>Qualitative comparison of the ablation experiments on the ISIC2016 dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmed-12-1661984-g0014.tif">
<alt-text>Two rows of images show segmented analysis of skin lesions. The first column displays the original lesion images. Columns labeled (1) to (7) display various segmentation results, with white and green indicating different segmentation areas on a black background. The last column labeled GT shows ground truth segmentation in white. Top and bottom rows contain similar patterns.</alt-text>
</graphic>
</fig>
<p>To verify the effectiveness of the bi-directional fusion module <inline-formula><mml:math id="M132"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, we use <inline-formula><mml:math id="M133"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> summary (4) and <inline-formula><mml:math id="M134"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> concat (5) instead of <inline-formula><mml:math id="M135"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. Specifically, w/o <inline-formula><mml:math id="M136"><mml:msubsup><mml:mrow><mml:mi>&#x003A6;</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> concat (5) has a clearer negative effect, with IoU decreasing to 81.19, accompanied by more pronounced boundary irregularities in the visualization. w/o <italic>E</italic><sub><italic>M</italic></sub> (6) also scores lower than the full model on IoU and F1. This suggests that although the primary structure still functions, the lack of high-low feature interaction leads to reduced segmentation confidence near ambiguous regions. With all components intact, the full model (7) achieves the strongest overall performance: F1 reaches 94.14, IoU improves to 89.40, and precision reaches 95.48, while recall remains high at 93.45. The output masks are tightly aligned with the lesion contours, even under challenging conditions such as blurry or low-contrast boundaries, confirming the complementary nature of all proposed modules.</p></sec></sec></sec>
<sec sec-type="conclusions" id="s5">
<title>5 Conclusion</title>
<p>In this paper, we propose a multi-interactive feature embedding learning method for medical image segmentation. The core idea is to establish information interaction between the reconstruction task and the segmentation task, thus achieving superior segmentation performance. Specifically, an adaptive feature modulation module can efficiently fuse foreground and background features, thereby extracting pixel-level fine-grained features. Then, a bi-directional fusion module integrates important feature information between two different tasks, enhancing semantic understanding and detail retention. Finally, a multi-branch visual mamba effectively captures structural details by extracting multi-scale features in parallel, thus improving the model capability in terms of local texture and global semantics. Extensive experiments demonstrate that the proposed method can accurately segment the lesion region compared to other state-of-the-art segmentation methods.</p></sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec sec-type="ethics-statement" id="s7">
<title>Ethics statement</title>
<p>Ethical approval was not required for the studies on humans in accordance with the local legislation and institutional requirements because only commercially available established cell lines were used. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.</p>
</sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>YH: Conceptualization, Data curation, Investigation, Methodology, Software, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. YL: Funding acquisition, Resources, Supervision, Validation, Visualization, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research and/or publication of this article. This work was funded by the National Natural Science Foundation of China (Youth Fund, 81904324).</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s10">
<title>Generative AI statement</title>
<p>The author(s) declare that no Gen AI was used in the creation of this manuscript.</p>
<p>Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.</p></sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<label>1.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Al-Masni</surname> <given-names>MA</given-names></name> <name><surname>Al-Shamiri</surname> <given-names>AK</given-names></name> <name><surname>Hussain</surname> <given-names>D</given-names></name> <name><surname>Gu</surname> <given-names>YH</given-names></name></person-group>. <article-title>A unified multi-task learning model with joint reverse optimization for simultaneous skin lesion segmentation and diagnosis</article-title>. <source>Bioengineering</source>. (<year>2024</year>) <volume>11</volume>:<fpage>1173</fpage>. <pub-id pub-id-type="doi">10.3390/bioengineering11111173</pub-id><pub-id pub-id-type="pmid">39593832</pub-id></citation></ref>
<ref id="B2">
<label>2.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Al-Absi</surname> <given-names>AA</given-names></name> <name><surname>Fu</surname> <given-names>R</given-names></name> <name><surname>Ebrahim</surname> <given-names>N</given-names></name> <name><surname>Al-Absi</surname> <given-names>MA</given-names></name> <name><surname>Kang</surname> <given-names>DK</given-names></name></person-group>. <article-title>Brain tumour segmentation and grading using local and global context-aggregated attention network architecture</article-title>. <source>Bioengineering</source>. (<year>2025</year>) <volume>12</volume>:<fpage>552</fpage>. <pub-id pub-id-type="doi">10.3390/bioengineering12050552</pub-id><pub-id pub-id-type="pmid">40428171</pub-id></citation></ref>
<ref id="B3">
<label>3.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>Z</given-names></name> <name><surname>Yuan</surname> <given-names>G</given-names></name> <name><surname>Hua</surname> <given-names>Z</given-names></name> <name><surname>Li</surname> <given-names>J</given-names></name></person-group>. <article-title>Diffusion model-based text-guided enhancement network for medical image segmentation</article-title>. <source>Expert Syst Appl</source>. (<year>2024</year>) <volume>249</volume>:<fpage>123549</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2024.123549</pub-id></citation>
</ref>
<ref id="B4">
<label>4.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>W</given-names></name> <name><surname>Li</surname> <given-names>Z</given-names></name></person-group>. <article-title>Curriculum consistency learning and multi-scale contrastive constraint in semi-supervised medical image segmentation</article-title>. <source>Bioengineering</source>. (<year>2023</year>) <volume>11</volume>:<fpage>10</fpage>. <pub-id pub-id-type="doi">10.3390/bioengineering11010010</pub-id><pub-id pub-id-type="pmid">38247886</pub-id></citation></ref>
<ref id="B5">
<label>5.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fu</surname> <given-names>Z</given-names></name> <name><surname>Li</surname> <given-names>J</given-names></name> <name><surname>Hua</surname> <given-names>Z</given-names></name> <name><surname>Fan</surname> <given-names>L</given-names></name></person-group>. <article-title>Deep supervision feature refinement attention network for medical image segmentation</article-title>. <source>Eng Appl Artif Intell</source>. (<year>2023</year>) <volume>125</volume>:<fpage>106666</fpage>. <pub-id pub-id-type="doi">10.1016/j.engappai.2023.106666</pub-id></citation>
</ref>
<ref id="B6">
<label>6.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Antonelli</surname> <given-names>M</given-names></name> <name><surname>Reinke</surname> <given-names>A</given-names></name> <name><surname>Bakas</surname> <given-names>S</given-names></name> <name><surname>Farahani</surname> <given-names>K</given-names></name> <name><surname>Kopp-Schneider</surname> <given-names>A</given-names></name> <name><surname>Landman</surname> <given-names>BA</given-names></name> <etal/></person-group>. <article-title>The medical segmentation decathlon</article-title>. <source>Nat Commun</source>. (<year>2022</year>) <volume>13</volume>:<fpage>4128</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-022-30695-9</pub-id><pub-id pub-id-type="pmid">35840566</pub-id></citation></ref>
<ref id="B7">
<label>7.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>Z</given-names></name> <name><surname>Jian</surname> <given-names>M</given-names></name> <name><surname>Wang</surname> <given-names>GG</given-names></name></person-group>. <article-title>ConvUNeXt: an efficient convolution neural network for medical image segmentation</article-title>. <source>Knowl Based Syst</source>. (<year>2022</year>) <volume>253</volume>:<fpage>109512</fpage>. <pub-id pub-id-type="doi">10.1016/j.knosys.2022.109512</pub-id></citation>
</ref>
<ref id="B8">
<label>8.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>M&#x000FC;ller</surname> <given-names>D</given-names></name> <name><surname>Kramer</surname> <given-names>F</given-names></name></person-group>. <article-title>MIScnn: a framework for medical image segmentation with convolutional neural networks and deep learning</article-title>. <source>BMC Med Imaging</source>. (<year>2021</year>) <volume>21</volume>:<fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1186/s12880-020-00543-7</pub-id><pub-id pub-id-type="pmid">33461500</pub-id></citation></ref>
<ref id="B9">
<label>9.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ronneberger</surname> <given-names>O</given-names></name> <name><surname>Fischer</surname> <given-names>P</given-names></name> <name><surname>Brox</surname> <given-names>T</given-names></name></person-group>. <article-title>U-net: convolutional networks for biomedical image segmentation</article-title>. In: <source>Medical Image Computing and Computer-assisted Intervention-MICCAI 2015: 18th International Conference, Munich, Germany, October 5&#x02013;9, 2015, Proceedings, Part III 18</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2015</year>). p. <fpage>234</fpage>&#x02013;<lpage>241</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-24574-4_28</pub-id></citation>
</ref>
<ref id="B10">
<label>10.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>Z</given-names></name> <name><surname>Rahman Siddiquee</surname> <given-names>MM</given-names></name> <name><surname>Tajbakhsh</surname> <given-names>N</given-names></name> <name><surname>Liang</surname> <given-names>J</given-names></name></person-group>. <article-title>Unet&#x0002B;&#x0002B;: a nested u-net architecture for medical image segmentation</article-title>. In: <source>Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2018</year>). p. <fpage>3</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-00889-5_1</pub-id><pub-id pub-id-type="pmid">32613207</pub-id></citation></ref>
<ref id="B11">
<label>11.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Basak</surname> <given-names>H</given-names></name> <name><surname>Kundu</surname> <given-names>R</given-names></name> <name><surname>Sarkar</surname> <given-names>R</given-names></name></person-group>. <article-title>MFSNet: a multi focus segmentation network for skin lesion segmentation</article-title>. <source>Pattern Recognit</source>. (<year>2022</year>) <volume>128</volume>:<fpage>108673</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2022.108673</pub-id></citation>
</ref>
<ref id="B12">
<label>12.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>J</given-names></name> <name><surname>Yuan</surname> <given-names>G</given-names></name> <name><surname>Guo</surname> <given-names>C</given-names></name> <name><surname>Gang</surname> <given-names>X</given-names></name> <name><surname>Zheng</surname> <given-names>M</given-names></name></person-group>. <article-title>SW-UNet: a U-net fusing sliding window transformer block with CNN for segmentation of lung nodules</article-title>. <source>Front Med</source>. (<year>2023</year>) <volume>10</volume>:<fpage>1273441</fpage>. <pub-id pub-id-type="doi">10.3389/fmed.2023.1273441</pub-id><pub-id pub-id-type="pmid">37841008</pub-id></citation></ref>
<ref id="B13">
<label>13.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Amin</surname> <given-names>J</given-names></name> <name><surname>Azhar</surname> <given-names>M</given-names></name> <name><surname>Arshad</surname> <given-names>H</given-names></name> <name><surname>Zafar</surname> <given-names>A</given-names></name> <name><surname>Kim</surname> <given-names>SH</given-names></name></person-group>. <article-title>Skin-lesion segmentation using boundary-aware segmentation network and classification based on a mixture of convolutional and transformer neural networks</article-title>. <source>Front Med</source>. (<year>2025</year>) <volume>12</volume>:<fpage>1524146</fpage>. <pub-id pub-id-type="doi">10.3389/fmed.2025.1524146</pub-id><pub-id pub-id-type="pmid">40130244</pub-id></citation></ref>
<ref id="B14">
<label>14.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>X</given-names></name> <name><surname>Tan</surname> <given-names>H</given-names></name> <name><surname>Wang</surname> <given-names>W</given-names></name> <name><surname>Chen</surname> <given-names>Z</given-names></name></person-group>. <article-title>Deep learning based retinal vessel segmentation and hypertensive retinopathy quantification using heterogeneous features cross-attention neural network</article-title>. <source>Front Med</source>. (<year>2024</year>) <volume>11</volume>:<fpage>1377479</fpage>. <pub-id pub-id-type="doi">10.3389/fmed.2024.1377479</pub-id><pub-id pub-id-type="pmid">38841586</pub-id></citation></ref>
<ref id="B15">
<label>15.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A</given-names></name> <name><surname>Shazeer</surname> <given-names>N</given-names></name> <name><surname>Parmar</surname> <given-names>N</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J</given-names></name> <name><surname>Jones</surname> <given-names>L</given-names></name> <name><surname>Gomez</surname> <given-names>AN</given-names></name> <etal/></person-group>. <article-title>Attention is all you need</article-title>. <source>Adv Neural Inf Process Syst</source>. (<year>2017</year>) <volume>30</volume>.</citation>
</ref>
<ref id="B16">
<label>16.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meng</surname> <given-names>W</given-names></name> <name><surname>Liu</surname> <given-names>S</given-names></name> <name><surname>Wang</surname> <given-names>H</given-names></name></person-group>. <article-title>AFC-Unet: attention-fused full-scale CNN-transformer unet for medical image segmentation</article-title>. <source>Biomed Signal Process Control</source>. (<year>2025</year>) <volume>99</volume>:<fpage>106839</fpage>. <pub-id pub-id-type="doi">10.1016/j.bspc.2024.106839</pub-id></citation>
</ref>
<ref id="B17">
<label>17.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>S</given-names></name> <name><surname>Li</surname> <given-names>X</given-names></name></person-group>. <article-title>HResFormer: hybrid residual transformer for volumetric medical image segmentation</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. (<year>2025</year>) <volume>36</volume>:<fpage>10558</fpage>&#x02013;<lpage>66</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2024.3519634</pub-id><pub-id pub-id-type="pmid">40031181</pub-id></citation></ref>
<ref id="B18">
<label>18.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>X</given-names></name> <name><surname>Gao</surname> <given-names>P</given-names></name> <name><surname>Yu</surname> <given-names>T</given-names></name> <name><surname>Wang</surname> <given-names>F</given-names></name> <name><surname>Yuan</surname> <given-names>RY</given-names></name></person-group>. <article-title>CSWin-UNet: transformer UNet with cross-shaped windows for medical image segmentation</article-title>. <source>Inf Fusion</source>. (<year>2025</year>) <volume>113</volume>:<fpage>102634</fpage>. <pub-id pub-id-type="doi">10.1016/j.inffus.2024.102634</pub-id></citation>
</ref>
<ref id="B19">
<label>19.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>J</given-names></name> <name><surname>Mei</surname> <given-names>J</given-names></name> <name><surname>Li</surname> <given-names>X</given-names></name> <name><surname>Lu</surname> <given-names>Y</given-names></name> <name><surname>Yu</surname> <given-names>Q</given-names></name> <name><surname>Wei</surname> <given-names>Q</given-names></name> <etal/></person-group>. <article-title>TransUNet: rethinking the u-net architecture design for medical image segmentation through the lens of transformers</article-title>. <source>Med Image Anal</source>. (<year>2024</year>) <volume>97</volume>:<fpage>103280</fpage>. <pub-id pub-id-type="doi">10.1016/j.media.2024.103280</pub-id><pub-id pub-id-type="pmid">39096845</pub-id></citation></ref>
<ref id="B20">
<label>20.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y</given-names></name> <name><surname>Liu</surname> <given-names>H</given-names></name> <name><surname>Hu</surname> <given-names>Q</given-names></name></person-group>. <article-title>Transfuse: fusing transformers and CNNs for medical image segmentation</article-title>. In: <source>Medical Image Computing and Computer Assisted Intervention-MICCAI 2021: 24th International Conference, Strasbourg, France, September 27-October 1, 2021, Proceedings, Part I 24</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2021</year>). p. <fpage>14</fpage>&#x02013;<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-87193-2_2</pub-id></citation>
</ref>
<ref id="B21">
<label>21.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gu</surname> <given-names>A</given-names></name> <name><surname>Dao</surname> <given-names>T</given-names></name></person-group>. <article-title>Mamba: linear-time sequence modeling with selective state spaces</article-title>. <source>arXiv preprint arXiv:2312.00752</source>. (<year>2023</year>). <pub-id pub-id-type="doi">10.48550/arXiv.2312.00752</pub-id></citation>
</ref>
<ref id="B22">
<label>22.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Y</given-names></name> <name><surname>Tian</surname> <given-names>Y</given-names></name> <name><surname>Zhao</surname> <given-names>Y</given-names></name> <name><surname>Yu</surname> <given-names>H</given-names></name> <name><surname>Xie</surname> <given-names>L</given-names></name> <name><surname>Wang</surname> <given-names>Y</given-names></name> <etal/></person-group>. <article-title>VMamba: visual state space model</article-title>. <source>Adv Neural Inf Process Syst</source>. (<year>2024</year>) <volume>37</volume>:<fpage>103031</fpage>&#x02013;<lpage>63</lpage>.</citation>
</ref>
<ref id="B23">
<label>23.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oktay</surname> <given-names>O</given-names></name> <name><surname>Schlemper</surname> <given-names>J</given-names></name> <name><surname>Folgoc</surname> <given-names>LL</given-names></name> <name><surname>McDonagh</surname> <given-names>S</given-names></name> <name><surname>Kainz</surname> <given-names>B</given-names></name> <name><surname>Glocker</surname> <given-names>B</given-names></name> <etal/></person-group>. <article-title>Attention u-net: learning where to look for the pancreas</article-title>. <source>arXiv preprint arXiv:1804.03999</source>. (<year>2018</year>). <pub-id pub-id-type="doi">10.48550/arXiv.1804.03999</pub-id></citation>
</ref>
<ref id="B24">
<label>24.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Isensee</surname> <given-names>F</given-names></name> <name><surname>Jaeger</surname> <given-names>PF</given-names></name> <name><surname>Kohl</surname> <given-names>SA</given-names></name> <name><surname>Petersen</surname> <given-names>J</given-names></name> <name><surname>Maier-Hein</surname> <given-names>KH</given-names></name></person-group>. <article-title>nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation</article-title>. <source>Nat Methods</source>. (<year>2021</year>) <volume>18</volume>:<fpage>203</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1038/s41592-020-01008-z</pub-id><pub-id pub-id-type="pmid">33288961</pub-id></citation></ref>
<ref id="B25">
<label>25.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dai</surname> <given-names>D</given-names></name> <name><surname>Dong</surname> <given-names>C</given-names></name> <name><surname>Xu</surname> <given-names>S</given-names></name> <name><surname>Yan</surname> <given-names>Q</given-names></name> <name><surname>Li</surname> <given-names>Z</given-names></name> <name><surname>Zhang</surname> <given-names>C</given-names></name> <etal/></person-group>. <article-title>Ms RED: a novel multi-scale residual encoding and decoding network for skin lesion segmentation</article-title>. <source>Med Image Anal</source>. (<year>2022</year>) <volume>75</volume>:<fpage>102293</fpage>. <pub-id pub-id-type="doi">10.1016/j.media.2021.102293</pub-id><pub-id pub-id-type="pmid">34800787</pub-id></citation></ref>
<ref id="B26">
<label>26.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>S</given-names></name> <name><surname>Hu</surname> <given-names>Z</given-names></name> <name><surname>Tan</surname> <given-names>L</given-names></name></person-group>. <article-title>Res-ECA-UNet&#x0002B;&#x0002B;: an automatic segmentation model for ovarian tumor ultrasound images based on residual networks and channel attention mechanism</article-title>. <source>Front Med</source>. (<year>2025</year>) <volume>12</volume>:<fpage>1589356</fpage>. <pub-id pub-id-type="doi">10.3389/fmed.2025.1589356</pub-id><pub-id pub-id-type="pmid">40470046</pub-id></citation></ref>
<ref id="B27">
<label>27.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>Q</given-names></name> <name><surname>Ma</surname> <given-names>Z</given-names></name> <name><surname>He</surname> <given-names>N</given-names></name> <name><surname>Duan</surname> <given-names>W</given-names></name></person-group>. <article-title>DCSAU-Net: a deeper and more compact split-attention U-Net for medical image segmentation</article-title>. <source>Comput Biol Med</source>. (<year>2023</year>) <volume>154</volume>:<fpage>106626</fpage>. <pub-id pub-id-type="doi">10.1016/j.compbiomed.2023.106626</pub-id><pub-id pub-id-type="pmid">36736096</pub-id></citation></ref>
<ref id="B28">
<label>28.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Peng</surname> <given-names>Y</given-names></name> <name><surname>Chen</surname> <given-names>DZ</given-names></name> <name><surname>Sonka</surname> <given-names>M</given-names></name></person-group>. <article-title>U-net v2: rethinking the skip connections of u-net for medical image segmentation</article-title>. In: <source>2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI)</source>. <publisher-loc>Houston, TX</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2025</year>). p. <fpage>1</fpage>&#x02013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/ISBI60581.2025.10980742</pub-id></citation>
</ref>
<ref id="B29">
<label>29.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Feng</surname> <given-names>S</given-names></name> <name><surname>Wang</surname> <given-names>H</given-names></name> <name><surname>Han</surname> <given-names>C</given-names></name> <name><surname>Liu</surname> <given-names>Z</given-names></name> <name><surname>Zhang</surname> <given-names>H</given-names></name> <name><surname>Lan</surname> <given-names>R</given-names></name> <etal/></person-group>. <article-title>Weakly supervised gland segmentation with class semantic consistency and purified labels filtration</article-title>. <source>Proc AAAI Conf Artif Intell</source>. (<year>2025</year>) <volume>39</volume>:<fpage>2987</fpage>&#x02013;<lpage>95</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v39i3.32306</pub-id></citation>
</ref>
<ref id="B30">
<label>30.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J</given-names></name> <name><surname>Wei</surname> <given-names>L</given-names></name> <name><surname>Wang</surname> <given-names>L</given-names></name> <name><surname>Zhou</surname> <given-names>Q</given-names></name> <name><surname>Zhu</surname> <given-names>L</given-names></name> <name><surname>Qin</surname> <given-names>J</given-names></name></person-group>. <article-title>Boundary-aware transformers for skin lesion segmentation</article-title>. In: <source>Medical Image Computing and Computer Assisted Intervention-MICCAI 2021: 24th International Conference, Strasbourg, France, September 27-October 1, 2021, Proceedings, Part I 24</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2021</year>). p. <fpage>206</fpage>&#x02013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-87193-2_20</pub-id></citation>
</ref>
<ref id="B31">
<label>31.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>H</given-names></name> <name><surname>Chen</surname> <given-names>S</given-names></name> <name><surname>Chen</surname> <given-names>G</given-names></name> <name><surname>Wang</surname> <given-names>W</given-names></name> <name><surname>Lei</surname> <given-names>B</given-names></name> <name><surname>Wen</surname> <given-names>Z</given-names></name></person-group>. <article-title>FAT-Net: feature adaptive transformers for automated skin lesion segmentation</article-title>. <source>Med Image Anal</source>. (<year>2022</year>) <volume>76</volume>:<fpage>102327</fpage>. <pub-id pub-id-type="doi">10.1016/j.media.2021.102327</pub-id><pub-id pub-id-type="pmid">34923250</pub-id></citation></ref>
<ref id="B32">
<label>32.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shi</surname> <given-names>W</given-names></name> <name><surname>Xu</surname> <given-names>J</given-names></name> <name><surname>Gao</surname> <given-names>P</given-names></name></person-group>. <article-title>SSformer: a lightweight transformer for semantic segmentation</article-title>. In: <source>2022 IEEE 24th International Workshop on Multimedia Signal Processing (MMSP)</source>. <publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2022</year>). p. <fpage>1</fpage>&#x02013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/MMSP55362.2022.9949177</pub-id></citation>
</ref>
<ref id="B33">
<label>33.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>J</given-names></name></person-group>. <article-title>MedFusion-TransNet: multi-modal fusion via transformer for enhanced medical image segmentation</article-title>. <source>Front Med</source>. (<year>2025</year>) <volume>12</volume>:<fpage>1557449</fpage>. <pub-id pub-id-type="doi">10.3389/fmed.2025.1557449</pub-id><pub-id pub-id-type="pmid">40395236</pub-id></citation></ref>
<ref id="B34">
<label>34.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bakkouri</surname> <given-names>I</given-names></name> <name><surname>Bakkouri</surname> <given-names>S</given-names></name></person-group>. <article-title>UGS-M3F: unified gated swin transformer with multi-feature fully fusion for retinal blood vessel segmentation</article-title>. <source>BMC Med Imaging</source>. (<year>2025</year>) <volume>25</volume>:<fpage>77</fpage>. <pub-id pub-id-type="doi">10.1186/s12880-025-01616-1</pub-id><pub-id pub-id-type="pmid">40050753</pub-id></citation></ref>
<ref id="B35">
<label>35.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zeng</surname> <given-names>L</given-names></name> <name><surname>Zhu</surname> <given-names>M</given-names></name> <name><surname>Wu</surname> <given-names>K</given-names></name> <name><surname>Li</surname> <given-names>Z</given-names></name></person-group>. <article-title>Medical image segmentation via sparse coding decoder</article-title>. In: <source>ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>. <publisher-loc>Hyderabad, India</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2025</year>). p. <fpage>1</fpage>&#x02013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP49660.2025.10889260</pub-id></citation>
</ref>
<ref id="B36">
<label>36.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>J</given-names></name> <name><surname>Liu</surname> <given-names>H</given-names></name> <name><surname>Feng</surname> <given-names>Y</given-names></name> <name><surname>Xu</surname> <given-names>J</given-names></name> <name><surname>Zhao</surname> <given-names>L</given-names></name></person-group>. <article-title>CASF-Net: cross-attention and cross-scale fusion network for medical image segmentation</article-title>. <source>Comput Methods Programs Biomed</source>. (<year>2023</year>) <volume>229</volume>:<fpage>107307</fpage>. <pub-id pub-id-type="doi">10.1016/j.cmpb.2022.107307</pub-id><pub-id pub-id-type="pmid">36571889</pub-id></citation></ref>
<ref id="B37">
<label>37.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Iqbal</surname> <given-names>S</given-names></name> <name><surname>Khan</surname> <given-names>TM</given-names></name> <name><surname>Naqvi</surname> <given-names>SS</given-names></name> <name><surname>Naveed</surname> <given-names>A</given-names></name> <name><surname>Meijering</surname> <given-names>E</given-names></name></person-group>. <article-title>TBConvL-Net: a hybrid deep learning architecture for robust medical image segmentation</article-title>. <source>Pattern Recognit</source>. (<year>2025</year>) <volume>158</volume>:<fpage>111028</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2024.111028</pub-id></citation>
</ref>
<ref id="B38">
<label>38.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Perera</surname> <given-names>S</given-names></name> <name><surname>Erzurumlu</surname> <given-names>Y</given-names></name> <name><surname>Gulati</surname> <given-names>D</given-names></name> <name><surname>Yilmaz</surname> <given-names>A</given-names></name></person-group>. <article-title>MobileUNETR: a lightweight end-to-end hybrid vision transformer for efficient medical image segmentation</article-title>. In: <source>European Conference on Computer Vision</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2025</year>). p. <fpage>281</fpage>&#x02013;<lpage>99</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-031-91721-9_18</pub-id></citation>
</ref>
<ref id="B39">
<label>39.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Beliy</surname> <given-names>R</given-names></name> <name><surname>Gaziv</surname> <given-names>G</given-names></name> <name><surname>Hoogi</surname> <given-names>A</given-names></name> <name><surname>Strappini</surname> <given-names>F</given-names></name> <name><surname>Golan</surname> <given-names>T</given-names></name> <name><surname>Irani</surname> <given-names>M</given-names></name></person-group>. <article-title>From voxels to pixels and back: self-supervision in natural-image reconstruction from fMRI</article-title>. <source>Adv Neural Inf Process Syst</source>. (<year>2019</year>) <volume>32</volume>.</citation>
</ref>
<ref id="B40">
<label>40.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y</given-names></name> <name><surname>Hao</surname> <given-names>J</given-names></name> <name><surname>Zhou</surname> <given-names>B</given-names></name></person-group>. <article-title>Dual-domain multi-path self-supervised diffusion model for accelerated MRI reconstruction</article-title>. <source>arXiv preprint arXiv:2503.18836</source>. (<year>2025</year>). <pub-id pub-id-type="doi">10.48550/arXiv.2503.18836</pub-id></citation>
</ref>
<ref id="B41">
<label>41.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>L</given-names></name> <name><surname>Bentley</surname> <given-names>P</given-names></name> <name><surname>Mori</surname> <given-names>K</given-names></name> <name><surname>Misawa</surname> <given-names>K</given-names></name> <name><surname>Fujiwara</surname> <given-names>M</given-names></name> <name><surname>Rueckert</surname> <given-names>D</given-names></name></person-group>. <article-title>Self-supervised learning for medical image analysis using image context restoration</article-title>. <source>Med Image Anal</source>. (<year>2019</year>) <volume>58</volume>:<fpage>101539</fpage>. <pub-id pub-id-type="doi">10.1016/j.media.2019.101539</pub-id><pub-id pub-id-type="pmid">31374449</pub-id></citation></ref>
<ref id="B42">
<label>42.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z</given-names></name> <name><surname>Xu</surname> <given-names>R</given-names></name> <name><surname>Liu</surname> <given-names>M</given-names></name> <name><surname>Yan</surname> <given-names>Z</given-names></name> <name><surname>Zuo</surname> <given-names>W</given-names></name></person-group>. <article-title>Self-supervised image restoration with blurry and noisy pairs</article-title>. <source>Adv Neural Inf Process Syst</source>. (<year>2022</year>) <volume>35</volume>:<fpage>29179</fpage>&#x02013;<lpage>91</lpage>.</citation>
</ref>
<ref id="B43">
<label>43.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thakkar</surname> <given-names>JD</given-names></name> <name><surname>Bhatt</surname> <given-names>JS</given-names></name> <name><surname>Patra</surname> <given-names>SK</given-names></name></person-group>. <article-title>Self-supervised learning for medical image restoration: investigation and finding</article-title>. In: <source>International Conference on Machine Intelligence and Signal Processing</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2022</year>). p. <fpage>541</fpage>&#x02013;<lpage>552</lpage>. <pub-id pub-id-type="doi">10.1007/978-981-99-0047-3_46</pub-id></citation>
</ref>
<ref id="B44">
<label>44.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>S</given-names></name> <name><surname>Meng</surname> <given-names>W</given-names></name> <name><surname>Liu</surname> <given-names>C</given-names></name> <name><surname>Long</surname> <given-names>C</given-names></name> <name><surname>He</surname> <given-names>S</given-names></name></person-group>. <article-title>S4 FD: self-supervision-enhanced semisupervised fault diagnosis for complex industrial processes</article-title>. <source>IEEE Trans Ind Inform</source>. (<year>2025</year>) <volume>21</volume>:<fpage>3585</fpage>&#x02013;<lpage>94</lpage>. <pub-id pub-id-type="doi">10.1109/TII.2024.3523590</pub-id></citation>
</ref>
<ref id="B45">
<label>45.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mammadov</surname> <given-names>A</given-names></name> <name><surname>Folgoc</surname> <given-names>LL</given-names></name> <name><surname>Adam</surname> <given-names>J</given-names></name> <name><surname>Buronfosse</surname> <given-names>A</given-names></name> <name><surname>Hayem</surname> <given-names>G</given-names></name> <name><surname>Hocquet</surname> <given-names>G</given-names></name> <etal/></person-group>. <article-title>Self-supervision enhances instance-based multiple instance learning methods in digital pathology: a benchmark study</article-title>. <source>arXiv preprint arXiv:2505.01109</source>. (<year>2025</year>). <pub-id pub-id-type="doi">10.1117/1.JMI.12.6.061404</pub-id><pub-id pub-id-type="pmid">40475245</pub-id></citation></ref>
<ref id="B46">
<label>46.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koohbanani</surname> <given-names>NA</given-names></name> <name><surname>Unnikrishnan</surname> <given-names>B</given-names></name> <name><surname>Khurram</surname> <given-names>SA</given-names></name> <name><surname>Krishnaswamy</surname> <given-names>P</given-names></name> <name><surname>Rajpoot</surname> <given-names>N</given-names></name></person-group>. <article-title>Self-Path: self-supervision for classification of pathology images with limited annotations</article-title>. <source>IEEE Trans Med Imaging</source>. (<year>2021</year>) <volume>40</volume>:<fpage>2845</fpage>&#x02013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1109/TMI.2021.3056023</pub-id><pub-id pub-id-type="pmid">33523807</pub-id></citation></ref>
<ref id="B47">
<label>47.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>B</given-names></name> <name><surname>Dey</surname> <given-names>N</given-names></name> <name><surname>Schlemper</surname> <given-names>J</given-names></name> <name><surname>Salehi</surname> <given-names>SSM</given-names></name> <name><surname>Liu</surname> <given-names>C</given-names></name> <name><surname>Duncan</surname> <given-names>JS</given-names></name> <etal/></person-group>. <article-title>DSFormer: a dual-domain self-supervised transformer for accelerated multi-contrast MRI reconstruction</article-title>. In: <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)</source>. <publisher-loc>Los Alamitos, CA</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2023</year>). p. <fpage>4966</fpage>&#x02013;<lpage>75</lpage>. <pub-id pub-id-type="doi">10.1109/WACV56688.2023.00494</pub-id></citation>
</ref>
<ref id="B48">
<label>48.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhuang</surname> <given-names>J</given-names></name> <name><surname>Wu</surname> <given-names>L</given-names></name> <name><surname>Wang</surname> <given-names>Q</given-names></name> <name><surname>Fei</surname> <given-names>P</given-names></name> <name><surname>Vardhanabhuti</surname> <given-names>V</given-names></name> <name><surname>Luo</surname> <given-names>L</given-names></name> <etal/></person-group>. <article-title>MiM: mask in mask self-supervised pre-training for 3D medical image analysis</article-title>. <source>IEEE Trans Med Imaging</source>. (<year>2025</year>). <pub-id pub-id-type="doi">10.1109/TMI.2025.3564382</pub-id><pub-id pub-id-type="pmid">40279226</pub-id></citation></ref>
<ref id="B49">
<label>49.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z</given-names></name> <name><surname>Zhang</surname> <given-names>Y</given-names></name> <name><surname>Wang</surname> <given-names>B</given-names></name> <name><surname>Yang</surname> <given-names>Y</given-names></name> <name><surname>Cai</surname> <given-names>L</given-names></name></person-group>. <article-title>SFma-Unet: a mamba-based spatial-frequency fusion network for medical image segmentation</article-title>. In: <source>ICASSP 2025&#x02013;2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>. <publisher-loc>Hyderabad</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2025</year>). p. <fpage>1</fpage>&#x02013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP49660.2025.10889117</pub-id></citation>
</ref>
<ref id="B50">
<label>50.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S</given-names></name> <name><surname>Lin</surname> <given-names>Y</given-names></name> <name><surname>Liu</surname> <given-names>D</given-names></name> <name><surname>Wang</surname> <given-names>P</given-names></name> <name><surname>Zhou</surname> <given-names>B</given-names></name> <name><surname>Si</surname> <given-names>F</given-names></name></person-group>. <article-title>Frequency-enhanced lightweight vision mamba network for medical image segmentation</article-title>. <source>IEEE Trans Instrum Meas</source>. (<year>2025</year>) <volume>74</volume>:<fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1109/TIM.2025.3527526</pub-id></citation>
</ref>
<ref id="B51">
<label>51.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>J</given-names></name> <name><surname>Chen</surname> <given-names>K</given-names></name> <name><surname>Wu</surname> <given-names>X</given-names></name> <name><surname>Xu</surname> <given-names>Z</given-names></name> <name><surname>Wang</surname> <given-names>S</given-names></name> <name><surname>Zhang</surname> <given-names>Y</given-names></name></person-group>. <article-title>MSM-UNet: a medical image segmentation method based on wavelet transform and multi-scale Mamba-UNet</article-title>. <source>Expert Syst Appl</source>. (<year>2025</year>) <volume>288</volume>:<fpage>128241</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2025.128241</pub-id></citation>
</ref>
<ref id="B52">
<label>52.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>R</given-names></name> <name><surname>Liu</surname> <given-names>Y</given-names></name> <name><surname>Liang</surname> <given-names>P</given-names></name> <name><surname>Chang</surname> <given-names>Q</given-names></name></person-group>. <article-title>UltraLight VM-UNet: parallel vision mamba significantly reduces parameters for skin lesion segmentation</article-title>. <source>arXiv preprint arXiv:2403.20035</source>. (<year>2024</year>). <pub-id pub-id-type="doi">10.1016/j.patter.2025.101298</pub-id></citation>
</ref>
<ref id="B53">
<label>53.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cheng</surname> <given-names>Z</given-names></name> <name><surname>Guo</surname> <given-names>J</given-names></name> <name><surname>Zhang</surname> <given-names>J</given-names></name> <name><surname>Qi</surname> <given-names>L</given-names></name> <name><surname>Zhou</surname> <given-names>L</given-names></name> <name><surname>Shi</surname> <given-names>Y</given-names></name> <etal/></person-group>. <article-title>Mamba-Sea: a mamba-based framework with global-to-local sequence augmentation for generalizable medical image segmentation</article-title>. <source>IEEE Trans Med Imaging</source>. (<year>2025</year>). <pub-id pub-id-type="doi">10.1109/TMI.2025.3564765</pub-id><pub-id pub-id-type="pmid">40305245</pub-id></citation></ref>
<ref id="B54">
<label>54.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>M</given-names></name> <name><surname>Yu</surname> <given-names>Y</given-names></name> <name><surname>Jin</surname> <given-names>S</given-names></name> <name><surname>Gu</surname> <given-names>L</given-names></name> <name><surname>Ling</surname> <given-names>T</given-names></name> <name><surname>Tao</surname> <given-names>X</given-names></name></person-group>. <article-title>VM-UNET-V2: rethinking vision mamba UNet for medical image segmentation</article-title>. In: <source>International Symposium on Bioinformatics Research and Applications</source>. <publisher-loc>New York</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2024</year>). p. <fpage>335</fpage>&#x02013;<lpage>46</lpage>. <pub-id pub-id-type="doi">10.1007/978-981-97-5128-0_27</pub-id></citation>
</ref>
<ref id="B55">
<label>55.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>G</given-names></name> <name><surname>Huang</surname> <given-names>Q</given-names></name> <name><surname>Wang</surname> <given-names>W</given-names></name> <name><surname>Liu</surname> <given-names>L</given-names></name></person-group>. <article-title>Selective and multi-scale fusion mamba for medical image segmentation</article-title>. <source>Expert Syst Appl</source>. (<year>2025</year>) <volume>261</volume>:<fpage>125518</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2024.125518</pub-id></citation>
</ref>
<ref id="B56">
<label>56.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Khan</surname> <given-names>A</given-names></name> <name><surname>Asad</surname> <given-names>M</given-names></name> <name><surname>Benning</surname> <given-names>M</given-names></name> <name><surname>Roney</surname> <given-names>C</given-names></name> <name><surname>Slabaugh</surname> <given-names>G</given-names></name></person-group>. <article-title>CAMS: convolution and attention-free mamba-based cardiac image segmentation</article-title>. In: <source>2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)</source>. <publisher-loc>Tucson, AZ</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2025</year>). p. <fpage>1893</fpage>&#x02013;<lpage>903</lpage>. <pub-id pub-id-type="doi">10.1109/WACV61041.2025.00191</pub-id></citation>
</ref>
<ref id="B57">
<label>57.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sirinukunwattana</surname> <given-names>K</given-names></name> <name><surname>Pluim</surname> <given-names>JPW</given-names></name> <name><surname>Chen</surname> <given-names>H</given-names></name> <name><surname>Qi</surname> <given-names>X</given-names></name> <name><surname>Heng</surname> <given-names>PA</given-names></name> <name><surname>Guo</surname> <given-names>YB</given-names></name> <etal/></person-group>. <article-title>Gland segmentation in colon histology images: the GlaS challenge contest</article-title>. <source>Med Image Anal</source>. (<year>2017</year>) <volume>35</volume>:<fpage>489</fpage>&#x02013;<lpage>502</lpage>. <pub-id pub-id-type="doi">10.1016/j.media.2016.08.008</pub-id><pub-id pub-id-type="pmid">27614792</pub-id></citation></ref>
<ref id="B58">
<label>58.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gutman</surname> <given-names>D</given-names></name> <name><surname>Codella</surname> <given-names>N</given-names></name> <name><surname>Celebi</surname> <given-names>ME</given-names></name> <name><surname>Helba</surname> <given-names>B</given-names></name> <name><surname>Marchetti</surname> <given-names>M</given-names></name> <name><surname>Mishra</surname> <given-names>N</given-names></name> <etal/></person-group>. <article-title>ISIC challenge 2016: skin lesion analysis towards melanoma detection</article-title>. <source>arXiv preprint arXiv:1605.01397</source>. (<year>2016</year>). <pub-id pub-id-type="doi">10.48550/arXiv.1605.01397</pub-id></citation>
</ref>
<ref id="B59">
<label>59.</label>
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berseth</surname> <given-names>M</given-names></name></person-group>. <article-title>ISIC 2017-skin lesion analysis towards melanoma detection</article-title>. <source>arXiv preprint arXiv:1703.00523</source>. (<year>2017</year>). <pub-id pub-id-type="doi">10.48550/arXiv.1703.00523</pub-id></citation>
</ref>
<ref id="B60">
<label>60.</label>
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Conoci</surname> <given-names>S</given-names></name> <name><surname>Rundo</surname> <given-names>F</given-names></name> <name><surname>Petralia</surname> <given-names>S</given-names></name> <name><surname>Battiato</surname> <given-names>S</given-names></name></person-group>. <article-title>Advanced skin lesion discrimination pipeline for early melanoma cancer diagnosis towards PoC devices</article-title>. In: <source>2017 European Conference on Circuit Theory and Design (ECCTD)</source>. <publisher-loc>Catania</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2017</year>). p. <fpage>1</fpage>&#x02013;<lpage>4</lpage>. <pub-id pub-id-type="doi">10.1109/ECCTD.2017.8093310</pub-id></citation>
</ref>
</ref-list>
</back>
</article>