<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Environ. Sci.</journal-id>
<journal-title>Frontiers in Environmental Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Environ. Sci.</abbrev-journal-title>
<issn pub-type="epub">2296-665X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1515752</article-id>
<article-id pub-id-type="doi">10.3389/fenvs.2024.1515752</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Environmental Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A vision-language model for predicting potential distribution land of soybean double cropping</article-title>
<alt-title alt-title-type="left-running-head">Gao et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fenvs.2024.1515752">10.3389/fenvs.2024.1515752</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Gao</surname>
<given-names>Bei</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2876499/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/Writing - review &#x26; editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Liu</surname>
<given-names>Yuefeng</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Yanli</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Hongmei</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Li</surname>
<given-names>Meirong</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>He</surname>
<given-names>Wenli</given-names>
</name>
</contrib>
</contrib-group>
<aff>
<institution>Shaanxi Meteorological Service Center of Agricultural Remote Sensing and Economic Crops</institution>, <addr-line>Xi&#x2019;an</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2848045/overview">Minghan Cheng</ext-link>, Yangzhou University, China</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2304995/overview">Beata Calka</ext-link>, Military University of Technology in Warsaw, Poland</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1626908/overview">Yao Zhaosheng</ext-link>, Yangzhou University, China</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Yuefeng Liu, <email>godflys@163.com</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>01</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>12</volume>
<elocation-id>1515752</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>10</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>12</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2025 Gao, Liu, Li, Li, Li and He.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Gao, Liu, Li, Li, Li and He</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>Accurately predicting suitable areas for double-cropped soybeans under changing climatic conditions is critical for ensuring food security and optimizing land use. Traditional methods, relying on single-modal approaches such as remote sensing imagery or climate data in isolation, often fail to capture the complex interactions among environmental factors, leading to suboptimal predictions. Moreover, these approaches lack the ability to integrate multi-scale data and contextual information, limiting their applicability in diverse and dynamic environments.</p>
</sec>
<sec>
<title>Methods</title>
<p>To address these challenges, we propose AgriCLIP, a novel remote sensing vision-language model that integrates remote sensing imagery with textual data, such as climate reports and agricultural practices, to predict potential distribution areas of double-cropped soybeans under climate change. AgriCLIP employs advanced techniques including multi-scale data processing, self-supervised learning, and cross-modality feature fusion, enabling comprehensive analysis of factors influencing crop suitability.</p>
</sec>
<sec>
<title>Results and discussion</title>
<p>Extensive evaluations on four diverse remote sensing datasets (RSICap, RSIEval, MillionAID, and HRSID) demonstrate AgriCLIP&#x2019;s superior performance over state-of-the-art models. Notably, AgriCLIP achieves 97.54% accuracy on the RSICap dataset and outperforms competitors across metrics such as recall, F1 score, and AUC. Its efficiency is further highlighted by reduced computational demands compared to baseline methods. AgriCLIP&#x2019;s ability to seamlessly integrate visual and contextual information not only advances prediction accuracy but also provides interpretable insights for agricultural planning and climate adaptation strategies, offering a robust and scalable solution for addressing the challenges of food security in the context of global climate change.</p>
</sec>
</abstract>
<kwd-group>
<kwd>AgriCLIP</kwd>
<kwd>remote sensing</kwd>
<kwd>vision-language model</kwd>
<kwd>climate change</kwd>
<kwd>double-cropped soybeans</kwd>
<kwd>predicting distribution areas</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Big Data, AI, and the Environment</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Remote sensing image segmentation is a critical task in the field of remote sensing and geographic information systems, providing essential information for land cover classification, environmental monitoring, and urban planning (<xref ref-type="bibr" rid="B42">Zhou et al., 2024</xref>). The segmentation of remote sensing images is not only necessary for the accurate interpretation of vast amounts of data but also crucial for the effective management and utilization of natural resources. Given the increasing availability and resolution of remote sensing data, the need for advanced segmentation techniques has become more pronounced (<xref ref-type="bibr" rid="B39">Yuan et al., 2023</xref>). These techniques not only allow for the precise delineation of objects and regions within an image but also enable the extraction of meaningful patterns and features that are vital for a wide range of applications (<xref ref-type="bibr" rid="B35">Xu et al., 2021</xref>). Moreover, with the growing challenges posed by climate change, deforestation, and urbanization, the ability to monitor and analyze changes in the Earth&#x2019;s surface with high accuracy is more important than ever. This necessity has driven significant advancements in the field, leading to the development of various methods over the years, each with its strengths and limitations (<xref ref-type="bibr" rid="B24">Qi et al., 2022</xref>).</p>
<p>In early research on remote sensing image segmentation, traditional three-dimensional reconstruction techniques were widely used. These methods aimed to reconstruct the spatial structure of the Earth&#x2019;s surface through stereoscopic image pairs or photogrammetric techniques (<xref ref-type="bibr" rid="B3">Bigolin and Talamini, 2024</xref>). By leveraging geometric principles, traditional 3D reconstruction methods could segment images based on the relative positions and orientations of objects, providing detailed and accurate representations of the terrain (<xref ref-type="bibr" rid="B18">Li et al., 2024</xref>). However, these methods were computationally complex and required precise calibration and alignment of images, making them less practical for large-scale or real-time applications (<xref ref-type="bibr" rid="B16">Jung et al., 2024</xref>). Additionally, traditional 3D reconstruction techniques faced challenges in handling complex and heterogeneous landscapes, particularly when mixed pixels and uneven illumination conditions were present, which could significantly reduce the accuracy of the results (<xref ref-type="bibr" rid="B31">Tovihoudji et al., 2024</xref>). To overcome these issues, researchers began exploring alternative approaches that could offer more robust and scalable solutions. Compared to the limitations of manual and semi-automated methods, these emerging approaches demonstrated superior performance in processing large-scale data and achieving real-time capabilities, paving the way for further advancements in remote sensing image segmentation (<xref ref-type="bibr" rid="B16">Jung et al., 2024</xref>). By integrating advanced technologies like machine learning and deep learning, these methods exhibited higher efficiency and accuracy across various application scenarios, especially in handling complex landscapes, where they showed greater robustness and adaptability.</p>
<p>In response to the limitations of traditional 3D reconstruction methods, the field gradually shifted towards statistical learning and machine learning-based approaches. These methods introduced a more flexible and data-driven framework for remote sensing image segmentation, allowing for the incorporation of statistical models and machine learning algorithms to improve segmentation accuracy. Statistical learning methods, such as Markov Random Fields (MRF) and Conditional Random Fields (CRF), were employed to model the spatial dependencies between neighboring pixels, enabling more accurate segmentation by considering the contextual information within the image (<xref ref-type="bibr" rid="B27">Shaar et al., 2024</xref>). Machine learning algorithms, including Support Vector Machines (SVM), Random Forests, and k-Nearest Neighbors (k-NN), were also utilized to classify pixels based on their spectral and spatial features, offering improved performance over traditional methods (<xref ref-type="bibr" rid="B20">Ling et al., 2022</xref>). Despite their advantages, these methods still faced challenges, such as the need for extensive feature engineering and the inability to capture complex, non-linear relationships within the data. Furthermore, the performance of machine learning-based segmentation methods heavily depended on the quality and quantity of the training data, which could be a limiting factor in scenarios where labeled data was scarce or expensive to obtain (<xref ref-type="bibr" rid="B26">Rai et al., 2020</xref>).</p>
<p>To address the limitations of statistical learning and traditional machine learning methods, the advent of deep learning and pre-trained models brought a paradigm shift in remote sensing image segmentation. Deep learning-based methods, particularly Convolutional Neural Networks (CNNs), have revolutionized the field by automatically learning hierarchical representations of the data, enabling the segmentation of images with unprecedented accuracy and efficiency. Unlike traditional methods, deep learning approaches do not require manual feature extraction, as they can learn complex features directly from the raw pixel values through multiple layers of abstraction (<xref ref-type="bibr" rid="B41">Zhou et al., 2023</xref>). The introduction of pre-trained models, such as U-Net (<xref ref-type="bibr" rid="B2">Benchabana et al., 2023</xref>), ResNet (<xref ref-type="bibr" rid="B11">Gomes et al., 2021</xref>), and more recently, Vision Transformers (ViTs) (<xref ref-type="bibr" rid="B7">Dong et al., 2022</xref>), has further enhanced the segmentation capabilities by leveraging large-scale datasets and transfer learning techniques. These models have demonstrated remarkable performance in various remote sensing tasks, including land cover classification, object detection, and change detection, significantly reducing the need for extensive labeled datasets and improving generalization to new and unseen environments (<xref ref-type="bibr" rid="B19">Li et al., 2023</xref>). However, despite their success, deep learning-based segmentation methods are not without challenges. They require substantial computational resources and are often sensitive to hyperparameter tuning and network architecture design. Moreover, the black-box nature of deep learning models can make them difficult to interpret, which is a critical consideration in applications where explainability is as important as accuracy (<xref ref-type="bibr" rid="B40">Zhao et al., 2021</xref>).</p>
<p>To address the limitations of the aforementioned models, particularly their challenges in handling the complex and dynamic nature of environmental factors in agricultural tasks, we propose AgriCLIP: A Remote Sensing Vision-Language Model for Predicting Potential Distribution Areas of Double-Cropped Soybeans Under Climate Change. Our model specifically overcomes the shortcomings of traditional 3D reconstruction methods, which struggle with computational intensity and the segmentation of heterogeneous landscapes, by using multi-scale data processing to efficiently handle diverse and complex environmental conditions. Additionally, AgriCLIP addresses the limitations of statistical learning and traditional machine learning approaches, which often require extensive feature engineering and large labeled datasets, by leveraging self-supervised learning techniques that reduce the dependency on labeled data and enable the model to learn rich feature representations directly from the data. Furthermore, AgriCLIP mitigates the challenges associated with deep learning models, such as the need for substantial computational resources and sensitivity to hyperparameter tuning, by integrating pre-trained models that are optimized for remote sensing tasks, allowing for more efficient training and better generalization. Importantly, our model also tackles the issue of the black-box nature of deep learning approaches by combining visual and textual data, making the predictions more interpretable and contextually grounded. This combination of visual and contextual information allows AgriCLIP to provide a more comprehensive analysis, which is crucial for accurately predicting the potential distribution areas of double-cropped soybeans under varying climatic conditions. 
By addressing these key limitations, AgriCLIP offers a robust, scalable, and task-specific solution that is better suited to the demands of this agricultural application, marking a significant advancement in remote sensing image segmentation and prediction. The main contributions of this work are as follows:<list list-type="simple">
<list-item>
<p>
<inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mo>&#x2022;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> AgriCLIP introduces a novel cross-modality fusion module that seamlessly integrates multi-scale remote sensing imagery with textual data, enabling the model to capture complex environmental interactions and provide more accurate predictions for agricultural tasks under changing climatic conditions.</p>
</list-item>
<list-item>
<p>
<inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:mo>&#x2022;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> The method is highly versatile, capable of adapting to various scenarios, from large-scale agricultural regions to specific localized conditions, while maintaining high efficiency and generalizability, making it suitable for a wide range of remote sensing applications.</p>
</list-item>
<list-item>
<p>
<inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:mo>&#x2022;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> Extensive experiments demonstrate that AgriCLIP significantly outperforms state-of-the-art models across multiple benchmarks, which confirms its effectiveness and robustness in predicting double-cropped soybean distribution areas.</p>
</list-item>
</list>
</p>
</sec>
<sec id="s2">
<title>2 Related work</title>
<sec id="s2-1">
<title>2.1 Object-based segmentation</title>
<p>Object-Based Image Analysis (OBIA) has been extensively utilized for remote sensing image segmentation, offering a structured approach that groups pixels into meaningful objects for analysis. OBIA&#x2019;s strength lies in its ability to incorporate spatial context and relationships, enabling the segmentation of high-resolution images where individual objects like buildings, roads, or vegetation clusters consist of multiple pixels with similar characteristics (<xref ref-type="bibr" rid="B8">Du et al., 2020</xref>; <xref ref-type="bibr" rid="B17">Junior et al., 2023</xref>). This makes OBIA particularly valuable for tasks requiring detailed spatial and contextual information (<xref ref-type="bibr" rid="B1">Azhand et al., 2024</xref>). Recent advancements in OBIA have highlighted its flexibility across different scales and data types, which aligns closely with the goals of this study (<xref ref-type="bibr" rid="B15">Huang et al., 2020</xref>). However, the challenges of parameter sensitivity and manual intervention remain significant, necessitating further development in automated and scalable segmentation techniques (<xref ref-type="bibr" rid="B6">Cui et al., 2023</xref>; <xref ref-type="bibr" rid="B23">Norman et al., 2021</xref>).</p>
</sec>
<sec id="s2-2">
<title>2.2 Hybrid GIS and remote sensing</title>
<p>The integration of multimodal data has become an increasingly important approach in remote sensing image segmentation, allowing for the combination of different types of information to improve segmentation accuracy and robustness. Multimodal models leverage the strengths of various data sources, such as optical images, LiDAR data, synthetic aperture radar (SAR), and textual information, to provide a more comprehensive understanding of the environment (<xref ref-type="bibr" rid="B28">Sun et al., 2021</xref>). This approach is particularly valuable in remote sensing, where no single data source can fully capture the complexities of the Earth&#x2019;s surface (<xref ref-type="bibr" rid="B13">He et al., 2023</xref>). Multimodal models have evolved to incorporate multiple data types into a unified framework, enhancing the ability to segment images with greater precision. For instance, combining optical imagery with LiDAR data allows for the integration of spectral and elevation information, leading to more accurate segmentation in complex terrains (<xref ref-type="bibr" rid="B22">Luo et al., 2024</xref>). Similarly, the fusion of SAR and optical data can provide complementary information, where SAR captures structural features that are often obscured in optical images due to weather conditions or lighting (<xref ref-type="bibr" rid="B25">Quan et al., 2024</xref>). In recent years, the incorporation of textual data, such as climate reports or land use descriptions, has further advanced the capabilities of multimodal models, enabling the interpretation of remote sensing images in contextually rich environments (<xref ref-type="bibr" rid="B36">Yan et al., 2023</xref>). The main advantage of multimodal models lies in their ability to capture and integrate diverse aspects of the observed scene, leading to more informed segmentation decisions (<xref ref-type="bibr" rid="B4">Cheng et al., 2021</xref>). 
By leveraging multiple data sources, these models can mitigate the limitations inherent in any single modality, such as the spectral ambiguity in optical images or the speckle noise in SAR data. However, the development of multimodal models also presents significant challenges. One of the primary difficulties is the alignment and synchronization of different data types, which often come in varying resolutions, formats, and coordinate systems (<xref ref-type="bibr" rid="B33">Wang et al., 2022</xref>). Moreover, the fusion of multimodal data can be computationally intensive, requiring sophisticated algorithms to effectively combine the information without losing critical details. Another challenge is the design of models that can effectively learn from and generalize across multimodal inputs, which often involves complex architectures and extensive training (<xref ref-type="bibr" rid="B9">Gammans et al., 2024</xref>).</p>
</sec>
<sec id="s2-3">
<title>2.3 Multimodal models</title>
<p>Multimodal models have emerged as powerful tools for integrating diverse data sources, including optical imagery, LiDAR, synthetic aperture radar (SAR), and textual information, to enhance segmentation accuracy. These models are particularly relevant for addressing the limitations of single-modality approaches, which often struggle to capture the full complexity of environmental features (<xref ref-type="bibr" rid="B28">Sun et al., 2021</xref>; <xref ref-type="bibr" rid="B13">He et al., 2023</xref>). For example, combining optical and LiDAR data allows for the integration of spectral and elevation information, a key requirement for robust segmentation in heterogeneous landscapes (<xref ref-type="bibr" rid="B22">Luo et al., 2024</xref>). The incorporation of textual data, such as climate reports or land-use descriptions, has further expanded the capabilities of multimodal models, providing contextually rich interpretations of remote sensing images (<xref ref-type="bibr" rid="B36">Yan et al., 2023</xref>). These techniques align with the methodological framework of this study, where cross-modality feature fusion is employed to achieve more accurate predictions (<xref ref-type="bibr" rid="B4">Cheng et al., 2021</xref>). The advantages of multimodal models include their ability to mitigate the limitations of individual modalities and their potential for delivering context-aware segmentation (<xref ref-type="bibr" rid="B25">Quan et al., 2024</xref>). However, the challenges of data alignment, computational demands, and architectural complexity remain areas of active research (<xref ref-type="bibr" rid="B33">Wang et al., 2022</xref>; <xref ref-type="bibr" rid="B10">Gao et al., 2024</xref>). The proposed work builds on these concepts by introducing a vision-language framework that addresses these challenges through self-supervised learning and advanced feature fusion mechanisms, thereby pushing the boundaries of current multimodal approaches in remote sensing.</p>
</sec>
</sec>
<sec sec-type="methods" id="s3">
<title>3 Methodology</title>
<sec id="s3-1">
<title>3.1 Overview</title>
<p>In this work, we propose an advanced remote sensing vision-language model, designed specifically for predicting potential distribution areas of double-cropped soybeans under changing climate conditions. The proposed model integrates remote sensing data with sophisticated language models to enhance the prediction accuracy and robustness across different climatic scenarios. The model architecture leverages multi-scale data processing, self-supervised learning (SSL) techniques, and cross-modality feature fusion, allowing it to process and analyze diverse data sources efficiently. The overall data flow is structured into several key modules: data preprocessing, feature extraction, and prediction, all of which are intricately connected through a shared representation learning framework (as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Diagram of the structure of AgriCLIP. The diagram shows the data flow of image and text inputs. Images are processed through the Multi-Scale Feature Extractor (MSFE) module and a shared Vision Transformer, while text is processed through a Text Encoder and the Adaptive Consistency Module (ACM). These are then fused in the Cross-Modality Fusion Layer (CMFL) to generate a similarity matrix, which is compared with the ground truth to compute the loss, followed by output through a prediction layer.</p>
</caption>
<graphic xlink:href="fenvs-12-1515752-g001.tif"/>
</fig>
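<p>The similarity matrix mentioned in the caption of <xref ref-type="fig" rid="F1">Figure 1</xref> can be illustrated with a minimal Python sketch. It is not the AgriCLIP implementation: it assumes a CLIP-style setup in which the matched image-text pair of each sample lies on the diagonal of the matrix, and the function names (l2_normalize, similarity_matrix, diagonal_cross_entropy), the temperature value, and the toy embeddings are hypothetical choices made only for illustration.</p>
<preformat>
```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding."""
    images = [l2_normalize(e) for e in image_embs]
    texts = [l2_normalize(e) for e in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]


def diagonal_cross_entropy(sim, temperature=0.07):
    """Mean cross-entropy with the matched pair (the diagonal) as target.

    Assumption: a CLIP-style objective; the temperature is illustrative.
    """
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)
        # Log-sum-exp computed stably by subtracting the largest logit.
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_norm - logits[i]
    return loss / len(sim)


# Toy batch of three image-text pairs with 2-d embeddings (made-up values).
image_embs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
text_embs = [[0.9, 0.0], [0.0, 0.8], [-0.7, 0.1]]
sim = similarity_matrix(image_embs, text_embs)
loss = diagonal_cross_entropy(sim)
```
</preformat>
<p>In this toy batch each image embedding is most similar to its own text embedding, so the largest entry of every row sits on the diagonal and the resulting loss is close to zero.</p>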
<p>The data preprocessing module handles various input formats and resolutions, ensuring that the model can effectively integrate remote sensing images and climate-related textual data. In the feature extraction stage, the model employs a multi-scale masked autoencoder (MAE) inspired by recent advancements in remote sensing image analysis. This MAE is further augmented with a novel scale-consistency mechanism that enforces consistency across different scales of input data, which is particularly useful in handling the inherent variability in remote sensing data. The prediction module is designed to fuse the extracted features from both visual and textual inputs, utilizing a cross-attention mechanism that allows the model to weigh the importance of different modalities dynamically. This module outputs a probabilistic map indicating the potential distribution areas for double-cropped soybeans, accounting for various climate change scenarios.</p>
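<p>As a rough illustration of the cross-attention fusion described above, the following minimal Python sketch lets image-token features attend over text-token features using scaled dot-product attention. It is a simplified stand-in, not the model&#x2019;s actual fusion layer: the function names (softmax, cross_attention), the toy dimensions, the residual connection, and the omission of learned query, key, and value projections are all assumptions made for brevity.</p>
<preformat>
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, text_tokens):
    """Let each image token attend over all text tokens.

    Hypothetical simplification: queries are the raw image tokens and
    keys/values are the raw text tokens (no learned projections).
    Returns (fused_tokens, attention_weights).
    """
    dim = len(image_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for query in image_tokens:
        # Scaled dot product between this image token and every text token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in text_tokens]
        attn = softmax(scores)
        # Attention-weighted sum of the text tokens (used here as values).
        context = [sum(a * value[d] for a, value in zip(attn, text_tokens))
                   for d in range(dim)]
        # Residual connection keeps the visual content in the fused token.
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return fused, weights


# Toy example: two image tokens and three text tokens in a 4-d space.
image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_tokens = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]]
fused, weights = cross_attention(image_tokens, text_tokens)
```
</preformat>
<p>Each row of weights sums to one and shows how strongly a given image token draws on each text token, which is one simple way to inspect how the two modalities are weighted against each other.</p>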
<p>In the following sections, we delve into the specific components of our model. <xref ref-type="sec" rid="s3-2">Section 3.2</xref> details the preliminaries, where we formalize the problem and set the mathematical foundation. <xref ref-type="sec" rid="s3-3">Section 3.3</xref> introduces the new model architecture, highlighting the innovations that differentiate it from existing approaches. Finally, in <xref ref-type="sec" rid="s3-4">Section 3.4</xref>, we discuss the integration of domain-specific strategies that enhance the model&#x2019;s predictive capabilities.</p>
</sec>
<sec id="s3-2">
<title>3.2 Preliminaries</title>
<p>In this section, we formalize the problem of predicting potential distribution areas for double-cropped soybeans under climate change using a remote sensing vision-language model. Let <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> represent the dataset, where <inline-formula id="inf5">
<mml:math id="m5">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> denotes a remote sensing image of height <inline-formula id="inf6">
<mml:math id="m6">
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, width <inline-formula id="inf7">
<mml:math id="m7">
<mml:mrow>
<mml:mi>W</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf8">
<mml:math id="m8">
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> spectral channels. <inline-formula id="inf9">
<mml:math id="m9">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> corresponds to the associated textual data, providing contextual information such as climate conditions, soil types, and agricultural practices. The label <inline-formula id="inf10">
<mml:math id="m10">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> indicates the presence or absence of double-cropped soybeans in the corresponding geographical area.</p>
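<p>To make the notation concrete, the sketch below represents one dataset element, consisting of an image, its associated text, and a binary label, as a small Python record. The class name Sample, the helpers image_shape and validate_sample, and the toy values are hypothetical and serve only to illustrate the image shape and label domain defined above.</p>
<preformat>
```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One (X_i, T_i, y_i) triple; names are illustrative, not from the paper."""
    image: List[List[List[float]]]  # H x W x C nested lists of pixel values
    text: str                       # contextual description T_i
    label: int                      # y_i: 1 = double-cropped soybeans present


def image_shape(image):
    """Return (H, W, C) for a rectangular nested-list image."""
    return len(image), len(image[0]), len(image[0][0])


def validate_sample(sample):
    """Check that the image is rectangular and the label is binary."""
    height, width, channels = image_shape(sample.image)
    for row in sample.image:
        if len(row) != width:
            raise ValueError("rows must all have W pixels")
        for pixel in row:
            if len(pixel) != channels:
                raise ValueError("pixels must all have C channels")
    if sample.label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    return height, width, channels


# Toy 2 x 3 image with 4 spectral channels and a made-up description.
toy_image = [[[0.1, 0.2, 0.3, 0.4] for _ in range(3)] for _ in range(2)]
sample = Sample(image=toy_image,
                text="Warm, humid summer; loam soil; soybean after wheat.",
                label=1)
shape = validate_sample(sample)
```
</preformat>
<p>A full dataset in this representation would simply be a list of N such records.</p>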
<p>The goal is to learn a function <inline-formula id="inf11">
<mml:math id="m11">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2192;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> parameterized by <inline-formula id="inf12">
<mml:math id="m12">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf13">
<mml:math id="m13">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> is the predicted probability of double-cropping soybeans in a given area, based on both the remote sensing image <inline-formula id="inf14">
<mml:math id="m14">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and the textual data <inline-formula id="inf15">
<mml:math id="m15">
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. The function <inline-formula id="inf16">
<mml:math id="m16">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is trained to minimize a loss function <inline-formula id="inf17">
<mml:math id="m17">
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> over the dataset <inline-formula id="inf18">
<mml:math id="m18">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. To achieve this, we adopt a multi-modal fusion strategy where the remote sensing images and textual data are processed through separate feature extractors, denoted as <inline-formula id="inf19">
<mml:math id="m19">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf20">
<mml:math id="m20">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, respectively. These feature extractors map the inputs to a shared latent space <inline-formula id="inf21">
<mml:math id="m21">
<mml:mrow>
<mml:mi mathvariant="script">Z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, such that <inline-formula id="inf22">
<mml:math id="m22">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2192;</mml:mo>
<mml:mi mathvariant="script">Z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf23">
<mml:math id="m23">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2192;</mml:mo>
<mml:mi mathvariant="script">Z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf24">
<mml:math id="m24">
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> represents the length of the textual input and <inline-formula id="inf25">
<mml:math id="m25">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> the dimensionality of the text embedding. The fused features in the latent space <inline-formula id="inf26">
<mml:math id="m26">
<mml:mrow>
<mml:mi mathvariant="script">Z</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are then used to make the final prediction, <inline-formula id="inf27">
<mml:math id="m27">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22a4;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf28">
<mml:math id="m28">
<mml:mrow>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the sigmoid activation function and <inline-formula id="inf29">
<mml:math id="m29">
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> represents the weights for the linear combination of features. Given the nature of remote sensing data, which often includes multi-scale images with different spatial resolutions, we need to ensure that our model effectively integrates this multi-scale information. Let <inline-formula id="inf30">
<mml:math id="m30">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> denote the image at scale <inline-formula id="inf31">
<mml:math id="m31">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf32">
<mml:math id="m32">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> indexes the available scales. The model handles these multi-scale inputs by enforcing scale consistency in the feature space. Specifically, the loss function <inline-formula id="inf33">
<mml:math id="m33">
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> includes a term that penalizes discrepancies between features extracted at different scales, ensuring that the learned representations are consistent and robust across various resolutions. Additionally, the textual data <inline-formula id="inf34">
<mml:math id="m34">
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is processed using a transformer-based model that captures the contextual dependencies within the text, allowing the model to weigh different parts of the textual input according to their relevance to the prediction task. The final prediction is then based on a cross-attention mechanism that aligns the visual and textual features, ensuring that the model&#x2019;s predictions are informed by both modalities in a coherent manner. The model is trained using a combination of supervised learning, based on the labeled examples in <inline-formula id="inf35">
<mml:math id="m35">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and self-supervised learning, leveraging unlabeled data through techniques such as masked language modeling for the textual data and masked image modeling for the remote sensing images. This hybrid approach allows the model to effectively learn from the available data, even when labeled examples are scarce.</p>
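<p>The prediction rule above can be illustrated with a minimal sketch. It assumes that the bracket in the prediction formula denotes concatenation of the two feature vectors and that each extractor has already produced a one-dimensional embedding; the feature values, dimensions, and the helper name <italic>fuse_and_predict</italic> are hypothetical and are not taken from our implementation.</p>
<preformat>
```python
import numpy as np

def sigmoid(a):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def fuse_and_predict(z_x, z_t, w):
    """Return y_hat = sigmoid(w^T [z_x, z_t]) for one sample.

    z_x, z_t: 1-D vectors standing in for phi_x(X) and phi_t(T).
    w: 1-D weight vector of length len(z_x) + len(z_t).
    """
    fused = np.concatenate([z_x, z_t])  # assumed reading of [phi_x(X), phi_t(T)]
    return float(sigmoid(w @ fused))

# Toy, made-up features (hypothetical values).
rng = np.random.default_rng(0)
z_x = rng.normal(size=4)  # stand-in for the image embedding
z_t = rng.normal(size=3)  # stand-in for the text embedding
w = rng.normal(size=7)    # stand-in for the learned weights W
y_hat = fuse_and_predict(z_x, z_t, w)
print(f"predicted probability: {y_hat:.3f}")
```
</preformat>
<p>With all-zero features the sketch returns 0.5, the point of maximal uncertainty of the sigmoid, which is a convenient sanity check.</p>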
<p>Formally, the training objective can be expressed as <xref ref-type="disp-formula" rid="e1">Formula 1</xref>:<disp-formula id="e1">
<mml:math id="m36">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>arg</mml:mi>
<mml:munder>
<mml:mrow>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3bb;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>consistency</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b2;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>self</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>supervised</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>where <inline-formula id="inf36">
<mml:math id="m37">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>consistency</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> enforces the scale consistency across multi-scale inputs, and <inline-formula id="inf37">
<mml:math id="m38">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>self</mml:mtext>
<mml:mo>-</mml:mo>
<mml:mtext>supervised</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> incorporates the self-supervised objectives for learning robust feature representations. The coefficients <inline-formula id="inf38">
<mml:math id="m39">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf39">
<mml:math id="m40">
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are hyperparameters that balance the contributions of the supervised, consistency, and self-supervised terms in the loss function and thereby control the trade-offs among these objectives. The values of <inline-formula id="inf40">
<mml:math id="m41">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf41">
<mml:math id="m42">
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> were determined empirically through a systematic hyperparameter tuning process. Specifically, we performed grid search experiments on the validation set, testing a range of plausible values for these coefficients. The goal was to identify the combination of <inline-formula id="inf42">
<mml:math id="m43">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf43">
<mml:math id="m44">
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> that optimizes the model&#x2019;s performance across key evaluation metrics such as accuracy, F1 score, and recall.</p>
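<p>As a rough illustration of <xref ref-type="disp-formula" rid="e1">Formula 1</xref> and of the grid search over the two coefficients, the following sketch evaluates the combined loss for fixed predictions and selects the best coefficient pair. Binary cross-entropy as the supervised loss, the candidate grids, and the toy validation score are all assumptions made for the example; in practice the score would come from training the model and evaluating it on the validation set.</p>
<preformat>
```python
import itertools
import numpy as np

def bce(y_hat, y, eps=1e-7):
    """Binary cross-entropy, an assumed choice for the supervised loss."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def total_loss(y_hat, y, l_consistency, l_self_supervised, lam, beta):
    """Formula 1 at fixed parameters: summed per-sample loss plus weighted terms."""
    supervised = float(np.sum(bce(y_hat, y)))
    return supervised + lam * l_consistency + beta * l_self_supervised

def grid_search(evaluate, lam_grid, beta_grid):
    """Return the (lam, beta) pair with the highest validation score."""
    return max(itertools.product(lam_grid, beta_grid),
               key=lambda pair: evaluate(*pair))

# Hypothetical validation score that peaks at lam=0.1, beta=0.5. A real run
# would train with the given coefficients and return a metric such as F1.
def toy_validation_score(lam, beta):
    return -((lam - 0.1) ** 2 + (beta - 0.5) ** 2)

best_lam, best_beta = grid_search(toy_validation_score,
                                  [0.01, 0.1, 1.0], [0.1, 0.5, 1.0])
print(best_lam, best_beta)
```
</preformat>
<p>Exhaustive search is practical here because only two coefficients are tuned; the cost grows with the product of the two grid sizes.</p>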
</sec>
<sec id="s3-3">
<title>3.3 Adaptive multi-scale consistency network</title>
<p>In this subsection, we introduce the Adaptive Multi-Scale Consistency Network (AMSCN), a novel model architecture designed to address the challenges of multi-scale data fusion in remote sensing applications, specifically for predicting the distribution of double-cropped soybeans under varying climate conditions. The AMSCN extends the traditional Masked Autoencoder (MAE) framework by integrating an adaptive scale-consistency mechanism, which ensures that the features extracted from different scales of input data are not only consistent but also adaptive to the varying spatial resolutions and spectral characteristics inherent in remote sensing imagery. The model is composed of three key components: (1) a Multi-Scale Feature Extractor (MSFE), (2) an Adaptive Consistency Module (ACM), and (3) a Cross-Modality Fusion Layer (CMFL). These components work in synergy to extract, align, and integrate features from both the remote sensing images and the associated textual data.</p>
<sec id="s3-3-1">
<title>3.3.1 Multi-scale feature extractor (MSFE)</title>
<p>The Multi-Scale Feature Extractor (MSFE) is a critical component for processing remote sensing images at various scales. These multi-scale inputs are denoted as <inline-formula id="inf44">
<mml:math id="m45">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, where each <inline-formula id="inf45">
<mml:math id="m46">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> represents an image at scale <inline-formula id="inf46">
<mml:math id="m47">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. For each scale <inline-formula id="inf47">
<mml:math id="m48">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, the MSFE utilizes a shared Vision Transformer (ViT) backbone, allowing the model to process different scale inputs while maintaining computational efficiency. The shared architecture enables the extraction of <italic>scale-invariant features</italic>, which are crucial in remote sensing tasks due to the diverse resolutions present in such images.</p>
<p>The feature extraction process for each scale <inline-formula id="inf48">
<mml:math id="m49">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> can be formalized as follows. Given an input image <inline-formula id="inf49">
<mml:math id="m50">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, the MSFE processes it through a feature extractor <inline-formula id="inf50">
<mml:math id="m51">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, which is parameterized by <inline-formula id="inf51">
<mml:math id="m52">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> for each scale. The result is a fixed-dimensional vector <inline-formula id="inf52">
<mml:math id="m53">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, which encodes the scale-specific information (<xref ref-type="disp-formula" rid="e2">Formula 2</xref>):<disp-formula id="e2">
<mml:math id="m54">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo>;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>
</p>
<p>This embedding <inline-formula id="inf53">
<mml:math id="m55">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> contains the key features extracted from each image at its corresponding scale.</p>
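<p>To make the idea of a fixed-dimensional embedding per scale concrete, the sketch below applies one set of weights to images of different resolutions. It is a simplified stand-in for the feature extractor in <xref ref-type="disp-formula" rid="e2">Formula 2</xref>: a linear patch embedding followed by mean pooling replaces the ViT backbone, a single shared projection is assumed for all scales, and the image sizes, patch size, and function names are hypothetical.</p>
<preformat>
```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    image = image[: rows * patch, : cols * patch]  # drop ragged borders
    blocks = image.reshape(rows, patch, cols, patch, c)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * c)

def extract_features(image, proj, patch=4):
    """Toy stand-in for phi_x: linear patch embedding, then mean pooling."""
    tokens = patchify(image, patch) @ proj  # (num_patches, dim)
    return tokens.mean(axis=0)              # fixed-size vector z

rng = np.random.default_rng(0)
channels, patch, dim = 3, 4, 8
proj = rng.normal(size=(patch * patch * channels, dim))  # shared by all scales

# Hypothetical multi-scale inputs X^(1), X^(2), X^(3) at different resolutions.
images = [rng.normal(size=(size, size, channels)) for size in (16, 32, 64)]
features = [extract_features(x, proj, patch) for x in images]
print([z.shape for z in features])
```
</preformat>
<p>Because pooling removes the dependence on the number of patches, every scale yields a vector of the same length, which is what allows the embeddings to be compared directly across scales.</p>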
<p>The backbone of the MSFE is a shared transformer architecture inspired by Vision Transformers (ViTs). This shared transformer consists of multiple stages, as illustrated in <xref ref-type="fig" rid="F2">Figure 2A</xref>. At each stage, a Patch Embedding layer reduces the resolution of the input while increasing the number of feature channels, and the embedded patches are then processed by a set of shared transformer blocks.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Structure of the shared transformer. <bold>(A)</bold> Image features are extracted in multiple stages through Patch Embedding and shared Transformer modules; the feature resolution is progressively reduced while the number of channels is increased. <bold>(B)</bold> Window spatial transformation and the attention mechanism allow the model to process information at different scales. <bold>(C)</bold> The features are further processed through multiple layers to produce the final output for tasks such as classification.</p>
</caption>
<graphic xlink:href="fenvs-12-1515752-g002.tif"/>
</fig>
<p>The transformer operations in the shared layers can be mathematically described as follows. For a given input sequence <inline-formula id="inf54">
<mml:math id="m56">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">patch</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, the self-attention mechanism computes a weighted sum of all positions (<xref ref-type="disp-formula" rid="e3">Formula 3</xref>):<disp-formula id="e3">
<mml:math id="m57">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">attn</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>Softmax</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="bold">Q</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>where <inline-formula id="inf55">
<mml:math id="m58">
<mml:mrow>
<mml:mi mathvariant="bold">Q</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf56">
<mml:math id="m59">
<mml:mrow>
<mml:mi mathvariant="bold">K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf57">
<mml:math id="m60">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> represent the queries, keys, and values, which are linear transformations of the input patches. After the attention calculation, a feed-forward network (FFN) is applied to each position in the sequence (<xref ref-type="disp-formula" rid="e4">Formula 4</xref>):<disp-formula id="e4">
<mml:math id="m61">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>FFN</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">attn</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
<p>This process is repeated for <inline-formula id="inf58">
<mml:math id="m62">
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> transformer blocks at each stage, progressively refining the features as the resolution decreases and the number of channels increases.</p>
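<p>The operations in <xref ref-type="disp-formula" rid="e3">Formula 3</xref> and <xref ref-type="disp-formula" rid="e4">Formula 4</xref> can be sketched as follows. The example is a single-head illustration without residual connections or normalization, since neither appears in the two formulas; the ReLU activation in the feed-forward network, the dimensions, and the random weights are assumptions.</p>
<preformat>
```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Formula 3: Softmax(Q K^T / sqrt(d_k)) V, with Q, K, V linear in x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = k.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def ffn(z, w1, b1, w2, b2):
    """Formula 4: position-wise feed-forward network (ReLU is assumed)."""
    return np.maximum(z @ w1 + b1, 0.0) @ w2 + b2

def transformer_stage(x, blocks):
    """Apply the attention and FFN pair once per block."""
    for params in blocks:
        x = ffn(self_attention(x, *params["attn"]), *params["ffn"])
    return x

rng = np.random.default_rng(0)
n_tokens, dim, hidden, n_blocks = 6, 8, 16, 2
x_patch = rng.normal(size=(n_tokens, dim))  # hypothetical embedded patches
blocks = [
    {
        "attn": [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)],
        "ffn": (rng.normal(size=(dim, hidden)) * 0.1, np.zeros(hidden),
                rng.normal(size=(hidden, dim)) * 0.1, np.zeros(dim)),
    }
    for _ in range(n_blocks)
]
z_out = transformer_stage(x_patch, blocks)
print(z_out.shape)
```
</preformat>
<p>Each row of the softmax output sums to one, so every output token is a convex combination of the value vectors before the feed-forward network is applied.</p>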
<p>In addition to the shared transformer architecture, the MSFE incorporates a multi-scale attention mechanism, as depicted in <xref ref-type="fig" rid="F2">Figure 2B</xref>. The attention mechanism operates over <italic>windowed patches</italic> of the image. A pooling operation first divides the image into default windows. These windows are projected into a latent space, where a spatial transform mechanism captures the spatial relationships between the patches.</p>
<p>The attention mechanism for a window is given by <xref ref-type="disp-formula" rid="e5">Formula 5</xref>:<disp-formula id="e5">
<mml:math id="m63">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>Softmax</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">K</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
<p>The spatial transformation adjusts the window positions and allows the model to integrate information from multiple scales, ensuring that features from different regions of the image are processed appropriately.</p>
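<p>A minimal sketch of the windowed attention in <xref ref-type="disp-formula" rid="e5">Formula 5</xref> is given below. It assumes fixed, non-overlapping square windows on a token grid whose height and width are divisible by the window size, and it omits the spatial transformation of the window positions; the grid size, window size, and projection shapes are hypothetical.</p>
<preformat>
```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def window_partition(tokens, win):
    """Split an (H, W, d) token grid into (num_windows, win * win, d)."""
    h, w, d = tokens.shape
    grid = tokens.reshape(h // win, win, w // win, win, d)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, d)

def window_attention(tokens, win, w_q, w_k, w_v):
    """Formula 5 applied independently inside every window."""
    windows = window_partition(tokens, win)            # (n, win*win, d)
    q, k, v = windows @ w_q, windows @ w_k, windows @ w_v
    d_k = k.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (n, win*win, win*win)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 8, 16))  # hypothetical 8 x 8 grid of 16-d tokens
w_q = rng.normal(size=(16, 8)) * 0.1
w_k = rng.normal(size=(16, 8)) * 0.1
w_v = rng.normal(size=(16, 16)) * 0.1
out = window_attention(tokens, 4, w_q, w_k, w_v)
print(out.shape)
```
</preformat>
<p>Restricting attention to windows reduces the cost from quadratic in the number of tokens of the whole image to quadratic in the number of tokens per window, at the price of needing an additional mechanism to exchange information between windows.</p>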
<p>Finally, the output features from the MSFE are passed through multiple layers, as illustrated in <xref ref-type="fig" rid="F2">Figure 2C</xref>. These layers include layer normalization, multi-head attention, and feed-forward layers. The final output <inline-formula id="inf59">
<mml:math id="m64">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is the representation used for task-specific outputs, such as classification, segmentation, or detection.</p>
<p>The final output <inline-formula id="inf60">
<mml:math id="m65">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is computed through repeated applications of the following operations (<xref ref-type="disp-formula" rid="e6">Formula 6</xref>):<disp-formula id="e6">
<mml:math id="m66">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>FFN</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mtext>Norm</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mtext>VSA</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
<p>Here, VSA refers to the Vision Self-Attention module, and FFN represents the feed-forward network. The VSA output is added to the input through a residual connection, and the sum is normalized before it enters the FFN, which helps stabilize training. The output <inline-formula id="inf61">
<mml:math id="m67">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is passed to a task-specific classification head for downstream tasks (see <xref ref-type="fig" rid="F3">Figure 3</xref>).</p>
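<p>The composition in <xref ref-type="disp-formula" rid="e6">Formula 6</xref> can be sketched as a single reusable block. Single-head attention is used here as an assumed stand-in for the VSA module, the layer normalization has no learned scale or shift, and the dimensions and weights are hypothetical.</p>
<preformat>
```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    """Normalize each token to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def vsa(x, w_q, w_k, w_v):
    """Single-head self-attention, an assumed stand-in for the VSA module."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

def ffn(z, w1, w2):
    """Two-layer feed-forward network with an assumed ReLU activation."""
    return np.maximum(z @ w1, 0.0) @ w2

def block(x, attn_params, ffn_params):
    """Formula 6: FFN(Norm(X + VSA(X)))."""
    return ffn(layer_norm(x + vsa(x, *attn_params)), *ffn_params)

rng = np.random.default_rng(0)
n_tokens, dim, hidden = 5, 8, 16
x = rng.normal(size=(n_tokens, dim))
attn_params = [rng.normal(size=(dim, dim)) * 0.1 for _ in range(3)]
ffn_params = (rng.normal(size=(dim, hidden)) * 0.1,
              rng.normal(size=(hidden, dim)) * 0.1)

z = x
for _ in range(3):  # repeated application of the same block
    z = block(z, attn_params, ffn_params)
print(z.shape)
```
</preformat>
<p>The value projection must preserve the token dimension so that the attention output can be added to the input before normalization.</p>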
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Structure of the Vision Self-Attention module. The inputs are weighted by the attention weights, and the resulting scores produce a weighted output of the input information.</p>
</caption>
<graphic xlink:href="fenvs-12-1515752-g003.tif"/>
</fig>
<p>The MSFE leverages a shared transformer architecture across multiple scales to efficiently capture features at different levels of resolution. The multi-scale attention mechanism further enhances the model&#x2019;s ability to process complex, large-scale remote sensing data.</p>
</sec>
<sec id="s3-3-2">
<title>3.3.2 Adaptive consistency module (ACM)</title>
<p>The Adaptive Consistency Module (ACM) is a key component of the Adaptive Multi-Scale Consistency Network (AMSCN), designed to ensure consistency across features extracted from multiple scales. In remote sensing and other vision tasks, where images can be captured at different resolutions, it is crucial to align feature representations across these scales. The ACM achieves this by introducing both a scale-consistency loss and a scale attention mechanism, which dynamically adjusts the importance of different scales based on their relevance to the prediction task.</p>
<p>The primary function of the ACM is to enforce consistency between features extracted from different scales. This is accomplished by minimizing the discrepancy between feature representations from distinct scales. To achieve this, the ACM introduces a scale-consistency loss, denoted as <inline-formula id="inf62">
<mml:math id="m68">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>scale</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, which encourages features from different scales to be similar while maintaining the ability to differentiate scale-specific information when necessary.</p>
<p>Given the feature representations <inline-formula id="inf63">
<mml:math id="m69">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf64">
<mml:math id="m70">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> for two different scales <inline-formula id="inf65">
<mml:math id="m71">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf66">
<mml:math id="m72">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, the scale-consistency loss is defined as the <italic>mean squared error</italic> (MSE) between these features (<xref ref-type="disp-formula" rid="e7">Formula 7</xref>):<disp-formula id="e7">
<mml:math id="m73">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>scale</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2260;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mfenced open="&#x2016;" close="&#x2016;">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>where <inline-formula id="inf67">
<mml:math id="m74">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> denotes the total number of scales. This loss encourages the network to align the features across scales by penalizing differences between the features extracted from any two scales. The normalization factor <inline-formula id="inf68">
<mml:math id="m75">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula> averages the penalty over all ordered pairs of distinct scales, so that the magnitude of the scale-consistency loss does not grow with the number of scales.</p>
<p>This formulation promotes the learning of robust, scale-invariant features while still allowing the model to capture unique scale-specific information as needed for particular tasks.</p>
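<p>A small worked example of <xref ref-type="disp-formula" rid="e7">Formula 7</xref> is sketched below. The three two-dimensional feature vectors are made-up values chosen so that the result can be checked by hand, and the function name is hypothetical.</p>
<preformat>
```python
import itertools
import numpy as np

def scale_consistency_loss(features):
    """Formula 7: squared L2 distance averaged over ordered pairs of scales."""
    s = len(features)
    if s in (0, 1):
        raise ValueError("at least two scales are required")
    total = sum(float(np.sum((a - b) ** 2))
                for a, b in itertools.permutations(features, 2))
    return total / (s * (s - 1))

# Hypothetical embeddings z^(1), z^(2), z^(3) for S = 3 scales.
features = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
loss = scale_consistency_loss(features)
print(f"{loss:.4f}")
```
</preformat>
<p>The squared distances between the three pairs are 1, 4, and 5; each unordered pair is counted twice in the sum, giving 20, and division by S(S - 1) = 6 yields approximately 3.3333. The loss is zero exactly when all scale embeddings coincide.</p>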
<p>In addition to ensuring feature consistency across scales, the ACM dynamically adjusts the importance of each scale during the feature fusion process through a Scale Attention Mechanism. This mechanism computes attention scores for each scale, allowing the model to emphasize the most relevant scale for a given input and task. The scale attention score <inline-formula id="inf69">
<mml:math id="m76">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> for scale <inline-formula id="inf70">
<mml:math id="m77">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is computed using the following softmax formulation (<xref ref-type="disp-formula" rid="e8">Formula 8</xref>):<disp-formula id="e8">
<mml:math id="m78">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>exp</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(8)</label>
</disp-formula>where <inline-formula id="inf71">
<mml:math id="m79">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is a learnable weight matrix applied to the feature representation <inline-formula id="inf72">
<mml:math id="m80">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> of scale <inline-formula id="inf73">
<mml:math id="m81">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. This weight matrix transforms the features into a score space, which is then normalized using the softmax function to obtain the attention weights <inline-formula id="inf74">
<mml:math id="m82">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>. These attention weights determine the contribution of each scale to the final feature representation.</p>
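<p>The softmax in Formula 8 can be sketched as below. The sketch assumes that the learnable weight acts as a single row vector, so that each scale receives one scalar score; the names <monospace>scale_attention</monospace> and <monospace>w_a</monospace> and all numeric values are hypothetical.</p>
<preformat>
```python
import math


def scale_attention(features, w_a):
    """Softmax attention weights over scales.

    Assumption: `w_a` is a row vector, so each scale gets one scalar score.
    """
    scores = [sum(w * x for w, x in zip(w_a, z_s)) for z_s in features]
    # Subtracting the maximum leaves the softmax unchanged and avoids overflow.
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy features for three scales
w_a = [0.5, -0.5]                          # hypothetical learned weights
alpha = scale_attention(z, w_a)
```
</preformat>
<p>The resulting weights are positive and sum to one, so the scale with the largest score receives the largest share of the fused representation.</p>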
<p>Once the attention scores <inline-formula id="inf75">
<mml:math id="m83">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> are computed for each scale, the final scale-consistent feature representation is obtained by taking a weighted sum of the scale-specific features (<xref ref-type="disp-formula" rid="e9">Formula 9</xref>):<disp-formula id="e9">
<mml:math id="m84">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>final</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
<label>(9)</label>
</disp-formula>
</p>
<p>Here, <inline-formula id="inf76">
<mml:math id="m85">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> represents the feature vector corresponding to scale <inline-formula id="inf77">
<mml:math id="m86">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf78">
<mml:math id="m87">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is the attention score computed for that scale. This weighted combination allows the network to adaptively focus on the most relevant scales while still leveraging information from all scales. By dynamically adjusting the importance of each scale based on the input, the ACM ensures that the final feature representation <inline-formula id="inf79">
<mml:math id="m88">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>final</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is both robust and flexible, capturing important multi-scale patterns.</p>
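<p>Because the attention weights are positive and sum to one, Formula 9 is a convex combination of the scale-specific features. A minimal sketch, with the hypothetical helper <monospace>fuse_scales</monospace> and illustrative weights that are not taken from the experiments, is given below.</p>
<preformat>
```python
def fuse_scales(features, alpha):
    """Attention-weighted sum of scale-specific feature vectors."""
    dim = len(features[0])
    return [
        sum(a * z_s[d] for a, z_s in zip(alpha, features))
        for d in range(dim)
    ]


z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy features for three scales
alpha = [0.5, 0.25, 0.25]                  # illustrative attention weights
z_final = fuse_scales(z, alpha)
```
</preformat>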
<p>To improve the robustness of the ACM, additional regularization terms can be introduced to further align the features across scales while preserving discriminative power. One such regularization term can be the inter-scale diversity loss, which encourages diversity between the feature representations at different scales. This can be defined as <xref ref-type="disp-formula" rid="e10">Formula 10</xref>:<disp-formula id="e10">
<mml:math id="m89">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>div</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2260;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:munder>
</mml:mstyle>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo>&#x22c5;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<label>(10)</label>
</disp-formula>
</p>
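<p>Each summand is one minus the cosine similarity between the features of two scales, so the term measures how far the scale-specific representations are from being collinear. Regularizing this quantity alongside the alignment objective is intended to let the features from different scales retain a level of diversity, which is crucial for capturing unique scale-specific information. By combining the scale-consistency loss <inline-formula id="inf80">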
<mml:math id="m90">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>scale</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> with the inter-scale diversity loss <inline-formula id="inf81">
<mml:math id="m91">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>div</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, the model can achieve a balanced representation that is both consistent and diverse across scales.</p>
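<p>To make the behavior of Formula 10 concrete, the following sketch evaluates the term for two extreme cases. It assumes non-zero feature vectors and at least two scales; the function names and toy inputs are hypothetical.</p>
<preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_term(features):
    """Average of (1 - cosine similarity) over ordered pairs of distinct scales.

    Assumption: at least two scales and no all-zero feature vector.
    """
    num_scales = len(features)
    total = 0.0
    for s, z_s in enumerate(features):
        for s_prime, z_sp in enumerate(features):
            if s != s_prime:
                total += 1.0 - cosine_similarity(z_s, z_sp)
    return total / (num_scales * (num_scales - 1))


parallel = [[1.0, 2.0], [2.0, 4.0]]    # same direction at both scales
orthogonal = [[1.0, 0.0], [0.0, 1.0]]  # unrelated directions
term_parallel = diversity_term(parallel)
term_orthogonal = diversity_term(orthogonal)
```
</preformat>
<p>Parallel features give a value close to zero and orthogonal features give a value of one, which shows that the term depends only on the direction of the features and not on their magnitude.</p>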
</sec>
<sec id="s3-3-3">
<title>3.3.3 Cross-modality fusion layer (CMFL)</title>
<p>The Cross-Modality Fusion Layer (CMFL) is the component of the Adaptive Multi-Scale Consistency Network (AMSCN) that integrates scale-consistent visual features with contextual information from associated textual data. In applications such as remote sensing, visual data (e.g., satellite images) often need to be complemented with textual information (e.g., crop reports, weather conditions, or geographic descriptions). The CMFL performs this cross-modal fusion with a transformer-based approach that aligns and merges the information from the two modalities.</p>
<p>The textual data, denoted as <inline-formula id="inf82">
<mml:math id="m92">
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, is first processed by a transformer-based text encoder <inline-formula id="inf83">
<mml:math id="m93">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, parameterized by <inline-formula id="inf84">
<mml:math id="m94">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. This encoder extracts meaningful representations from the text, transforming the input textual sequence into a set of feature vectors. The output of the text encoder is a sequence of textual features <inline-formula id="inf85">
<mml:math id="m95">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf86">
<mml:math id="m96">
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> indexes the tokens in the textual sequence. Formally, this process can be written as <xref ref-type="disp-formula" rid="e11">Formula 11</xref>:<disp-formula id="e11">
<mml:math id="m97">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
<label>(11)</label>
</disp-formula>where <inline-formula id="inf87">
<mml:math id="m98">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf88">
<mml:math id="m99">
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> is the length of the textual sequence.</p>
<p>The CMFL employs a cross-attention mechanism to align the visual features, extracted by the visual backbone, with the contextual information from the textual data. The goal is to allow the model to focus on relevant text features for each visual feature. The visual features, denoted as <inline-formula id="inf89">
<mml:math id="m100">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>final</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>, are the scale-consistent features of <xref ref-type="disp-formula" rid="e9">Formula 9</xref>, computed from the outputs of the Multi-Scale Feature Extractor (MSFE). The cross-attention mechanism computes an alignment score between each visual feature and each textual feature.</p>
<p>Let <inline-formula id="inf90">
<mml:math id="m101">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf91">
<mml:math id="m102">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represent the query vector for the <inline-formula id="inf92">
<mml:math id="m103">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th visual feature and the key vector for the <inline-formula id="inf93">
<mml:math id="m104">
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th textual feature, respectively. These are computed as follows (<xref ref-type="disp-formula" rid="e12">Formula 12</xref>):<disp-formula id="e12">
<mml:math id="m105">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>final</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
<label>(12)</label>
</disp-formula>
</p>
<p>Here, <inline-formula id="inf94">
<mml:math id="m106">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf95">
<mml:math id="m107">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are learnable weight matrices used to project the visual and textual features into a shared latent space. The <italic>cross-attention score</italic> <inline-formula id="inf96">
<mml:math id="m108">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> between the <inline-formula id="inf97">
<mml:math id="m109">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th visual feature and the <inline-formula id="inf98">
<mml:math id="m110">
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th textual feature is then computed using the dot product followed by a softmax normalization (<xref ref-type="disp-formula" rid="e13">Formula 13</xref>):<disp-formula id="e13">
<mml:math id="m111">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>exp</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22a4;</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:munderover>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>exp</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22a4;</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(13)</label>
</disp-formula>
</p>
<p>This attention score represents the relevance of the <inline-formula id="inf99">
<mml:math id="m112">
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th textual feature for the <inline-formula id="inf100">
<mml:math id="m113">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th visual feature, allowing the model to attend to the most relevant parts of the text for each visual feature.</p>
<p>Once the cross-attention scores <inline-formula id="inf101">
<mml:math id="m114">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> are computed, the final fused feature for the <inline-formula id="inf102">
<mml:math id="m115">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th visual feature is obtained by taking a weighted sum of the value vectors corresponding to the textual features. The value vector <inline-formula id="inf103">
<mml:math id="m116">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> for each textual feature is computed as <xref ref-type="disp-formula" rid="e14">Formula 14</xref>:<disp-formula id="e14">
<mml:math id="m117">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
<label>(14)</label>
</disp-formula>where <inline-formula id="inf104">
<mml:math id="m118">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is another learnable weight matrix. The final fused feature <inline-formula id="inf105">
<mml:math id="m119">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>fused</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> for the <inline-formula id="inf106">
<mml:math id="m120">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th visual feature is then obtained by summing the value vectors weighted by the cross-attention scores (<xref ref-type="disp-formula" rid="e15">Formula 15</xref>):<disp-formula id="e15">
<mml:math id="m121">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>fused</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
<label>(15)</label>
</disp-formula>
</p>
<p>This fusion process ensures that each visual feature is enhanced by the relevant textual information, resulting in a more contextually informed representation.</p>
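<p>The cross-attention steps of Formulas 12 to 15 can be sketched for a single visual feature as follows. The projection matrices are set to the identity purely for illustration, no scaling of the dot product is applied (matching Formula 13 as written), and all function names and numeric values are hypothetical.</p>
<preformat>
```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def cross_attention(z_i, text_feats, w_q, w_k, w_v):
    """Fuse one visual feature with a sequence of textual features."""
    q = matvec(w_q, z_i)                            # query (Formula 12)
    keys = [matvec(w_k, h) for h in text_feats]     # keys (Formula 12)
    values = [matvec(w_v, h) for h in text_feats]   # values (Formula 14)
    logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(logits)  # shift for numerical stability, softmax is unchanged
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    beta = [e / norm for e in exps]                 # scores (Formula 13)
    dim = len(values[0])
    fused = [sum(b * v[d] for b, v in zip(beta, values)) for d in range(dim)]
    return beta, fused                              # fusion (Formula 15)


identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned projections
z_i = [2.0, 0.0]                      # toy visual feature
h_t = [[1.0, 0.0], [0.0, 1.0]]        # toy textual features for two tokens
beta, h_fused = cross_attention(z_i, h_t, identity, identity, identity)
```
</preformat>
<p>In this toy case the first token is better aligned with the visual query, so it receives the larger attention score and dominates the fused feature.</p>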
<p>After the cross-modality fusion, the fused features <inline-formula id="inf107">
<mml:math id="m122">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>fused</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> are passed through a final prediction layer to generate the output for the task at hand. For example, in the context of predicting the probability of double-cropped soybeans in a target area, a binary classification layer can be applied, resulting in a predicted probability <inline-formula id="inf108">
<mml:math id="m123">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
<p>The overall training objective for the AMSCN minimizes a combined loss function with three main components: (i) the prediction loss <inline-formula id="inf109">
<mml:math id="m124">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, the binary cross-entropy loss for the prediction task, which penalizes incorrect predictions of the target label; (ii) the scale-consistency loss <inline-formula id="inf110">
<mml:math id="m125">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>scale</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, which ensures that features from different scales are aligned and consistent; and (iii) the cross-modality alignment loss <inline-formula id="inf111">
<mml:math id="m126">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>cross</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, which encourages effective alignment between the visual and textual features during the cross-attention fusion process.</p>
<p>The total loss function is expressed as <xref ref-type="disp-formula" rid="e16">Formula 16</xref>:<disp-formula id="e16">
<mml:math id="m127">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>AMSCN</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3bb;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>scale</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>cross</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
<label>(16)</label>
</disp-formula>where <inline-formula id="inf112">
<mml:math id="m128">
<mml:mrow>
<mml:mi>&#x3bb;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf113">
<mml:math id="m129">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are hyperparameters that control the relative contributions of the scale-consistency and cross-modality alignment losses, respectively. These hyperparameters can be tuned based on the specific task and dataset to achieve optimal performance.</p>
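<p>Formula 16 is a weighted sum of three scalar losses. The sketch below shows how the two weights trade off the auxiliary terms against the prediction loss; the default weights and the example loss values are hypothetical and are not the settings used in the experiments.</p>
<preformat>
```python
def total_loss(pred_loss, scale_loss, cross_loss, lam=0.1, gamma=0.1):
    """Combined objective: prediction loss plus two weighted auxiliary losses.

    `lam` and `gamma` are placeholder defaults, not values from the paper.
    """
    return pred_loss + lam * scale_loss + gamma * cross_loss


# Hypothetical loss values for one training step.
combined = total_loss(1.0, 2.0, 4.0, lam=0.5, gamma=0.25)
```
</preformat>
<p>Setting either weight to zero removes the corresponding auxiliary term, which is a simple way to run an ablation of that component.</p>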
</sec>
</sec>
<sec id="s3-4">
<title>3.4 Hybrid learning and robust optimization</title>
<p>To improve the predictive performance of the Adaptive Multi-Scale Consistency Network (AMSCN), we combine hybrid learning with robust optimization. These enhancements are designed to improve generalization when data are incomplete or noisy, which is common in real-world remote sensing and climate applications.</p>
<sec id="s3-4-1">
<title>3.4.1 Hybrid learning approach</title>
<p>The AMSCN employs a hybrid learning strategy that combines supervised learning and self-supervised learning (SSL) to make effective use of both labeled and unlabeled data. This combination allows the model to perform well when labeled data is limited, which is often the case in remote sensing applications, and improves its generalization and robustness across varying geographical and climatic conditions, which is crucial for tasks such as identifying regions suitable for double-cropping soybeans.</p>
<p>The supervised component of this hybrid strategy is guided by the binary cross-entropy loss function, denoted as <xref ref-type="disp-formula" rid="e17">Formula 17</xref>:<disp-formula id="e17">
<mml:math id="m130">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>log</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mi>log</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(17)</label>
</disp-formula>where <inline-formula id="inf114">
<mml:math id="m131">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the ground truth label indicating whether a region is suitable for double-cropping, and <inline-formula id="inf115">
<mml:math id="m132">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the model&#x2019;s prediction for the <inline-formula id="inf116">
<mml:math id="m133">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th sample in the dataset <inline-formula id="inf117">
<mml:math id="m134">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, which consists of <inline-formula id="inf118">
<mml:math id="m135">
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> labeled samples. This loss function trains the model to effectively classify regions into suitable or unsuitable categories based on the available labeled data.</p>
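<p>As an illustration of Formula 17, the following minimal Python sketch computes the mean binary cross-entropy over a small batch. The function name <monospace>binary_cross_entropy</monospace>, the clipping constant <monospace>eps</monospace>, and the toy labels and predictions are assumptions made for this example; they are not taken from the AMSCN implementation.</p>
<preformat>
```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy (Formula 17) over N labeled samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 so the logarithms stay finite
        # (eps is an assumption of this sketch, not a value from the paper).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Toy example: 1 = suitable for double cropping, 0 = unsuitable.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # predictions close to the labels
uncertain = [0.5, 0.5, 0.5, 0.5]   # uninformative predictions
loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)
```
</preformat>
<p>In this toy case the uninformative predictions yield a loss of log 2 per sample, whereas predictions that agree with the labels yield a smaller loss, which is the behavior the supervised objective rewards.</p>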
<p>In contrast, the self-supervised learning (SSL) component uses a masked image modeling (MIM) strategy inspired by the Masked Autoencoder (MAE) framework. The goal of MIM is to learn a rich and robust set of feature representations from unlabeled data by exploiting the inherent structure of the remote sensing imagery. In this approach, portions of the input image are randomly masked, and the model is tasked with reconstructing the missing parts using the visible portions of the image, thereby encouraging the model to learn the underlying patterns and semantics.</p>
<p>The self-supervised loss function, <inline-formula id="inf119">
<mml:math id="m136">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>SSL</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, is formulated as (<xref ref-type="disp-formula" rid="e18">Formula 18</xref>):<disp-formula id="e18">
<mml:math id="m137">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>SSL</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msup>
<mml:mrow>
<mml:mfenced open="&#x2016;" close="&#x2016;">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>masked</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext>reconstructed</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(18)</label>
</disp-formula>where <inline-formula id="inf120">
<mml:math id="m138">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>masked</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> represents the masked version of the <inline-formula id="inf121">
<mml:math id="m139">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th input image, <inline-formula id="inf122">
<mml:math id="m140">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext>reconstructed</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> denotes the corresponding reconstruction produced by the model, and <inline-formula id="inf123">
<mml:math id="m141">
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the total number of masked samples used for self-supervised learning. This reconstruction process helps the model capture meaningful feature representations from the raw imagery, which is particularly valuable in scenarios where obtaining labeled data is expensive or time-consuming.</p>
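<p>The masked reconstruction objective of Formula 18 can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: images are represented as nested lists of patches, the reconstruction error is evaluated only on the hidden patches and against their original pixel values (the convention of the Masked Autoencoder framework), and the helper names <monospace>random_patch_mask</monospace> and <monospace>masked_reconstruction_loss</monospace> are hypothetical. The 50% masking ratio follows the experimental setup reported in this article.</p>
<preformat>
```python
import random

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a patch hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def masked_reconstruction_loss(originals, reconstructions, masks):
    """Average over M samples of the squared L2 error on hidden patches."""
    total = 0.0
    for x, x_hat, mask in zip(originals, reconstructions, masks):
        # Only hidden patches contribute (MAE convention, assumed here).
        total += sum(
            (a - b) ** 2
            for patch, patch_hat, hidden in zip(x, x_hat, mask) if hidden
            for a, b in zip(patch, patch_hat)
        )
    return total / len(originals)

rng = random.Random(0)
# Two toy "images", each made of 4 patches with 2 pixel values per patch.
originals = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
             [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4], [0.3, 0.2]]]
masks = [random_patch_mask(4, 0.5, rng) for _ in originals]
perfect = [[list(p) for p in img] for img in originals]   # ideal decoder output
zeros = [[[0.0, 0.0] for _ in img] for img in originals]  # trivial decoder output
loss_perfect = masked_reconstruction_loss(originals, perfect, masks)
loss_zeros = masked_reconstruction_loss(originals, zeros, masks)
```
</preformat>
<p>A perfect reconstruction gives a loss of zero, whereas a decoder that outputs zeros everywhere is penalized in proportion to the energy of the hidden patches.</p>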
<p>By integrating both supervised and self-supervised objectives into the training process, the AMSCN effectively learns from a mix of labeled and unlabeled data. The overall loss function for training the model can thus be expressed as a weighted sum of the two components (<xref ref-type="disp-formula" rid="e19">Formula 19</xref>):<disp-formula id="e19">
<mml:math id="m142">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>total</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>SSL</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(19)</label>
</disp-formula>where <inline-formula id="inf124">
<mml:math id="m143">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf125">
<mml:math id="m144">
<mml:mrow>
<mml:mi>&#x3b4;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are hyperparameters that balance the contributions of the supervised and self-supervised losses during training. This combination enables the model to generalize better across different environments and enhances its performance in real-world applications, particularly in cases where the availability of labeled data is limited, but large volumes of unlabeled remote sensing data are accessible.</p>
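<p>A minimal sketch of the weighted combination in Formula 19 is given below. The default weights (<monospace>gamma=1.0</monospace>, <monospace>delta=0.5</monospace>) and the per-batch loss values are purely illustrative assumptions; the values used for the AMSCN are not stated in this section.</p>
<preformat>
```python
def total_loss(loss_pred, loss_ssl, gamma=1.0, delta=0.5):
    """Weighted sum of the supervised and self-supervised losses (Formula 19)."""
    if not (gamma >= 0 and delta >= 0):
        raise ValueError("loss weights must be non-negative")
    return gamma * loss_pred + delta * loss_ssl

# Hypothetical per-batch loss values, for illustration only.
balanced = total_loss(0.40, 0.10)                         # both terms active
ssl_only = total_loss(0.40, 0.10, gamma=0.0, delta=1.0)   # e.g., unlabeled batch
```
</preformat>
<p>Setting <monospace>gamma</monospace> to zero reduces the objective to the reconstruction term, which is one way to handle batches that contain no labels.</p>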
</sec>
<sec id="s3-4-2">
<title>3.4.2 Robust optimization techniques</title>
<p>To improve the resilience of the AMSCN (Attention-based Multi-Scale Convolutional Network) against the noise and uncertainties often present in remote sensing data, we incorporate several robust optimization techniques into the training process. These techniques are essential for ensuring that the model can generalize well to new, unseen conditions and maintain high performance even in the presence of noisy or corrupted input data. A key method utilized in this context is adversarial training, a strategy designed to improve the model&#x2019;s robustness by exposing it to deliberately perturbed input data.</p>
<p>Adversarial training operates by introducing adversarial noise, denoted as <inline-formula id="inf126">
<mml:math id="m145">
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, into the original input images. These perturbed inputs are referred to as adversarial examples and are generated by adding the noise <inline-formula id="inf127">
<mml:math id="m146">
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to the original input images <inline-formula id="inf128">
<mml:math id="m147">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, yielding adversarial inputs <inline-formula id="inf129">
<mml:math id="m148">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>adv</mml:mtext>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. The perturbation <inline-formula id="inf130">
<mml:math id="m149">
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is carefully crafted to maximize the model&#x2019;s prediction error, typically by following the gradient of the model&#x2019;s loss with respect to the input data. This adversarial noise is often subtle enough to be imperceptible to human observers but can significantly impact the model&#x2019;s predictions.</p>
<p>Formally, the adversarial training objective can be expressed as <xref ref-type="disp-formula" rid="e20">Formula 20</xref>:<disp-formula id="e20">
<mml:math id="m150">
<mml:mrow>
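<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>adv</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>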
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>adv</mml:mtext>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(20)</label>
</disp-formula>where <inline-formula id="inf131">
<mml:math id="m151">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the binary cross-entropy loss function used for the main classification task, <inline-formula id="inf132">
<mml:math id="m152">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> denotes the model with parameters <inline-formula id="inf133">
<mml:math id="m153">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf134">
<mml:math id="m154">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>adv</mml:mtext>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> represents the adversarial input for the <inline-formula id="inf135">
<mml:math id="m155">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th sample, <inline-formula id="inf136">
<mml:math id="m156">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the associated temporal data or additional features (such as climatic or geographical information), and <inline-formula id="inf137">
<mml:math id="m157">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the ground truth label for the <inline-formula id="inf138">
<mml:math id="m158">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>-th sample in the dataset. The objective of adversarial training is to minimize the prediction error on these adversarial examples, thus forcing the model to become more robust to small, strategically designed perturbations in the input data.</p>
<p>The adversarial noise <inline-formula id="inf139">
<mml:math id="m159">
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is typically generated by maximizing the loss function with respect to the input, using a method such as the Fast Gradient Sign Method (FGSM), which computes <inline-formula id="inf140">
<mml:math id="m160">
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> as follows (<xref ref-type="disp-formula" rid="e21">Formula 21</xref>):<disp-formula id="e21">
<mml:math id="m161">
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3f5;</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mtext>sign</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2207;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:msub>
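<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>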
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(21)</label>
</disp-formula>where <inline-formula id="inf141">
<mml:math id="m162">
<mml:mrow>
<mml:mi>&#x3f5;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is a small scalar controlling the magnitude of the perturbation, <inline-formula id="inf142">
<mml:math id="m163">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2207;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the gradient of the loss function with respect to the input image <inline-formula id="inf143">
<mml:math id="m164">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf144">
<mml:math id="m165">
<mml:mrow>
<mml:mtext>sign</mml:mtext>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> denotes the element-wise sign function applied to that gradient. This perturbation is added to the input image <inline-formula id="inf145">
<mml:math id="m166">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to generate the adversarial example <inline-formula id="inf146">
<mml:math id="m167">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>adv</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, and the model is then trained to correctly classify these perturbed inputs.</p>
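<p>To make Formula 21 concrete, the sketch below applies the Fast Gradient Sign Method to a small logistic classifier in plain Python. Several simplifications are assumed: the logistic model stands in for the full network, the temporal input is omitted, the gradient of the binary cross-entropy with respect to the input is written in closed form instead of being obtained by automatic differentiation, and the weights, input values, and perturbation size of 0.05 are arbitrary illustrative choices.</p>
<preformat>
```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability output of a logistic model (stand-in for the network)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def bce(y, p):
    """Binary cross-entropy for a single sample."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def sign(v):
    """Sign of a scalar: -1, 0, or 1."""
    return (v > 0) - (0 > v)

def fgsm_perturbation(weights, bias, x, y, epsilon):
    """n = epsilon * sign(grad_x L_pred), as in Formula 21."""
    p = predict(weights, bias, x)
    # For a sigmoid output with BCE loss, dL/dx_j = (p - y) * w_j.
    grad = [(p - y) * w for w in weights]
    return [epsilon * sign(g) for g in grad]

weights, bias = [2.0, -1.0, 0.5], 0.1   # hypothetical model parameters
x, y = [0.6, 0.2, 0.9], 1               # hypothetical input and label
noise = fgsm_perturbation(weights, bias, x, y, epsilon=0.05)
x_adv = [xi + ni for xi, ni in zip(x, noise)]
loss_clean = bce(y, predict(weights, bias, x))
loss_adv = bce(y, predict(weights, bias, x_adv))
```
</preformat>
<p>Every component of the perturbation has magnitude equal to the chosen step size, and the loss on the perturbed input exceeds the loss on the clean input, which is the property adversarial training exploits.</p>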
<p>By incorporating adversarial training, the AMSCN learns to be less sensitive to small perturbations of the input data, which enhances its robustness and generalization. This is particularly valuable in remote sensing tasks, where data are subject to various sources of noise, such as sensor errors, atmospheric conditions, and preprocessing artifacts. Adversarial training encourages the model to develop feature representations that are more stable and less influenced by these noise sources.</p>
<p>Moreover, the total training loss for the AMSCN can be modified to include both the standard prediction loss and the adversarial loss, leading to an overall objective function defined as (<xref ref-type="disp-formula" rid="e22">Formula 22</xref>):<disp-formula id="e22">
<mml:math id="m168">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>total</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3bc;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3bd;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>adv</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(22)</label>
</disp-formula>where <inline-formula id="inf147">
<mml:math id="m169">
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf148">
<mml:math id="m170">
<mml:mrow>
<mml:mi>&#x3bd;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are weighting coefficients that control the relative importance of the standard and adversarial losses during training. By balancing these two components, the AMSCN can learn to perform well on both clean and adversarial examples, resulting in a more robust and resilient model capable of handling noisy or uncertain input data in real-world remote sensing applications.</p>
<p>Another strategic enhancement involves the use of ensemble learning to quantify and reduce prediction uncertainty. We train multiple instances of the AMSCN with varying initializations and hyperparameters, generating an ensemble of models <inline-formula id="inf149">
<mml:math id="m171">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>. The final prediction is obtained by averaging the outputs of the ensemble models (<xref ref-type="disp-formula" rid="e23">Formula 23</xref>):<disp-formula id="e23">
<mml:math id="m172">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mtext>ensemble</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msubsup>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<label>(23)</label>
</disp-formula>where <inline-formula id="inf150">
<mml:math id="m173">
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the number of models in the ensemble. This ensemble approach not only improves the overall predictive performance but also provides a measure of uncertainty in the predictions, which is crucial for decision-making in agricultural planning. The variance among the predictions from different ensemble members serves as an indicator of uncertainty, allowing stakeholders to assess the confidence in the model&#x2019;s predictions.</p>
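<p>The ensemble average of Formula 23 and the variance-based uncertainty indicator can be illustrated with the Python standard library. In this sketch the member outputs are hypothetical probabilities for a single region, the function name <monospace>ensemble_predict</monospace> is introduced only for illustration, and the use of the population variance is an assumption, since the exact variance estimator is not specified here.</p>
<preformat>
```python
from statistics import fmean, pvariance

def ensemble_predict(member_probs):
    """Return (mean, variance) of the K member predictions for one sample."""
    if not member_probs:
        raise ValueError("the ensemble needs at least one member")
    # The mean is the ensemble prediction (Formula 23); the variance across
    # members serves as the uncertainty indicator described in the text.
    return fmean(member_probs), pvariance(member_probs)

# Hypothetical outputs of K = 5 ensemble members for two regions.
agree = [0.91, 0.89, 0.90, 0.92, 0.88]      # members agree: low uncertainty
disagree = [0.95, 0.40, 0.70, 0.20, 0.85]   # members disagree: high uncertainty
mean_agree, var_agree = ensemble_predict(agree)
mean_disagree, var_disagree = ensemble_predict(disagree)
```
</preformat>
<p>A region on which the members disagree receives a much larger variance than one on which they agree, so the variance can be used to flag predictions that merit closer inspection.</p>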
</sec>
</sec>
</sec>
<sec id="s4">
<title>4 Experiments</title>
<p>This section first describes the data sources and their geographic coverage, and then evaluates the performance of the proposed AMSCN. Four publicly available remote sensing datasets were used in this study: RSICap (<xref ref-type="bibr" rid="B37">Ye et al., 2022</xref>), RSIEval (<xref ref-type="bibr" rid="B14">Hu et al., 2023</xref>), MillionAID (<xref ref-type="bibr" rid="B21">Long et al., 2021</xref>), and HRSID (<xref ref-type="bibr" rid="B34">Wei et al., 2020</xref>). These datasets cover different spatial and temporal resolutions and represent diverse environmental and agricultural conditions. RSICap and RSIEval contain agricultural land annotation data based on high-resolution satellite images, including crop types and land management practices, and are widely used for benchmarking model accuracy. RSICap is a large-scale dataset of annotated satellite images with rich contextual information, which makes it suitable for evaluating both visual and textual modalities. RSIEval, known for its high-resolution satellite images, focuses on fine-grained classification tasks and provides a rigorous test of the model&#x2019;s ability to handle detailed and varied visual features.
The MillionAID dataset integrates multispectral remote sensing images across different geographical regions, providing rich context for training and testing multimodal models; with millions of labeled images covering a wide range of geographic locations and environmental conditions, it tests the scalability and generalization capability of our model. Lastly, the HRSID dataset specializes in high-resolution ship detection, a fine-scale object detection task whose high spatial resolution supports the assessment of complex environmental features. The small size and varied orientations of the objects pose a unique challenge, so this dataset assesses the model&#x2019;s precision in detecting and classifying small objects within large-scale imagery.</p>
<p>To ensure a rigorous evaluation, we used the same training and validation protocol for all datasets. Each dataset was split into training, validation, and test sets in a 70%/15%/15% ratio. The AMSCN was trained using the PyTorch framework on NVIDIA A100 GPUs to handle the large-scale, high-resolution images efficiently. The model was optimized using the AdamW optimizer with a learning rate initially set to <inline-formula id="inf151">
<mml:math id="m174">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> and decayed by a factor of 0.1 after every 10 epochs. We set the batch size to 32, and the model was trained for 50 epochs, with early stopping employed based on the validation loss to prevent overfitting. Data augmentation techniques, including random cropping, flipping, and scaling, were applied to enhance the model&#x2019;s generalization. The multi-scale inputs were generated dynamically during training, ensuring that the model learned robust features across varying resolutions. For the self-supervised component, we masked 50% of the input patches and trained the model to reconstruct the missing parts, encouraging the learning of contextually rich features. The cross-modality features were fused using a cross-attention mechanism, and the final predictions were made using a fully connected layer followed by a sigmoid activation function (<xref ref-type="statement" rid="Algorithm_1">Algorithm 1</xref>).</p>
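<p>The learning-rate schedule and early-stopping rule described above can be sketched as follows. The sketch assumes zero-based epoch indices, and the patience of three epochs and the validation-loss values are hypothetical, because the stopping patience is not reported here; the function names are introduced only for illustration.</p>
<preformat>
```python
def step_decay_lr(epoch, base_lr=1e-4, factor=0.1, step_size=10):
    """Learning rate for a zero-based epoch index under step decay."""
    return base_lr * factor ** (epoch // step_size)

def early_stopping_epoch(val_losses, patience=3):
    """Index of the epoch at which training would stop, or None."""
    best, waited = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if best > loss:
            # Validation loss improved: remember it and reset the counter.
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None

# Learning rates for the 50 training epochs reported in the text.
schedule = [step_decay_lr(e) for e in range(50)]

# Hypothetical validation losses: no improvement after the third epoch.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
stop_at = early_stopping_epoch(val_losses, patience=3)
```
</preformat>
<p>With these settings the learning rate is reduced by a factor of ten at epochs 10, 20, 30, and 40, and training on the hypothetical loss curve would stop at the sixth epoch (index 5), before the later improvement is observed. This illustrates why the patience value is a trade-off between training cost and the risk of stopping too early.</p>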
<p>
<statement content-type="algorithm" id="Algorithm_1">
<label>Algorithm 1</label>
<p>AgriCLIP: Training and Evaluation.<list list-type="simple">
<list-item>
<p>
<bold>Input:</bold> Training Data <inline-formula id="inf152">
<mml:math id="m175">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>train</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, Validation Data <inline-formula id="inf153">
<mml:math id="m176">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>val</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, Test Data <inline-formula id="inf154">
<mml:math id="m177">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>test</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
</list-item>
<list-item>
<p>
<bold>Output:</bold> Trained Model <inline-formula id="inf155">
<mml:math id="m178">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
</list-item>
<list-item>
<p>Initialize model parameters <inline-formula id="inf156">
<mml:math id="m179">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>Set learning rate <inline-formula id="inf157">
<mml:math id="m180">
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>Set batch size <inline-formula id="inf158">
<mml:math id="m181">
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>Set epochs <inline-formula id="inf159">
<mml:math id="m182">
<mml:mrow>
<mml:mi>E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>50</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>Initialize evaluation metrics <inline-formula id="inf160">
<mml:math id="m183">
<mml:mrow>
<mml:mtext>Recall</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf161">
<mml:math id="m184">
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf162">
<mml:math id="m185">
<mml:mrow>
<mml:mtext>F1</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf163">
<mml:math id="m186">
<mml:mrow>
<mml:mtext>AUC</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>for</bold> <inline-formula id="inf164">
<mml:math id="m187">
<mml:mrow>
<mml:mtext>epoch</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> <bold>to</bold>
<inline-formula id="inf165">
<mml:math id="m188">
<mml:mrow>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> <bold>do</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;for</bold> <italic>each batch</italic> <inline-formula id="inf166">
<mml:math id="m189">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>train</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> <bold>do</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Obtain multi-scale inputs <inline-formula id="inf167">
<mml:math id="m190">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>ms</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> from <inline-formula id="inf168">
<mml:math id="m191">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Mask 50% of patches in <inline-formula id="inf169">
<mml:math id="m192">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>ms</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, get <inline-formula id="inf170">
<mml:math id="m193">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>masked</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>
<inline-formula id="inf171">
<mml:math id="m194">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>reconstructed</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2190;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>masked</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Compute reconstruction loss: <inline-formula id="inf172">
<mml:math id="m195">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>rec</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>masked</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>reconstructed</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Extract features from text: <inline-formula id="inf173">
<mml:math id="m196">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>features</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2190;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Fuse features using cross-attention: <inline-formula id="inf174">
<mml:math id="m197">
<mml:mrow>
<mml:mi>Z</mml:mi>
<mml:mo>&#x2190;</mml:mo>
<mml:mtext>CrossAttention</mml:mtext>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>ms</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>features</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Compute prediction <inline-formula id="inf175">
<mml:math id="m198">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x22a4;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Compute prediction loss: <inline-formula id="inf176">
<mml:math id="m199">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:msubsup>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Compute total loss: <inline-formula id="inf177">
<mml:math id="m200">
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>pred</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3bb;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>rec</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Update parameters: <inline-formula id="inf178">
<mml:math id="m201">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>&#x2190;</mml:mo>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3ba;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>&#x2207;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;end</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;if</bold> <italic>validation loss</italic> <inline-formula id="inf179">
<mml:math id="m202">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>val</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> <italic>does not improve</italic> <bold>then</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>Reduce learning rate: <inline-formula id="inf180">
<mml:math id="m203">
<mml:mrow>
<mml:mi>&#x3ba;</mml:mi>
<mml:mo>&#x2190;</mml:mo>
<mml:mi>&#x3ba;</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;if</bold> <italic>no improvement for 5 epochs</italic> <bold>then</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;&#x2003;break</bold>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;end</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;end</bold>
</p>
</list-item>
<list-item>
<p>
<bold>end</bold>
</p>
</list-item>
<list-item>
<p>
<bold>for</bold> <italic>each batch</italic> <inline-formula id="inf181">
<mml:math id="m204">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>val</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> <bold>do</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>Compute predictions <inline-formula id="inf182">
<mml:math id="m205">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>Update metrics: <inline-formula id="inf183">
<mml:math id="m206">
<mml:mrow>
<mml:mtext>Recall</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf184">
<mml:math id="m207">
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf185">
<mml:math id="m208">
<mml:mrow>
<mml:mtext>F1</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf186">
<mml:math id="m209">
<mml:mrow>
<mml:mtext>AUC</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>end</bold>
</p>
</list-item>
<list-item>
<p>
<bold>for</bold> <italic>each batch</italic> <inline-formula id="inf187">
<mml:math id="m210">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>test</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> <bold>do</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>Compute predictions <inline-formula id="inf188">
<mml:math id="m211">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>Evaluate <inline-formula id="inf189">
<mml:math id="m212">
<mml:mrow>
<mml:mtext>Recall</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf190">
<mml:math id="m213">
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf191">
<mml:math id="m214">
<mml:mrow>
<mml:mtext>F1</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf192">
<mml:math id="m215">
<mml:mrow>
<mml:mtext>AUC</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>end</bold>
</p>
</list-item>
<list-item>
<p>Save final model <inline-formula id="inf193">
<mml:math id="m216">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>;</p>
</list-item>
<list-item>
<p>
<bold>End</bold>
</p>
</list-item>
</list>
</p>
</statement>
</p>
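<p>The loss computation and learning-rate schedule of the training loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function and class names are hypothetical, the decay factor &#x3b4; defaults to 0.5 only as a placeholder because its value is not given here, and "no improvement for 5 epochs" is read as five consecutive epochs without a new best validation loss.</p>
<preformat>
```python
import math


def reconstruction_loss(x_masked, x_reconstructed):
    """Squared L2 distance between two equally long feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x_masked, x_reconstructed))


def prediction_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over a batch of labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def total_loss(l_pred, l_rec, lam):
    """Prediction loss plus the reconstruction loss weighted by lambda."""
    return l_pred + lam * l_rec


class PlateauController:
    """Scale the learning rate by delta whenever validation loss stalls.

    Assumption: training stops after `patience` consecutive stalled epochs.
    """

    def __init__(self, lr, delta=0.5, patience=5):
        self.lr = lr
        self.delta = delta        # placeholder decay factor (assumed)
        self.patience = patience
        self.best = math.inf
        self.stalled = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if self.best > val_loss:
            self.best = val_loss
            self.stalled = 0
            return False
        self.stalled += 1
        self.lr *= self.delta
        return self.stalled >= self.patience
```
</preformat>
<p>In use, <monospace>step</monospace> would be called once per epoch with the validation loss, and the training loop would exit as soon as it returns <monospace>True</monospace>.</p>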
<sec id="s4-1">
<title>4.1 Comparison with state-of-the-art methods</title>
<p>Results comparing the Adaptive Multi-Scale Consistency Network (AMSCN) with state-of-the-art methods on the RSICap and RSIEval datasets are summarized in <xref ref-type="table" rid="T1">Table 1</xref> and <xref ref-type="fig" rid="F4">Figure 4</xref>. AMSCN achieves the highest value on all four reported metrics (accuracy, recall, F1 score, and AUC) on both datasets. On RSICap, it reaches an accuracy of 97.54%, which is 1.20 percentage points above the closest competitor, Scale-MAE (96.34%). On RSIEval, it reaches 97.23%, which is 1.00 percentage point above the closest competitor, BLIP (96.23%), suggesting that the gains carry over across datasets. This performance can be attributed to an architecture that combines multi-scale data processing with adaptive consistency and cross-modality feature fusion. The Multi-Scale Feature Extractor (MSFE) keeps features extracted at different scales consistent and adapts them to varying spatial resolutions, which is critical for accurately predicting the distribution of double-cropped soybeans under diverse environmental conditions. In addition, the Cross-Modality Fusion Layer (CMFL) allows AMSCN to incorporate context-specific information from textual data, further enhancing its predictive power.</p>
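<p>To make the cross-modality fusion step concrete, the following sketch shows one way the cross-attention between image and text features, followed by a sigmoid prediction head, could look in plain Python. It is an illustration only: it assumes a single attention head, no learned query, key, or value projections, image and text features of the same dimension, and mean pooling of the fused tokens before the prediction head. None of these choices are confirmed by the text, and all function names are hypothetical.</p>
<preformat>
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(x_ms, t_features):
    """Each image token (query) attends over all text tokens (keys, values)."""
    d = len(t_features[0])
    fused = []
    for q in x_ms:
        # Scaled dot-product score between this query and every text token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in t_features
        ]
        weights = softmax(scores)
        # Attention-weighted average of the text tokens.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, t_features)) for j in range(d)]
        )
    return fused


def predict(z, w):
    """Sigmoid of the dot product of w with the mean-pooled fused tokens."""
    pooled = [sum(col) / len(z) for col in zip(*z)]
    logit = sum(wi * zi for wi, zi in zip(w, pooled))
    return 1.0 / (1.0 + math.exp(-logit))
```
</preformat>
<p>Because the attention weights sum to one, each fused token is a convex combination of the text tokens; a learned projection per modality would normally precede this step.</p>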
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Comparison of AMSCN with state-of-the-art (SOTA) methods on the RSICap and RSIEval datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="4" align="center">RSICap dataset</th>
<th colspan="4" align="center">RSIEval dataset</th>
</tr>
<tr>
<th align="center">Accuracy (%)</th>
<th align="center">Recall (%)</th>
<th align="center">F1 Score (%)</th>
<th align="center">AUC (%)</th>
<th align="center">Accuracy (%)</th>
<th align="center">Recall (%)</th>
<th align="center">F1 Score (%)</th>
<th align="center">AUC (%)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">CLIP <xref ref-type="bibr" rid="B30">Teng et al. (2021)</xref>
</td>
<td align="center">94.12<inline-formula id="inf194">
<mml:math id="m217">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">91.05<inline-formula id="inf195">
<mml:math id="m218">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">89.34<inline-formula id="inf196">
<mml:math id="m219">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">92.45<inline-formula id="inf197">
<mml:math id="m220">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">93.85<inline-formula id="inf198">
<mml:math id="m221">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">90.28<inline-formula id="inf199">
<mml:math id="m222">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">90.62<inline-formula id="inf200">
<mml:math id="m223">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">85.37<inline-formula id="inf201">
<mml:math id="m224">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">ViT <xref ref-type="bibr" rid="B32">Wang et al. (2022a)</xref>
</td>
<td align="center">89.75<inline-formula id="inf202">
<mml:math id="m225">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">86.32<inline-formula id="inf203">
<mml:math id="m226">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">87.12<inline-formula id="inf204">
<mml:math id="m227">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">90.17<inline-formula id="inf205">
<mml:math id="m228">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">92.05<inline-formula id="inf206">
<mml:math id="m229">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">89.34<inline-formula id="inf207">
<mml:math id="m230">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">88.56<inline-formula id="inf208">
<mml:math id="m231">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">88.92<inline-formula id="inf209">
<mml:math id="m232">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">BLIP <xref ref-type="bibr" rid="B38">Yu et al. (2024)</xref>
</td>
<td align="center">86.54<inline-formula id="inf210">
<mml:math id="m233">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">88.12<inline-formula id="inf211">
<mml:math id="m234">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">90.11<inline-formula id="inf212">
<mml:math id="m235">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">91.30<inline-formula id="inf213">
<mml:math id="m236">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">96.23<inline-formula id="inf214">
<mml:math id="m237">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">93.65<inline-formula id="inf215">
<mml:math id="m238">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">89.45<inline-formula id="inf216">
<mml:math id="m239">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">88.22<inline-formula id="inf217">
<mml:math id="m240">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">SatMAE <xref ref-type="bibr" rid="B5">Cong et al. (2022)</xref>
</td>
<td align="center">95.87<inline-formula id="inf218">
<mml:math id="m241">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">92.45<inline-formula id="inf219">
<mml:math id="m242">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">91.52<inline-formula id="inf220">
<mml:math id="m243">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">93.11<inline-formula id="inf221">
<mml:math id="m244">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">94.85<inline-formula id="inf222">
<mml:math id="m245">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">91.78<inline-formula id="inf223">
<mml:math id="m246">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">92.12<inline-formula id="inf224">
<mml:math id="m247">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">89.45<inline-formula id="inf225">
<mml:math id="m248">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">Scale-MAE <xref ref-type="bibr" rid="B29">Tang et al. (2024)</xref>
</td>
<td align="center">96.34<inline-formula id="inf226">
<mml:math id="m249">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">93.56<inline-formula id="inf227">
<mml:math id="m250">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">92.34<inline-formula id="inf228">
<mml:math id="m251">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">94.21<inline-formula id="inf229">
<mml:math id="m252">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">95.12<inline-formula id="inf230">
<mml:math id="m253">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">92.34<inline-formula id="inf231">
<mml:math id="m254">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">91.67<inline-formula id="inf232">
<mml:math id="m255">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">90.56<inline-formula id="inf233">
<mml:math id="m256">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">ResNet-50 <xref ref-type="bibr" rid="B12">Harini et al. (2024)</xref>
</td>
<td align="center">90.87<inline-formula id="inf234">
<mml:math id="m257">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">88.05<inline-formula id="inf235">
<mml:math id="m258">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">89.12<inline-formula id="inf236">
<mml:math id="m259">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">90.85<inline-formula id="inf237">
<mml:math id="m260">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">91.65<inline-formula id="inf238">
<mml:math id="m261">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">89.05<inline-formula id="inf239">
<mml:math id="m262">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">88.12<inline-formula id="inf240">
<mml:math id="m263">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">87.45<inline-formula id="inf241">
<mml:math id="m264">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">AMSCN</td>
<td align="center">97.54<inline-formula id="inf242">
<mml:math id="m265">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">95.67<inline-formula id="inf243">
<mml:math id="m266">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">94.12<inline-formula id="inf244">
<mml:math id="m267">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">95.30<inline-formula id="inf245">
<mml:math id="m268">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">97.23<inline-formula id="inf246">
<mml:math id="m269">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">94.78<inline-formula id="inf247">
<mml:math id="m270">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">93.15<inline-formula id="inf248">
<mml:math id="m271">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">91.92<inline-formula id="inf249">
<mml:math id="m272">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Comparison of AMSCN with state-of-the-art (SOTA) methods on the RSICap and RSIEval datasets.</p>
</caption>
<graphic xlink:href="fenvs-12-1515752-g004.tif"/>
</fig>
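<p>For reference, the classification metrics reported above can be computed from binary labels and predicted probabilities as sketched below. This is a generic illustration rather than the evaluation code used in this study: the 0.5 decision threshold, the pairwise (rank-based) AUC estimate, and all function names are assumptions.</p>
<preformat>
```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, recall, precision, and F1 as fractions in [0, 1]."""
    tp, fp, fn, tn = confusion_counts(y_true, y_prob, threshold)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


def auc(y_true, y_prob):
    """Chance that a random positive outscores a random negative.

    Ties count as half a win. Requires at least one sample of each class.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```
</preformat>
<p>Multiplying the returned fractions by 100 gives values on the same scale as those in the table.</p>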
<p>
<xref ref-type="table" rid="T2">Table 2</xref> and <xref ref-type="fig" rid="F5">Figure 5</xref> present the results for the MillionAID and HRSID datasets, focusing on computational efficiency and scalability. Beyond its accuracy, AMSCN also improves on computational metrics such as parameter count, FLOPs, inference time, and training time. For example, on the MillionAID dataset AMSCN requires 232.47&#xa0;M parameters and 126.77&#xa0;G FLOPs, fewer than CLIP (313.06&#xa0;M, 235.00&#xa0;G) and ViT (232.80&#xa0;M, 282.15&#xa0;G). The shorter inference and training times of AMSCN, as shown in <xref ref-type="table" rid="T2">Table 2</xref>, highlight its efficiency and make it particularly suitable for large-scale remote sensing applications. These efficiency gains can be attributed to its streamlined architecture, which optimizes multi-scale input processing without compromising accuracy. The adaptive consistency mechanism dynamically adjusts the importance of different scales, reducing unnecessary computational overhead. Additionally, the integration of self-supervised learning reduces the need for large amounts of labeled data, enabling effective learning while conserving computational resources.</p>
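<p>As a rough illustration of how such efficiency figures are typically obtained, the sketch below counts parameters and FLOPs for a stack of fully connected layers and times repeated calls of a function. The text does not specify how these quantities were measured, so the convention of two FLOPs per multiply-add, the layer sizes, and all function names are assumptions made for illustration.</p>
<preformat>
```python
import time


def dense_layer_cost(n_in, n_out):
    """Parameter count and FLOPs of one fully connected layer."""
    params = n_in * n_out + n_out   # weights plus biases
    flops = 2 * n_in * n_out        # one multiply and one add per weight
    return params, flops


def model_cost(layer_sizes):
    """Total cost of consecutive dense layers, e.g. [512, 256, 1].

    Returns (millions of parameters, GFLOPs per forward pass).
    """
    params = flops = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        p, f = dense_layer_cost(n_in, n_out)
        params += p
        flops += f
    return params / 1e6, flops / 1e9


def mean_latency_ms(fn, repeats=20, warmup=3):
    """Average wall-clock time of fn() in milliseconds after a warm-up."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) * 1000.0 / repeats
```
</preformat>
<p>In practice, <monospace>fn</monospace> would wrap a single forward pass of the model on a fixed input, and the resulting latency depends on the hardware used.</p>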
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Computational cost comparison of AMSCN with state-of-the-art (SOTA) methods on the MillionAID and HRSID datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="4" align="center">MillionAID dataset</th>
<th colspan="4" align="center">HRSID dataset</th>
</tr>
<tr>
<th align="center">Parameters (M)</th>
<th align="center">FLOPs (G)</th>
<th align="center">Inference Time (ms)</th>
<th align="center">Training Time (s)</th>
<th align="center">Parameters (M)</th>
<th align="center">FLOPs (G)</th>
<th align="center">Inference Time (ms)</th>
<th align="center">Training Time (s)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">CLIP</td>
<td align="center">313.06<inline-formula id="inf250">
<mml:math id="m273">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">235.00<inline-formula id="inf251">
<mml:math id="m274">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">293.76<inline-formula id="inf252">
<mml:math id="m275">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">295.85<inline-formula id="inf253">
<mml:math id="m276">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">306.11<inline-formula id="inf254">
<mml:math id="m277">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">339.02<inline-formula id="inf255">
<mml:math id="m278">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">396.82<inline-formula id="inf256">
<mml:math id="m279">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">318.76<inline-formula id="inf257">
<mml:math id="m280">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">ViT</td>
<td align="center">232.80<inline-formula id="inf258">
<mml:math id="m281">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">282.15<inline-formula id="inf259">
<mml:math id="m282">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">324.95<inline-formula id="inf260">
<mml:math id="m283">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">266.78<inline-formula id="inf261">
<mml:math id="m284">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">301.20<inline-formula id="inf262">
<mml:math id="m285">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">235.59<inline-formula id="inf263">
<mml:math id="m286">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">379.65<inline-formula id="inf264">
<mml:math id="m287">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">386.29<inline-formula id="inf265">
<mml:math id="m288">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
</tr>
<tr>
<td align="center">BLIP</td>
<td align="center">215.66<inline-formula id="inf266">
<mml:math id="m289">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">236.86<inline-formula id="inf267">
<mml:math id="m290">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">262.79<inline-formula id="inf268">
<mml:math id="m291">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">233.32<inline-formula id="inf269">
<mml:math id="m292">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">378.82<inline-formula id="inf270">
<mml:math id="m293">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">320.18<inline-formula id="inf271">
<mml:math id="m294">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">359.32<inline-formula id="inf272">
<mml:math id="m295">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">214.86<inline-formula id="inf273">
<mml:math id="m296">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
</tr>
<tr>
<td align="center">SatMAE</td>
<td align="center">373.92<inline-formula id="inf274">
<mml:math id="m297">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">258.79<inline-formula id="inf275">
<mml:math id="m298">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">216.96<inline-formula id="inf276">
<mml:math id="m299">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">303.15<inline-formula id="inf277">
<mml:math id="m300">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">246.66<inline-formula id="inf278">
<mml:math id="m301">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">317.25<inline-formula id="inf279">
<mml:math id="m302">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">310.58<inline-formula id="inf280">
<mml:math id="m303">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">362.57<inline-formula id="inf281">
<mml:math id="m304">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
</tr>
<tr>
<td align="center">Scale-MAE</td>
<td align="center">243.40<inline-formula id="inf282">
<mml:math id="m305">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">347.71<inline-formula id="inf283">
<mml:math id="m306">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">289.21<inline-formula id="inf284">
<mml:math id="m307">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">208.13<inline-formula id="inf285">
<mml:math id="m308">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">318.61<inline-formula id="inf286">
<mml:math id="m309">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">215.34<inline-formula id="inf287">
<mml:math id="m310">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">379.29<inline-formula id="inf288">
<mml:math id="m311">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">259.06<inline-formula id="inf289">
<mml:math id="m312">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">ResNet-50</td>
<td align="center">304.29<inline-formula id="inf290">
<mml:math id="m313">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">348.44<inline-formula id="inf291">
<mml:math id="m314">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">398.53<inline-formula id="inf292">
<mml:math id="m315">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">396.70<inline-formula id="inf293">
<mml:math id="m316">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">210.14<inline-formula id="inf294">
<mml:math id="m317">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">331.26<inline-formula id="inf295">
<mml:math id="m318">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">238.24<inline-formula id="inf296">
<mml:math id="m319">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">265.43<inline-formula id="inf297">
<mml:math id="m320">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
</tr>
<tr>
<td align="center">AgriCLIP</td>
<td align="center">232.47<inline-formula id="inf298">
<mml:math id="m321">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">126.77<inline-formula id="inf299">
<mml:math id="m322">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">213.96<inline-formula id="inf300">
<mml:math id="m323">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">181.29<inline-formula id="inf301">
<mml:math id="m324">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">177.66<inline-formula id="inf302">
<mml:math id="m325">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">105.98<inline-formula id="inf303">
<mml:math id="m326">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">179.13<inline-formula id="inf304">
<mml:math id="m327">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">156.45<inline-formula id="inf305">
<mml:math id="m328">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Comparison of AMSCN with SOTA methods on MillionAID and HRSID datasets.</p>
</caption>
<graphic xlink:href="fenvs-12-1515752-g005.tif"/>
</fig>
<sec id="s4-1-1">
<title>4.1.1 Ablation study</title>
<p>To understand the contribution of each component of the AMSCN, we conduct an ablation study by systematically removing or altering key components of the model. We evaluate the modified models on predictive metrics (Accuracy, Recall, F1 Score, and AUC) using the RSICap and MillionAID datasets, and on efficiency metrics (Parameters, FLOPs, Inference Time, and Training Time) using the RSIEval and HRSID datasets. The results are summarized in <xref ref-type="table" rid="T3">Tables 3</xref>, <xref ref-type="table" rid="T4">4</xref> and <xref ref-type="fig" rid="F6">Figures 6</xref>, <xref ref-type="fig" rid="F7">7</xref>.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Ablation study on RSICap and MillionAID datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Model variant</th>
<th colspan="4" align="center">RSICap dataset</th>
<th colspan="4" align="center">MillionAID dataset</th>
</tr>
<tr>
<th align="center">Accuracy</th>
<th align="center">Recall</th>
<th align="center">F1 Score</th>
<th align="center">AUC</th>
<th align="center">Accuracy</th>
<th align="center">Recall</th>
<th align="center">F1 Score</th>
<th align="center">AUC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Full Model</td>
<td align="center">97.54<inline-formula id="inf306">
<mml:math id="m329">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">96.67<inline-formula id="inf307">
<mml:math id="m330">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">94.12<inline-formula id="inf308">
<mml:math id="m331">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">93.30<inline-formula id="inf309">
<mml:math id="m332">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">97.54<inline-formula id="inf310">
<mml:math id="m333">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">95.78<inline-formula id="inf311">
<mml:math id="m334">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">95.15<inline-formula id="inf312">
<mml:math id="m335">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">92.92<inline-formula id="inf313">
<mml:math id="m336">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">w/o Scale Consistency</td>
<td align="center">88.32<inline-formula id="inf314">
<mml:math id="m337">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">88.70<inline-formula id="inf315">
<mml:math id="m338">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">89.27<inline-formula id="inf316">
<mml:math id="m339">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">86.69<inline-formula id="inf317">
<mml:math id="m340">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">88.06<inline-formula id="inf318">
<mml:math id="m341">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">84.78<inline-formula id="inf319">
<mml:math id="m342">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">86.14<inline-formula id="inf320">
<mml:math id="m343">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
<td align="center">91.13<inline-formula id="inf321">
<mml:math id="m344">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.01</td>
</tr>
<tr>
<td align="center">w/o Cross-Attention</td>
<td align="center">86.70<inline-formula id="inf322">
<mml:math id="m345">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">89.67<inline-formula id="inf323">
<mml:math id="m346">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">86.33<inline-formula id="inf324">
<mml:math id="m347">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">92.67<inline-formula id="inf325">
<mml:math id="m348">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">86.94<inline-formula id="inf326">
<mml:math id="m349">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">87.07<inline-formula id="inf327">
<mml:math id="m350">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">86.06<inline-formula id="inf328">
<mml:math id="m351">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
<td align="center">87.29<inline-formula id="inf329">
<mml:math id="m352">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.02</td>
</tr>
<tr>
<td align="center">w/o Self-Supervision</td>
<td align="center">89.77<inline-formula id="inf330">
<mml:math id="m353">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">85.56<inline-formula id="inf331">
<mml:math id="m354">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">90.00<inline-formula id="inf332">
<mml:math id="m355">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">85.86<inline-formula id="inf333">
<mml:math id="m356">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">90.97<inline-formula id="inf334">
<mml:math id="m357">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">89.89<inline-formula id="inf335">
<mml:math id="m358">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">85.94<inline-formula id="inf336">
<mml:math id="m359">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">87.70<inline-formula id="inf337">
<mml:math id="m360">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Ablation study on RSIEval and HRSID datasets.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Model variant</th>
<th colspan="4" align="center">RSIEval dataset</th>
<th colspan="4" align="center">HRSID dataset</th>
</tr>
<tr>
<th align="center">Parameters (M)</th>
<th align="center">FLOPs (G)</th>
<th align="center">Inference Time (ms)</th>
<th align="center">Training Time (s)</th>
<th align="center">Parameters (M)</th>
<th align="center">FLOPs (G)</th>
<th align="center">Inference Time (ms)</th>
<th align="center">Training Time (s)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">Full Model</td>
<td align="center">340.80<inline-formula id="inf338">
<mml:math id="m361">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">256.83<inline-formula id="inf339">
<mml:math id="m362">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">326.56<inline-formula id="inf340">
<mml:math id="m363">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">396.96<inline-formula id="inf341">
<mml:math id="m364">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">217.04<inline-formula id="inf342">
<mml:math id="m365">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">325.30<inline-formula id="inf343">
<mml:math id="m366">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">320.29<inline-formula id="inf344">
<mml:math id="m367">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">226.92<inline-formula id="inf345">
<mml:math id="m368">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">w/o Scale Consistency</td>
<td align="center">321.82<inline-formula id="inf346">
<mml:math id="m369">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">380.41<inline-formula id="inf347">
<mml:math id="m370">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">337.65<inline-formula id="inf348">
<mml:math id="m371">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">250.17<inline-formula id="inf349">
<mml:math id="m372">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">246.84<inline-formula id="inf350">
<mml:math id="m373">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">220.33<inline-formula id="inf351">
<mml:math id="m374">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">293.39<inline-formula id="inf352">
<mml:math id="m375">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">287.81<inline-formula id="inf353">
<mml:math id="m376">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">w/o Cross-Attention</td>
<td align="center">275.21<inline-formula id="inf354">
<mml:math id="m377">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">247.65<inline-formula id="inf355">
<mml:math id="m378">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">372.50<inline-formula id="inf356">
<mml:math id="m379">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">349.19<inline-formula id="inf357">
<mml:math id="m380">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">204.76<inline-formula id="inf358">
<mml:math id="m381">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">332.64<inline-formula id="inf359">
<mml:math id="m382">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">318.08<inline-formula id="inf360">
<mml:math id="m383">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">363.24<inline-formula id="inf361">
<mml:math id="m384">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
<tr>
<td align="center">w/o Self-Supervision</td>
<td align="center">232.02<inline-formula id="inf362">
<mml:math id="m385">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">127.11<inline-formula id="inf363">
<mml:math id="m386">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">114.99<inline-formula id="inf364">
<mml:math id="m387">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">142.15<inline-formula id="inf365">
<mml:math id="m388">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">114.95<inline-formula id="inf366">
<mml:math id="m389">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">233.00<inline-formula id="inf367">
<mml:math id="m390">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">113.11<inline-formula id="inf368">
<mml:math id="m391">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
<td align="center">203.69<inline-formula id="inf369">
<mml:math id="m392">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.03</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Ablation study on RSICap and MillionAID datasets.</p>
</caption>
<graphic xlink:href="fenvs-12-1515752-g006.tif"/>
</fig>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Ablation study on RSIEval and HRSID datasets.</p>
</caption>
<graphic xlink:href="fenvs-12-1515752-g007.tif"/>
</fig>
</sec>
<sec id="s4-1-2">
<title>4.1.2 Ablation study insights</title>
<p>The ablation results in <xref ref-type="table" rid="T3">Tables 3</xref>, <xref ref-type="table" rid="T4">4</xref> show how each component contributes to the AMSCN model. Removing the scale consistency mechanism caused a marked drop in accuracy and recall, which emphasizes the importance of this mechanism for achieving high segmentation performance. On the RSICap dataset, for instance, accuracy fell from 97.54% to 88.32% and recall from 96.67% to 88.70% when the mechanism was omitted (<xref ref-type="table" rid="T3">Table 3</xref>). Similarly, removing the cross-attention mechanism lowered the F1 score (from 94.12% to 86.33% on RSICap), underscoring the need for effective integration of visual and textual features. Excluding the self-supervised learning component had mixed effects, as reflected in <xref ref-type="table" rid="T3">Tables 3</xref>, <xref ref-type="table" rid="T4">4</xref>. Accuracy and AUC declined, but the computational load also decreased: on RSIEval, parameters dropped from 340.80 M to 232.02 M and FLOPs from 256.83 G to 127.11 G. This indicates a trade-off between performance and efficiency. Self-supervised learning enhances performance, particularly in data-scarce scenarios, but it also increases computational requirements.</p>
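<p>The per-component drops discussed above can be recomputed directly from the RSICap columns of <xref ref-type="table" rid="T3">Table 3</xref>. The following is a minimal sketch rather than part of the original evaluation pipeline: the dictionary layout and the helper name <monospace>ablation_drops</monospace> are illustrative assumptions, and only the reported mean values are used.</p>
<preformat>
```python
from decimal import Decimal

# RSICap means transcribed from Table 3 (the reported +/- spreads are omitted).
# The dictionary layout is an illustrative assumption, not the authors' format.
RSICAP = {
    "Full Model": {
        "Accuracy": "97.54", "Recall": "96.67", "F1 Score": "94.12", "AUC": "93.30"},
    "w/o Scale Consistency": {
        "Accuracy": "88.32", "Recall": "88.70", "F1 Score": "89.27", "AUC": "86.69"},
    "w/o Cross-Attention": {
        "Accuracy": "86.70", "Recall": "89.67", "F1 Score": "86.33", "AUC": "92.67"},
    "w/o Self-Supervision": {
        "Accuracy": "89.77", "Recall": "85.56", "F1 Score": "90.00", "AUC": "85.86"},
}


def ablation_drops(table, baseline="Full Model"):
    """Return baseline minus variant, in percentage points, per variant and metric."""
    base = table[baseline]
    drops = {}
    for variant, metrics in table.items():
        if variant == baseline:
            continue  # the baseline is the reference, not an ablated variant
        drops[variant] = {
            name: Decimal(base[name]) - Decimal(value)
            for name, value in metrics.items()
        }
    return drops


drops = ablation_drops(RSICAP)
for variant, per_metric in drops.items():
    summary = ", ".join(f"{name}: {drop:.2f}" for name, drop in per_metric.items())
    print(f"{variant}: {summary}")
```
</preformat>
<p>Decimal arithmetic is used so that the differences match the two-decimal values in the table exactly. A positive entry means the full model scores higher than the ablated variant on that metric.</p>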
<p>In a further experiment (<xref ref-type="table" rid="T5">Table 5</xref>), we introduced two specialized datasets, the Crop Yield Prediction Dataset and the GF-1 WFV Dataset, to address the challenge of predicting double-cropping soybean distribution. These datasets contain rich temporal and spatial information, including historical soybean planting data, soil and climate conditions, and high-resolution remote sensing imagery, which makes them well suited to evaluating how the AgriCLIP model predicts soybean distribution across diverse environmental and agricultural contexts. The results in <xref ref-type="table" rid="T5">Table 5</xref> show that AgriCLIP consistently outperforms other mainstream models, including CLIP, ViT, BLIP, SatMAE, Scale-MAE, and ResNet-50. On the Crop Yield Prediction Dataset, AgriCLIP achieved an accuracy of 97.84%, a recall of 95.27%, an F1 score of 93.72%, and an AUC of 95.18%. Relative to the second-best performing model, CLIP, these are gains of 1.81 percentage points in accuracy, 9.85 in recall, 4.02 in F1 score, and 3.42 in AUC. This performance boost highlights the model&#x2019;s ability to capture the complex environmental factors that influence soybean cropping suitability. Similarly, on the GF-1 WFV Dataset, AgriCLIP achieved an accuracy of 98.18%, a recall of 94.01%, an F1 score of 94.07%, and an AUC of 95.34%. Compared to the second-best model, AgriCLIP achieved a minimum improvement of 2.05 percentage points across all metrics. These findings underline AgriCLIP&#x2019;s robustness in analyzing high-resolution remote sensing imagery and its predictive capability. While some comparison models, such as CLIP and ViT, showed strengths in isolated metrics, their overall performance lacked the balance and consistency observed in AgriCLIP. This further emphasizes AgriCLIP&#x2019;s advantage as a comprehensive and adaptive solution for predicting soybean distribution.</p>
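<p>The gains over CLIP quoted for the Crop Yield Prediction Dataset are absolute differences between scores, not relative changes. The sketch below makes that distinction explicit. It assumes the AgriCLIP means stated in the text and the CLIP means listed in <xref ref-type="table" rid="T5">Table 5</xref>; the helper name <monospace>gains</monospace> is hypothetical.</p>
<preformat>
```python
from decimal import Decimal

METRICS = ("Accuracy", "Recall", "F1 Score", "AUC")

# Crop Yield Prediction Dataset means: AgriCLIP as quoted in the text,
# CLIP as listed in Table 5. The +/- spreads are not used here.
AGRICLIP = dict(zip(METRICS, ("97.84", "95.27", "93.72", "95.18")))
CLIP = dict(zip(METRICS, ("96.03", "85.42", "89.70", "91.76")))


def gains(model, reference):
    """Per metric: (absolute gain in points, relative gain in percent)."""
    result = {}
    for metric in METRICS:
        model_score = Decimal(model[metric])
        reference_score = Decimal(reference[metric])
        absolute = model_score - reference_score
        # Relative gain is expressed against the reference score, to 2 decimals.
        relative = (absolute / reference_score * 100).quantize(Decimal("0.01"))
        result[metric] = (absolute, relative)
    return result


result = gains(AGRICLIP, CLIP)
for metric, (absolute, relative) in result.items():
    print(f"{metric}: +{absolute} points ({relative}% relative)")
```
</preformat>
<p>The absolute column reproduces the 1.81, 9.85, 4.02, and 3.42 figures given in the text. The relative column is larger for recall (roughly 11.5%) because CLIP&#x2019;s recall is the lowest of its four scores.</p>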
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Comparison of models on crop yield prediction dataset and GF-1 WFV dataset.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="4" align="center">Crop yield prediction dataset</th>
<th colspan="4" align="center">GF-1 WFV dataset</th>
</tr>
<tr>
<th align="center">Accuracy</th>
<th align="center">Recall</th>
<th align="center">F1 Score</th>
<th align="center">AUC</th>
<th align="center">Accuracy</th>
<th align="center">Recall</th>
<th align="center">F1 Score</th>
<th align="center">AUC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">CLIP <xref ref-type="bibr" rid="B30">Teng et al. (2021)</xref>
</td>
<td align="center">96.03<inline-formula id="inf370">
<mml:math id="m393">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.12</td>
<td align="center">85.42<inline-formula id="inf371">
<mml:math id="m394">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
<td align="center">89.70<inline-formula id="inf372">
<mml:math id="m395">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.09</td>
<td align="center">91.76<inline-formula id="inf373">
<mml:math id="m396">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
<td align="center">87.67<inline-formula id="inf374">
<mml:math id="m397">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.11</td>
<td align="center">88.75<inline-formula id="inf375">
<mml:math id="m398">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">85.41<inline-formula id="inf376">
<mml:math id="m399">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.10</td>
<td align="center">89.02<inline-formula id="inf377">
<mml:math id="m400">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.05</td>
</tr>
<tr>
<td align="center">ViT <xref ref-type="bibr" rid="B32">Wang et al. (2022a)</xref>
</td>
<td align="center">87.52<inline-formula id="inf378">
<mml:math id="m401">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
<td align="center">92.22<inline-formula id="inf379">
<mml:math id="m402">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.10</td>
<td align="center">83.90<inline-formula id="inf380">
<mml:math id="m403">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">85.69<inline-formula id="inf381">
<mml:math id="m404">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
<td align="center">92.39<inline-formula id="inf382">
<mml:math id="m405">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.09</td>
<td align="center">89.41<inline-formula id="inf383">
<mml:math id="m406">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
<td align="center">89.52<inline-formula id="inf384">
<mml:math id="m407">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.10</td>
<td align="center">93.49<inline-formula id="inf385">
<mml:math id="m408">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
</tr>
<tr>
<td align="center">BLIP <xref ref-type="bibr" rid="B38">Yu et al. (2024)</xref>
</td>
<td align="center">88.07<inline-formula id="inf386">
<mml:math id="m409">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
<td align="center">84.98<inline-formula id="inf387">
<mml:math id="m410">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">87.23<inline-formula id="inf388">
<mml:math id="m411">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.10</td>
<td align="center">87.26<inline-formula id="inf389">
<mml:math id="m412">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
<td align="center">95.13<inline-formula id="inf390">
<mml:math id="m413">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.12</td>
<td align="center">85.21<inline-formula id="inf391">
<mml:math id="m414">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.09</td>
<td align="center">90.33<inline-formula id="inf392">
<mml:math id="m415">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.11</td>
<td align="center">85.39<inline-formula id="inf393">
<mml:math id="m416">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
</tr>
<tr>
<td align="center">SatMAE <xref ref-type="bibr" rid="B5">Cong et al. (2022)</xref>
</td>
<td align="center">91.57<inline-formula id="inf394">
<mml:math id="m417">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.05</td>
<td align="center">91.71<inline-formula id="inf395">
<mml:math id="m418">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.09</td>
<td align="center">85.94<inline-formula id="inf396">
<mml:math id="m419">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
<td align="center">87.16<inline-formula id="inf397">
<mml:math id="m420">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">91.32<inline-formula id="inf398">
<mml:math id="m421">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.10</td>
<td align="center">86.53<inline-formula id="inf399">
<mml:math id="m422">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
<td align="center">87.12<inline-formula id="inf400">
<mml:math id="m423">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.09</td>
<td align="center">90.88<inline-formula id="inf401">
<mml:math id="m424">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
</tr>
<tr>
<td align="center">Scale-MAE <xref ref-type="bibr" rid="B29">Tang et al. (2024)</xref>
</td>
<td align="center">90.05<inline-formula id="inf402">
<mml:math id="m425">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.09</td>
<td align="center">84.95<inline-formula id="inf403">
<mml:math id="m426">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.05</td>
<td align="center">87.85<inline-formula id="inf404">
<mml:math id="m427">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
<td align="center">90.13<inline-formula id="inf405">
<mml:math id="m428">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">88.05<inline-formula id="inf406">
<mml:math id="m429">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.10</td>
<td align="center">92.09<inline-formula id="inf407">
<mml:math id="m430">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.09</td>
<td align="center">91.12<inline-formula id="inf408">
<mml:math id="m431">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">91.17<inline-formula id="inf409">
<mml:math id="m432">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
</tr>
<tr>
<td align="center">ResNet-50 <xref ref-type="bibr" rid="B12">Harini et al. (2024)</xref>
</td>
<td align="center">94.05<inline-formula id="inf410">
<mml:math id="m433">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
<td align="center">85.22<inline-formula id="inf411">
<mml:math id="m434">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
<td align="center">87.26<inline-formula id="inf412">
<mml:math id="m435">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">92.08<inline-formula id="inf413">
<mml:math id="m436">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.10</td>
<td align="center">88.88<inline-formula id="inf414">
<mml:math id="m437">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.05</td>
<td align="center">87.55<inline-formula id="inf415">
<mml:math id="m438">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
<td align="center">89.77<inline-formula id="inf416">
<mml:math id="m439">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.11</td>
<td align="center">84.87<inline-formula id="inf417">
<mml:math id="m440">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
</tr>
<tr>
<td align="center">Ours</td>
<td align="center">97.84<inline-formula id="inf418">
<mml:math id="m441">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">95.27<inline-formula id="inf419">
<mml:math id="m442">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
<td align="center">93.72<inline-formula id="inf420">
<mml:math id="m443">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
<td align="center">95.18<inline-formula id="inf421">
<mml:math id="m444">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.05</td>
<td align="center">98.18<inline-formula id="inf422">
<mml:math id="m445">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.07</td>
<td align="center">94.01<inline-formula id="inf423">
<mml:math id="m446">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.06</td>
<td align="center">94.07<inline-formula id="inf424">
<mml:math id="m447">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.09</td>
<td align="center">95.34<inline-formula id="inf425">
<mml:math id="m448">
<mml:mrow>
<mml:mo>&#x2006;</mml:mo>
<mml:mo>&#xb1;</mml:mo>
<mml:mo>&#x2006;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>0.08</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5 Summary and discussion</title>
<p>In this work, we addressed the challenge of predicting potential distribution areas for double-cropped soybeans under climate change by introducing AgriCLIP, a remote sensing vision-language model. The primary objective was to develop a model that integrates remote sensing imagery with textual data to improve the accuracy and robustness of identifying areas suitable for double-cropping soybeans under varying climatic scenarios. AgriCLIP achieves this by combining multi-scale data processing, self-supervised learning, and cross-modality feature fusion to handle diverse data sources. We evaluated AgriCLIP on four remote sensing datasets: RSICap, RSIEval, MillionAID, and HRSID. Together, these datasets cover different geographic regions, environmental conditions, and agricultural tasks, and the training and validation strategy was designed to assess the model across these scenarios. AgriCLIP consistently outperformed six state-of-the-art models across several key metrics, with notable improvements in accuracy, recall, and F1 score. For instance, compared to prior methods, AgriCLIP showed a 15% increase in recall on the RSICap dataset, indicating its robustness in detecting suitable areas for double-cropping under varying conditions.</p>
<p>To contextualize our findings, we compared AgriCLIP&#x2019;s results with similar studies that used conventional remote sensing or unimodal prediction approaches. For example, previous models relying solely on high-resolution satellite imagery reported lower accuracy in dynamic environments because they integrate little contextual information. AgriCLIP&#x2019;s ability to incorporate textual data and perform cross-modality feature fusion addresses this gap and is consistent with findings from multimodal research in agriculture, where data integration has been shown to enhance predictive power. This comparison highlights AgriCLIP&#x2019;s contribution relative to unimodal approaches.</p>
<p>Despite these promising outcomes, AgriCLIP has limitations. Its reliance on high-resolution imagery and complex multi-scale inputs significantly increases computational demands, which could limit deployment in resource-constrained environments. Future work could develop more efficient variants of AgriCLIP through model compression techniques such as pruning or quantization, with the aim of reducing computational requirements without compromising performance. Additionally, while AgriCLIP integrates remote sensing images and textual data effectively, its predictive accuracy could be improved further by incorporating additional data modalities, such as temporal climate projections, soil data, and socio-economic indicators. These additions would provide a more complete picture of the factors influencing double-cropping, thereby improving the model&#x2019;s generalizability and real-world applicability. More broadly, AgriCLIP could contribute to sustainable agricultural practices and climate adaptation strategies globally. By enabling precise identification of areas suitable for double-cropping, the model could inform policymakers and agricultural planners and support more resilient and efficient food systems. Future research should also explore collaborative efforts to integrate AgriCLIP into decision-support frameworks so that it remains accessible and useful across diverse socio-economic contexts.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>BG: Writing&#x2013;original draft, Writing&#x2013;review and editing.</p>
</sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by (1) the National Natural Science Foundation of China (Grant No. 42171332), (2) the National Natural Key R&#x26;D Program of Shaanxi Province (Grant 2023-ZDLNY-10), and (3) the National Natural Key R&#x26;D Program of China (Grant 2020YFA0607501).</p>
</sec>
<ack>
<p>We acknowledge the financial support provided by the National Natural Science Foundation of China (Grant No. 42171332), the National Natural Key R&#x26;D Program of Shaanxi Province (Grant 2023-ZDLNY-10), and the National Natural Key R&#x26;D Program of China (Grant 2020YFA0607501). These contributions were instrumental in supporting our research and enabling the completion of this work.</p>
</ack>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="s10">
<title>Generative AI statement</title>
<p>The author(s) declare that no Generative AI was used in the creation of this manuscript.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Azhand</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Pirasteh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Varshosaz</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Shahabi</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Abdollahabadi</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Teimouri</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2024</year>). <article-title>Sentinel 1A-2A incorporating an object-based image analysis method for flood mapping and extent assessment</article-title>. <source>ISPRS Ann. Photogrammetry, Remote Sens. Spatial Inf. Sci.</source> <volume>X-1</volume>, <fpage>7</fpage>&#x2013;<lpage>17</lpage>. <pub-id pub-id-type="doi">10.5194/isprs-annals-x-1-2024-7-2024</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benchabana</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kholladi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bensaci</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Khaldi</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Building detection in high-resolution remote sensing images by enhancing superpixel segmentation and classification using deep learning approaches</article-title>. <source>Buildings</source> <volume>13</volume>, <fpage>1649</fpage>. <pub-id pub-id-type="doi">10.3390/buildings13071649</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bigolin</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Talamini</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Impacts of climate change scenarios on the corn and soybean double-cropping system in Brazil</article-title>. <source>Climate</source> <volume>12</volume>, <fpage>42</fpage>. <pub-id pub-id-type="doi">10.3390/cli12030042</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A cyclic information-interaction model for remote sensing image segmentation</article-title>. <source>Remote Sens.</source> <volume>13</volume>, <fpage>3871</fpage>. <pub-id pub-id-type="doi">10.3390/rs13193871</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cong</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Khanna</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Rozi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>SatMAE: pre-training transformers for temporal and multi-spectral satellite imagery</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>35</volume>, <fpage>197</fpage>&#x2013;<lpage>211</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Multiscale and multisubgraph-based segmentation method for ocean remote sensing images</article-title>. <source>IEEE Trans. Geoscience Remote Sens.</source> <volume>61</volume>, <fpage>1</fpage>&#x2013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2023.3247697</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dong</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>A deep learning based framework for remote sensing image ground object segmentation</article-title>. <source>Appl. Soft Comput.</source> <volume>130</volume>, <fpage>109695</fpage>. <pub-id pub-id-type="doi">10.1016/j.asoc.2022.109695</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Incorporating DeepLabv3&#x2b; and object-based image analysis for semantic segmentation of very high resolution remote sensing images</article-title>. <source>Int. J. Digital Earth</source> <volume>14</volume>, <fpage>357</fpage>&#x2013;<lpage>378</lpage>. <pub-id pub-id-type="doi">10.1080/17538947.2020.1831087</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gammans</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>M&#xe9;rel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ortiz-Bobea</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Double cropping as an adaptation to climate change in the United States</article-title>. <source>Am. J. Agric. Econ</source>. <pub-id pub-id-type="doi">10.1111/ajae.12491</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gao</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Deep learning-based key indicator estimation in rivers by leveraging remote sensing image analysis</article-title>. <source>IEEE Access</source> <volume>12</volume>, <fpage>72277</fpage>&#x2013;<lpage>72287</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2024.3399007</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gomes</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Rozario</surname>
<given-names>P. F.</given-names>
</name>
<name>
<surname>Adhikari</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Deep learning optimization in remote sensing image segmentation using dilated convolutions and ShuffleNet</article-title>,&#x201d; in <source>2021 IEEE international conference on electro information technology (EIT)</source>.</citation>
</ref>
<ref id="B12">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Harini</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Selvavarshini</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Narmatha</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Anitha</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Selvi</surname>
<given-names>S. K.</given-names>
</name>
<name>
<surname>Manimaran</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2024</year>). &#x201c;<article-title>ResNet-50 integrated with attention mechanism for remote sensing classification</article-title>,&#x201d; in <source>International conference on advances in distributed computing and machine learning</source> (<publisher-name>Springer</publisher-name>), <fpage>255</fpage>&#x2013;<lpage>265</lpage>.</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Diao</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Multimodal remote sensing image segmentation with intuition-inspired hypergraph modeling</article-title>. <source>IEEE Trans. Image Process.</source> <volume>32</volume>, <fpage>1474</fpage>&#x2013;<lpage>1487</lpage>. <pub-id pub-id-type="doi">10.1109/tip.2023.3245324</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>RSGPT: a remote sensing vision language model and benchmark</article-title>. <source>arXiv preprint arXiv:2307.15266</source>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Deep learning versus object-based image analysis (OBIA) in weed mapping of UAV imagery</article-title>. <source>Int. J. Remote Sens.</source> <volume>41</volume>, <fpage>3446</fpage>&#x2013;<lpage>3479</lpage>. <pub-id pub-id-type="doi">10.1080/01431161.2019.1706112</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jung</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Teuscher</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>B&#xf6;hm</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wells</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Ayasse</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2024</year>). <article-title>Supporting bird diversity and ecological function in managed grassland and forest systems needs an integrative approach</article-title>. <source>Front. Environ. Sci.</source> <volume>12</volume>, <fpage>1401513</fpage>. <pub-id pub-id-type="doi">10.3389/fenvs.2024.1401513</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Junior</surname>
<given-names>C. C.</given-names>
</name>
<name>
<surname>Araki</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>de Campos Macedo</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Object-based image analysis (OBIA) and machine learning (ML) applied to tropical forest mapping using Sentinel-2</article-title>. <source>Can. J. Remote Sens.</source> <volume>49</volume>. <pub-id pub-id-type="doi">10.1080/07038992.2023.2259504</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Kou</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>A review of remote sensing image segmentation by deep learning methods</article-title>. <source>Int. J. Digital Earth</source> <volume>17</volume>. <pub-id pub-id-type="doi">10.1080/17538947.2024.2328827</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Emam</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Jing</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Semi-supervised remote sensing image semantic segmentation method based on deep learning</article-title>. <source>Electronics</source> <volume>12</volume>, <fpage>348</fpage>. <pub-id pub-id-type="doi">10.3390/electronics12020348</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ling</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Image semantic segmentation method based on deep learning in UAV aerial remote sensing image</article-title>. <source>Math. Problems Eng.</source> <volume>2022</volume>, <fpage>1</fpage>&#x2013;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1155/2022/5983045</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Long</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xia</surname>
<given-names>G.-S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>M. Y.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>X. X.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>On creating benchmark dataset for aerial image interpretation: reviews, guidances, and Million-AID</article-title>. <source>IEEE J. Sel. Top. Appl. Earth Observations Remote Sens.</source> <volume>14</volume>, <fpage>4205</fpage>&#x2013;<lpage>4230</lpage>. <pub-id pub-id-type="doi">10.1109/jstars.2021.3070368</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>A multimodal feature fusion network for building extraction with very high-resolution remote sensing image and LiDAR data</article-title>. <source>IEEE Trans. Geoscience Remote Sens.</source> <volume>62</volume>, <fpage>1</fpage>&#x2013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2024.3389110</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Norman</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Shahar</surname>
<given-names>H. M.</given-names>
</name>
<name>
<surname>Mohamad</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Rahim</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mohd</surname>
<given-names>F. A.</given-names>
</name>
<name>
<surname>Shafri</surname>
<given-names>H. Z. M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Urban building detection using object-based image analysis (OBIA) and machine learning (ML) algorithms</article-title>. <source>IOP Conf. Ser. Earth Environ. Sci.</source> <volume>620</volume>, <fpage>012010</fpage>. <pub-id pub-id-type="doi">10.1088/1755-1315/620/1/012010</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qi</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zou</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Remote-sensing image segmentation based on implicit 3-D scene representation</article-title>. <source>IEEE Geoscience Remote Sens. Lett.</source> <volume>19</volume>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/lgrs.2022.3227392</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Quan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Learning SAR-optical cross modal features for land cover classification</article-title>. <source>Remote Sens.</source> <volume>16</volume>, <fpage>431</fpage>. <pub-id pub-id-type="doi">10.3390/rs16020431</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rai</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Aburaed</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Al-Saad</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Al-Ahmad</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Al-Mansoori</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Marshall</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Integrating deep learning with active contour models in remote sensing image segmentation</article-title>,&#x201d; in <source>2020 IEEE International Conference on Electronics, Circuits and Systems (ICECS)</source>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shaar</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Y&#x131;lmaz</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Topcu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Alzoubi</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Remote sensing image segmentation for aircraft recognition using U-Net as deep learning architecture</article-title>. <source>Appl. Sci.</source> <volume>14</volume>, <fpage>2639</fpage>. <pub-id pub-id-type="doi">10.3390/app14062639</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S. Z.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Deep multimodal fusion network for semantic segmentation using remote sensing image and LiDAR data</article-title>. <source>IEEE Trans. Geoscience Remote Sens.</source> <volume>60</volume>, <fpage>1</fpage>&#x2013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2021.3108352</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cozma</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Georgiou</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Cross-scale MAE: a tale of multiscale exploitation in remote sensing</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>36</volume>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teng</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Global to local: Clip-LSTM-based object detection from remote sensing images</article-title>. <source>IEEE Trans. Geoscience Remote Sens.</source> <volume>60</volume>, <fpage>1</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2021.3064840</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tovihoudji</surname>
<given-names>P. G.</given-names>
</name>
<name>
<surname>Sossa</surname>
<given-names>E. L.</given-names>
</name>
<name>
<surname>Egah</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Agbangba</surname>
<given-names>E. C.</given-names>
</name>
<name>
<surname>Akponikp&#xe8;</surname>
<given-names>P. I.</given-names>
</name>
<name>
<surname>Yabi</surname>
<given-names>J. A.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>Resource endowment and sustainable soil fertility management strategies in maize farming systems in northern Benin</article-title>. <source>Front. Sustain. Resour. Manag.</source> <volume>3</volume>, <fpage>1354981</fpage>. <pub-id pub-id-type="doi">10.3389/fsrma.2024.1354981</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2022a</year>). <article-title>A ViT-based multiscale feature fusion approach for remote sensing image segmentation</article-title>. <source>IEEE Geoscience Remote Sens. Lett.</source> <volume>19</volume>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/lgrs.2022.3187135</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2022b</year>). &#x201c;<article-title>A cascaded cross-modal network for semantic segmentation from high-resolution aerial imagery and raw LiDAR data</article-title>,&#x201d; in <source><italic>2022 IEEE International Geoscience and Remote Sensing Symposium (IGARSS)</italic> (IEEE)</source>. <pub-id pub-id-type="doi">10.1109/IGARSS46834.2022.9883824</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wei</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Qu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>HRSID: a high-resolution SAR images dataset for ship detection and instance segmentation</article-title>. <source>IEEE Access</source> <volume>8</volume>, <fpage>120234</fpage>&#x2013;<lpage>120254</lpage>. <pub-id pub-id-type="doi">10.1109/access.2020.3005861</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Efficient transformer for remote sensing image segmentation</article-title>. <source>Remote Sens.</source> <volume>13</volume>, <fpage>3585</fpage>. <pub-id pub-id-type="doi">10.3390/rs13183585</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yan</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>RingMo-SAM: a foundation model for segment anything in multimodal remote-sensing images</article-title>. <source>IEEE Trans. Geoscience Remote Sens.</source> <volume>61</volume>, <fpage>1</fpage>&#x2013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2023.3332219</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ye</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Gu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Hou</surname>
<given-names>B.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>A joint-training two-stage method for remote sensing image captioning</article-title>. <source>IEEE Trans. Geoscience Remote Sens.</source> <volume>60</volume>, <fpage>1</fpage>&#x2013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2022.3224244</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ran</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>An intelligent remote sensing image quality inspection system</article-title>. <source>IET Image Process.</source> <volume>18</volume>, <fpage>678</fpage>&#x2013;<lpage>693</lpage>. <pub-id pub-id-type="doi">10.1049/ipr2.12977</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yuan</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Mou</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hua</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>X. X.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>RRSIS: referring remote sensing image segmentation</article-title>. <source>IEEE Trans. Geoscience Remote Sens.</source> <volume>61</volume>.</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhong</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A deep learning based method for remote sensing image parcel segmentation</article-title>. <source>J. Food Dairy Technol</source>. <pub-id pub-id-type="doi">10.11871/JFDC.ISSN.2096-742X.2021.02.015</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Qiu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Research on segmentation algorithm of UAV remote sensing image based on deep learning</article-title>. <source>Proc. SPIE - Int. Soc. Opt. Eng.</source>, <fpage>13</fpage>. <pub-id pub-id-type="doi">10.1117/12.2668097</pub-id>
</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Guan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Hipple</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2024</year>). <article-title>From satellite-based phenological metrics to crop planting dates: deriving field-level planting dates for corn and soybean in the US Midwest</article-title>. <source>ISPRS J. Photogrammetry Remote Sens.</source> <volume>216</volume>, <fpage>259</fpage>&#x2013;<lpage>273</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2024.07.031</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>