<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Mar. Sci.</journal-id>
<journal-title>Frontiers in Marine Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Mar. Sci.</abbrev-journal-title>
<issn pub-type="epub">2296-7745</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmars.2024.1348883</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Marine Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Underwater small target detection based on dynamic convolution and attention mechanism</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Cheng</surname>
<given-names>Chensheng</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2337669"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wang</surname>
<given-names>Can</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1524841"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Yang</surname>
<given-names>Dianyu</given-names>
</name>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wen</surname>
<given-names>Xin</given-names>
</name>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Weidong</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1611666"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zhang</surname>
<given-names>Feihu</given-names>
</name>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/384178"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff id="aff1">
<institution>School of Marine Science and Technology, Northwestern Polytechnical University</institution>, <addr-line>Xi&#x2019;an</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Xinyu Zhang, Dalian Maritime University, China</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Lanyong Zhang, Harbin Engineering University, China</p>
<p>Tingkai Chen, Dalian Maritime University, China</p>
<p>Boguslaw Cyganek, AGH University of Science and Technology, Poland</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Feihu Zhang, <email xlink:href="mailto:feihu.zhang@nwpu.edu.cn">feihu.zhang@nwpu.edu.cn</email>
</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>12</day>
<month>03</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>11</volume>
<elocation-id>1348883</elocation-id>
<history>
<date date-type="received">
<day>03</day>
<month>12</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>20</day>
<month>02</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2024 Cheng, Wang, Yang, Wen, Liu and Zhang</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Cheng, Wang, Yang, Wen, Liu and Zhang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>In ocean observation missions, unmanned autonomous ocean observation platforms play a crucial role, with precise target detection technology serving as a key support for the autonomous operation of unmanned platforms. Among various underwater sensing devices, side-scan sonar (SSS) has become a primary tool for wide-area underwater detection due to its extensive detection range. However, current research on target detection with SSS primarily focuses on large targets such as sunken ships and aircraft, lacking investigations into small targets. In this study, we collected data on underwater small targets using an unmanned boat equipped with SSS and proposed an enhancement method based on the YOLOv7 model for detecting small targets in SSS images. First, to obtain more accurate initial anchor boxes, we replaced the original k-means algorithm with the k-means++ algorithm. Next, we replaced ordinary convolution blocks in the backbone network with Omni-dimensional Dynamic Convolution (ODConv) to enhance the feature extraction capability for small targets. Subsequently, we inserted a Global Attention Mechanism (GAM) into the neck network to focus on global information and extract target features, effectively addressing the issue of sparse target features in SSS images. Finally, we mitigated the harmful gradients produced by low-quality annotated data by adopting Wise-IoU (WIoU) to improve the detection accuracy of small targets in SSS images. Through validation on the test set, the proposed method showed a significant improvement compared to the original YOLOv7, with increases of 5.05% and 2.51% in <italic>mAP</italic>@0.5 and <italic>mAP</italic>@0.5: 0.95 indicators, respectively. The proposed method demonstrated excellent performance in detecting small targets in SSS images and can be applied to the detection of underwater mines and small equipment, providing effective support for underwater small target detection tasks.</p>
</abstract>
<kwd-group>
<kwd>side-scan sonar</kwd>
<kwd>underwater target detection</kwd>
<kwd>YOLOv7</kwd>
<kwd>K-Means++</kwd>
<kwd>ODConv</kwd>
<kwd>GAM</kwd>
<kwd>WIoU</kwd>
</kwd-group>
<counts>
<fig-count count="15"/>
<table-count count="6"/>
<equation-count count="10"/>
<ref-count count="46"/>
<page-count count="15"/>
<word-count count="6229"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-in-acceptance</meta-name>
<meta-value>Ocean Observation</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>Due to the distinctive attributes of the underwater environment, optical imaging techniques face substantial limitations when deployed underwater. Conversely, sound waves experience minimal attenuation in water, rendering side-scan sonar (SSS) a prevalent tool for underwater target detection.</p>
<p>Sonar target detection methods can be categorized into traditional techniques and Convolutional Neural Network (CNN)-based approaches. Conventional sonar image detection methods predominantly employ pixel-based (<xref ref-type="bibr" rid="B4">Chen et&#xa0;al., 2014</xref>), feature-based (<xref ref-type="bibr" rid="B21">Mukherjee et&#xa0;al., 2011</xref>), and echo-based (<xref ref-type="bibr" rid="B24">Raghuvanshi et&#xa0;al., 2014</xref>) strategies. These methods utilize manually crafted filters founded on pixel value characteristics, grayscale thresholds, or <italic>a priori</italic> information about the targets for detection. However, underwater settings are intricate, and sonar echoes contend with self-noise, reverberation noise, and environmental noise. Consequently, sonar images exhibit low resolution, blurred edge details, and significant speckle noise, complicating the identification of dependable pixel traits and grayscale thresholds. Furthermore, owing to the diminutive illuminated regions and ambiguous target features in acoustic images, even for the same target, discrepancies in the sonar&#x2019;s position, depth, and angle can lead to variations in the morphological attributes of the target within sonar images. Hence, existing conventional algorithms encounter notable constraints in terms of technical feasibility, time requirements, and applicability when confronted with intricate sonar target detection scenarios. A pressing necessity exists for a detection algorithm that remains robust against fluctuations in target morphology in sonar images, mitigates erroneous detections and omissions induced by background noise interference, and exhibits commendable generalization capabilities.</p>
<p>In comparison to traditional methodologies, deep learning approaches rooted in CNN offer substantial advantages due to their capacity to autonomously acquire and extract deep-level features from images. The learned feature parameters often outperform manually devised counterparts, resulting in significantly heightened detection accuracy when applied to large datasets, as compared to traditional methods. Presently, CNN-based object detection methodologies within the domain of optical image processing have attained a mature stage of development. Researchers have progressively extended the application of these technologies to various inspection tasks, such as steel defect detection (<xref ref-type="bibr" rid="B38">Yang et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B45">Zhao et&#xa0;al., 2021</xref>), medical image analysis (<xref ref-type="bibr" rid="B2">Bhattacharya et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B10">Jia et&#xa0;al., 2022</xref>), marine life detection (<xref ref-type="bibr" rid="B5">Chen et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B32">Wang et&#xa0;al., 2023c</xref>), radar image interpretation (<xref ref-type="bibr" rid="B8">Hou et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B42">Zhang et&#xa0;al., 2021a</xref>), agricultural product inspection (<xref ref-type="bibr" rid="B27">Soeb et&#xa0;al., 2023</xref>; <xref ref-type="bibr" rid="B39">Yang et&#xa0;al., 2023</xref>), and more. Significant achievements have been made in each of these fields. Moreover, CNN-based methods can also be employed for image enhancement to improve the quality of blurry images and enhance the recognition of regions of interest (<xref ref-type="bibr" rid="B3">Chen et&#xa0;al., 2023</xref>; <xref ref-type="bibr" rid="B31">Wang et&#xa0;al., 2023b</xref>), thereby enhancing the effectiveness of target detection. 
Therefore, investigating how to apply CNN-based object detection methods more efficiently to underwater acoustic image target detection is a worthwhile research endeavor, and one that can help address the difficulties specific to this field.</p>
<p>Employing deep learning techniques for target detection in SSS images still faces several challenges (<xref ref-type="bibr" rid="B13">Le et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B22">Neupane and Seok, 2020</xref>; <xref ref-type="bibr" rid="B7">Ho&#x17c;y&#x144;, 2021</xref>). First, current target detection networks typically rely on initial anchor boxes derived from extensive optical datasets, which may not suit our SSS dataset. Consequently, the annotated boxes need to be re-clustered to generate initial anchor boxes customized to the specific dataset. Second, factors such as sound wave propagation loss, refraction, and scattering often result in sonar images with low contrast, strong speckle noise, and blurry target edges. Compared with conventional camera images, sonar images differ significantly in texture diversity, color saturation, and feature resolution. Hence, it is imperative to enhance the feature extraction capability of the backbone network and apply appropriate attention mechanisms to target features in sonar images to improve detection accuracy. Lastly, because collecting SSS image data is difficult, obtaining a sufficient quantity of comprehensive, high-quality image data for network training is challenging. This necessitates making the most of all available data, including some lower-quality data, to maximize the average detection accuracy.</p>
<p>In response to these challenges, this paper takes full consideration of the unique characteristics of the SSS dataset. Four improvements are made to the YOLOv7 network to enhance its detection performance for small targets in SSS images. The effectiveness of the proposed improvements is validated through multiple experiments. The main contributions of this paper are as follows:</p>
<list list-type="simple">
<list-item>
<p>1) We replaced the k-means algorithm with k-means++ to recluster the annotated bounding boxes in the SSS dataset, thereby obtaining initial anchor boxes that are more suitable for the sizes of small targets in the dataset.</p>
</list-item>
<list-item>
<p>2) We replaced the static convolutional blocks in the backbone network with Omni-dimensional Dynamic Convolution (ODConv), considering the multi-dimensional information of convolutional kernels. This substitution enhances the feature extraction capability of the network without significantly increasing the number of parameters.</p>
</list-item>
<list-item>
<p>3) In the neck network, five global attention mechanism (GAM) modules are introduced, taking into account global information and enhancing the capability to extract target features. This addresses the challenge of feature sparsity commonly found in SSS images.</p>
</list-item>
<list-item>
<p>4) In the loss function section, we introduced Wise-IoU (WIoU) to address the issue of poor quality in SSS data. Such an improvement can alleviate the adverse impact of low-quality data on gradients, leading to higher data utilization and, consequently, an improvement in the detection accuracy of the trained model.</p>
</list-item>
</list>
<p>The remaining sections of this paper are structured as follows. Section 2 elaborates on related research concerning underwater acoustic target detection. In Section 3, we detail the methodology adopted in this study. The experimental procedure and outcome presentation are outlined in Section 4. Finally, Section 5 provides a summary of this paper and offers prospects for future work.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related work</title>
<p>Extensive research has been undertaken in the domain of underwater acoustic image target detection (<xref ref-type="bibr" rid="B14">Lee et&#xa0;al., 2018</xref>; <xref ref-type="bibr" rid="B43">Zhang et&#xa0;al., 2021b</xref>; <xref ref-type="bibr" rid="B12">Kim et&#xa0;al., 2022</xref>; <xref ref-type="bibr" rid="B28">Tang et&#xa0;al., 2023</xref>). These endeavors encompass the design of specialized functional modules tailored to data characteristics or the adaptation and enhancement of networks originally well-suited for optical data to underwater acoustic data.</p>
<p><xref ref-type="bibr" rid="B11">Jin et&#xa0;al. (2019)</xref> devised EchoNet, a deep neural network architecture that leverages transfer learning to detect sizable objects like airplanes and submerged vessels in forward-looking sonar images. <xref ref-type="bibr" rid="B6">Fan et&#xa0;al. (2021)</xref> introduced a 32-layer residual network to replace ResNet50/101 in Mask R-CNN, reducing the network&#x2019;s parameter count while maintaining object detection accuracy. They also adopted the Adagrad optimizer in place of SGD and evaluated the detection accuracy of the network model through cross-training with a collection of 2500 sonar images. <xref ref-type="bibr" rid="B26">Singh and Valdenegro-Toro (2021)</xref> compared diverse target segmentation networks, including LinkNet, DeepLabV3, PSPNet, and UNet, on a dataset of over 1800 forward-looking sonar images. Their investigation revealed that a UNet network employing ResNet34 as the backbone, tailored for their sonar dataset, achieved the most favorable outcomes. This network was subsequently applied to the detection and segmentation of marine debris. <xref ref-type="bibr" rid="B37">Xiao et&#xa0;al. (2021)</xref> addressed shadow information in acoustic images by introducing a shadow capture module capable of capturing and utilizing shadow data within the feature map. This module, compatible with CNN models, incurred a modest parameter increase and displayed portability. The incorporation of shadow features improved detection accuracy. <xref ref-type="bibr" rid="B35">Wang et&#xa0;al. (2021)</xref> proposed AGFE-Net, a novel sonar image target detection algorithm. This algorithm extended the receptive field of convolutional kernels through multi-scale receptive field feature extraction blocks and self-attention mechanisms, thus acquiring multi-scale feature information from sonar images and enhancing feature correlations. Employing a bidirectional feature pyramid network and an adaptive feature fusion block enabled the acquisition of deep semantic features, suppression of background noise interference, and precise prediction box selection through an adaptive non-maximum suppression algorithm, ultimately enhancing target localization accuracy. To address the issue of suboptimal transfer learning results due to significant domain gaps between optical and sonar images, <xref ref-type="bibr" rid="B18">Li et&#xa0;al. (2023a)</xref> introduced a transfer learning method for sonar image classification and object detection known as the Texture Feature Removal Network. They considered texture features in images as domain-specific features and mitigated domain gaps by discarding these domain-specific features, facilitating a more seamless knowledge transfer process. This approach aims to bridge the gap between optical and sonar image analysis, enhancing the effectiveness of transfer learning techniques.</p>
<p>Owing to their excellent detection performance and ease of deployment, the YOLO series of networks has found wide application in the field of underwater target detection. Additionally, researchers have made numerous enhancements to the YOLO series networks, making them even more suitable for object detection in underwater acoustic images. To address the limitations in detection performance and low detection accuracy resulting from multi-scale image inputs, <xref ref-type="bibr" rid="B15">Li et&#xa0;al. (2023b)</xref> proposed an underwater target detection neural network based on the YOLOv3 algorithm, enhanced with spatial pyramid pooling. The improved neural network demonstrated promising results in the detection of underwater targets, including shipwrecks, schools of fish, and seafloor topography. <xref ref-type="bibr" rid="B17">Li et&#xa0;al. (2021)</xref> introduced an enhanced RBF-SE-YOLOv5 network that reallocates channel information weights to enhance effective information extraction. This enhancement entailed refining the backbone network of the original model and integrating it with RBFNet, thus improving the network&#x2019;s receptive field, feature representation, and capacity to learn vital information. The study demonstrated that amplifying perception information in high receptive fields and integrating multi-scale information augments the efficacy of vital feature extraction. The proposed algorithm notably enhances effective feature extraction, comprehensively captures global information, and mitigates prediction errors and issues of low credibility. Addressing the deficiency in detecting small targets in underwater sonar images, <xref ref-type="bibr" rid="B33">Wang et&#xa0;al. (2022)</xref> harnessed the YOLOv5 framework for marine debris detection. They introduced a multi-branch shuttle network into YOLOv5s and replaced YOLOv5s&#x2019; neck network with BiFPN to augment detection performance. The study also analyzed the impact of uneven target data distribution and network scale on model performance, thereby furnishing reference solutions for ensuring accuracy and speed in target detection. <xref ref-type="bibr" rid="B44">Zhang et&#xa0;al. (2022a)</xref>, building on the YOLOv5 framework, employed the IoU value between initial anchor boxes and target boxes instead of YOLOv5&#x2019;s Euclidean distance as the clustering criterion. This refinement brought the initial anchor boxes closer to true values, enhancing network convergence speed. Additionally, they introduced coordinate information by appending pixel coordinates of the image as extra channels to the feature map and performing convolution operations, consequently improving the accuracy of the detection module&#x2019;s localization regression. <xref ref-type="bibr" rid="B16">Li et&#xa0;al. (2023c)</xref> proposed MA-YOLOv7, a YOLOv7-based network that incorporates multi-scale information fusion and attention mechanisms for target detection and filtering in images. They also introduced a target localization method to determine target positions (latitude and longitude).</p>
<p>However, current research primarily revolves around employing SSS to detect large targets such as airplanes and sunken ships, or using forward-looking sonar to detect small targets at close range. There remains a significant dearth of research focused on utilizing SSS for wide-ranging detection of small underwater targets. This paper constructs a small target SSS dataset based on data collected during experiments and conducts a comprehensive study on small target detection methods in SSS. The primary objective is to facilitate the advancement of the field of small target detection in SSS.</p>
</sec>
<sec id="s3">
<label>3</label>
<title>Improved methods</title>
<p>YOLOv7, introduced in 2022, stands as a one-stage object detection network (<xref ref-type="bibr" rid="B30">Wang et&#xa0;al., 2023a</xref>). It demonstrates outstanding proficiency in both detection speed and accuracy compared to other detection algorithms. In this study, we improved the YOLOv7 model and, through multiple experimental validations, identified four effective improvement points, as illustrated in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>. We applied these enhancements to small target detection in SSS images and obtained notable gains in the detection metrics over the original YOLOv7.</p>
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>Improvements made to YOLOv7.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g001.tif"/>
</fig>
<sec id="s3_1">
<label>3.1</label>
<title>K-means++</title>
<p>To enhance both efficiency and accuracy in detection, this study employs the k-means++ technique (<xref ref-type="bibr" rid="B1">Arthur and Vassilvitskii, 2007</xref>) in place of the k-means approach originally employed in YOLOv7 for clustering anchor boxes within the dataset. In the conventional k-means method, the first phase randomly selects <italic>n</italic> cluster centers from the data samples. Subsequently, the Euclidean distance between each sample and every cluster center is computed, and the sample is assigned to the nearest cluster center. In the subsequent phase, the cluster centers are recomputed and the samples are reassigned. This iterative process is repeated until the cluster centers stabilize.</p>
<p>The k-means++ method is an enhancement of the conventional k-means approach. Rather than generating all cluster centers randomly at once, k-means++ selects one cluster center at a time. The first center is drawn uniformly at random from the samples. For each subsequent center, the method computes the Euclidean distance <italic>D</italic>(<italic>x</italic>) between every sample <italic>x</italic> and its nearest already-selected cluster center, and then derives the probability of each sample being chosen as the next cluster center through <xref ref-type="disp-formula" rid="eq1">Equation 1</xref>.</p>
<disp-formula id="eq1">
<label>(1)</label>
<mml:math display="block" id="M1">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>X</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>D</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Subsequently, the next cluster center is chosen via the roulette wheel selection method. This sequence of steps is repeated until <italic>n</italic> cluster centers are generated. After this stage, the process is the same as in the conventional k-means algorithm: the cluster centers are updated, samples are reassigned, and these steps are iterated until the cluster centers stabilize. Although the k-means++ algorithm spends more time selecting the initial cluster centers, convergence is faster once they are established, and the resulting cluster centers are more representative. This approach mitigates the risk of becoming trapped in local optima.</p>
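<p>The seeding stage described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: it assumes each sample is a (width, height) pair and uses the squared Euclidean distance of Equation 1. The function name <monospace>kmeans_pp_seeds</monospace> and the example boxes are hypothetical.</p>
<preformat>
```python
import random


def kmeans_pp_seeds(samples, n, rng=None):
    """Return n initial cluster centers chosen by k-means++ seeding."""
    rng = rng or random.Random()
    # First center: drawn uniformly at random from the samples.
    centers = [rng.choice(samples)]
    while n > len(centers):
        # D(x)^2: squared Euclidean distance to the nearest chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
            for x in samples
        ]
        if sum(d2) == 0:
            # Every sample coincides with a chosen center: fall back to uniform.
            centers.append(rng.choice(samples))
            continue
        # Roulette wheel selection: P(x) = D(x)^2 / sum of D(x)^2 (Equation 1).
        centers.append(rng.choices(samples, weights=d2, k=1)[0])
    return centers


# Hypothetical (width, height) pairs of annotated boxes, in pixels.
boxes = [(10.0, 12.0), (11.0, 13.0), (50.0, 60.0), (52.0, 58.0), (100.0, 30.0)]
print(kmeans_pp_seeds(boxes, 3, random.Random(0)))
```
</preformat>
<p>A sample that has already been selected has zero distance to its nearest center and therefore zero selection probability, so the seeds are spread across the data. The returned seeds would then be passed to the ordinary k-means update loop.</p>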
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Omni-dimensional dynamic convolution</title>
<p>Most current neural networks employ static convolutional kernels. However, recent research on dynamic convolution proposes computing attention weights from the input and linearly combining <italic>n</italic> convolutional kernels according to these weights. This makes the convolution operation dependent on the input and can substantially improve network accuracy. The experimental results of <xref ref-type="bibr" rid="B19">Li et&#xa0;al. (2022)</xref> demonstrate that the use of ODConv enhances the detection performance for small targets. Therefore, in this study, all convolutional operations in the YOLOv7 backbone network are replaced with ODConv to enhance the detection performance of the network.</p>
<p>The core innovation of ODConv lies in its multi-dimensional dynamic attention mechanism. Traditional dynamic convolution typically achieves dynamism only in the dimension of the number of convolutional kernels, by weighting and combining multiple kernels to adapt to different input features. ODConv extends this concept further by dynamically adjusting not only the number of convolutional kernels but also three other dimensions: spatial size, input channel number, and output channel number. This means that ODConv can adapt more finely to the features of input data, thereby improving the effectiveness of feature extraction.</p>
<p>Additionally, ODConv employs a parallel strategy to simultaneously learn attention across different dimensions. This strategy allows the network to efficiently process features on each dimension while ensuring complementarity and synergy among the dimensions. This is particularly beneficial for handling complex features in SSS images. The network structure is illustrated in <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref>.</p>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>The architecture of the ODConv module.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g002.tif"/>
</fig>
<p>The output after ODConv can be expressed using <xref ref-type="disp-formula" rid="eq2">Equation 2</xref>.</p>
<disp-formula id="eq2">
<label>(2)</label>
<mml:math display="block" id="M2">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2609;</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2609;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2609;</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2609;</mml:mo>
<mml:msub>
<mml:mi>W</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2609;</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2609;</mml:mo>
<mml:msub>
<mml:mi>b</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2609;</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2609;</mml:mo>
<mml:msub>
<mml:mi>W</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>*</mml:mo>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
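<p>where <italic>a</italic> represents the attention parameter for the spatial dimensions of the convolutional kernel, <italic>b</italic> represents the attention parameter for the input channel dimension, <italic>c</italic> represents the attention parameter for the output channel dimension, and <italic>d</italic> represents the attention parameter for the convolutional kernel <italic>W</italic> as a whole. The subscript indexes the <italic>n</italic> candidate kernels, &#x2609; denotes multiplication along the corresponding dimension of the kernel space, and * denotes the convolution operation applied to the input <italic>x</italic>.</p>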
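<p>To make the kernel aggregation in Equation 2 concrete, the sketch below combines <italic>n</italic> candidate kernels into a single kernel that would then be convolved with the input. It is an illustration under stated assumptions: in ODConv the four attention values are predicted from the input feature map, whereas here they are passed in as plain nested lists, and the function name <monospace>aggregate_odconv_kernel</monospace> and the array layout are hypothetical.</p>
<preformat>
```python
def aggregate_odconv_kernel(kernels, spatial, in_ch, out_ch, kernel_w):
    """Combine n candidate kernels into one kernel, following Equation 2.

    kernels[i][o][c][u][v]: weights W_i (out channels, in channels, k, k)
    spatial[i][u][v]:       a_i, attention over kernel spatial positions
    in_ch[i][c]:            b_i, attention over input channels
    out_ch[i][o]:           c_i, attention over output channels
    kernel_w[i]:            d_i, scalar attention for the whole kernel W_i
    """
    n = len(kernels)
    c_out = len(kernels[0])
    c_in = len(kernels[0][0])
    k = len(kernels[0][0][0])
    return [
        [
            [
                [
                    # Sum over the n kernels of d_i * c_i * b_i * a_i * W_i.
                    sum(
                        kernel_w[i] * out_ch[i][o] * in_ch[i][c]
                        * spatial[i][u][v] * kernels[i][o][c][u][v]
                        for i in range(n)
                    )
                    for v in range(k)
                ]
                for u in range(k)
            ]
            for c in range(c_in)
        ]
        for o in range(c_out)
    ]


# Toy case: two 1x1 kernels with one input and one output channel.
toy = aggregate_odconv_kernel(
    kernels=[[[[[2.0]]]], [[[[3.0]]]]],
    spatial=[[[0.5]], [[1.0]]],
    in_ch=[[1.0], [2.0]],
    out_ch=[[1.0], [1.0]],
    kernel_w=[0.25, 0.75],
)
print(toy)
```
</preformat>
<p>Because the attention values are applied to the kernels before the convolution, only one convolution with the aggregated kernel is needed per input, regardless of <italic>n</italic>.</p>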
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Global attention mechanism</title>
<p>The incorporation of attention mechanisms within neural networks draws inspiration from human visual attention, enhancing feature extraction by assigning distinct weights to various channels within neural network feature layers. This strategy enables the model to concentrate on pertinent information while disregarding irrelevant data, leading to resource conservation and augmented model performance. Several mainstream attention mechanisms, such as SE-Net (<xref ref-type="bibr" rid="B9">Hu et&#xa0;al., 2018</xref>), ECA-Net (<xref ref-type="bibr" rid="B34">Wang et&#xa0;al., 2020</xref>), BAM (<xref ref-type="bibr" rid="B23">Park et&#xa0;al., 2018</xref>), CBAM (<xref ref-type="bibr" rid="B36">Woo et&#xa0;al., 2018</xref>) and GAM (<xref ref-type="bibr" rid="B20">Liu et&#xa0;al., 2021</xref>), have been demonstrated to enhance the detection performance of models.</p>
<p>The GAM represents a form of global attention mechanism that curtails information loss and amplifies interactions across global dimensions. Consequently, the neural network&#x2019;s aptitude for extracting target features is bolstered. The schematic depiction of the GAM module structure is presented in <xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3</bold>
</xref>.</p>
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>The structure diagram of the GAM module.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g003.tif"/>
</fig>
<p>GAM applies channel attention and spatial attention sequentially, with the aim of amplifying global cross-dimension interactions while reducing information loss. In the channel attention submodule, a three-dimensional arrangement is used to retain information across all three dimensions: the input feature map undergoes a dimensional transformation and is passed through an MLP, the result is restored to the original dimensions, and a sigmoid function produces the final output.</p>
<p>In the spatial attention submodule, two convolutional layers fuse spatial information in order to strengthen the focus on spatial features. First, a convolution with a kernel size of 7 reduces the number of channels and thus the computational cost. A second convolution with a kernel size of 7 then restores the number of channels so that the output matches the input channel count. The result is passed through a sigmoid function.</p>
<p>In order to enhance the detection performance of the detection network, we introduced GAM modules at five distinct locations in the neck network. The architecture of the YOLOv7 network with added GAM modules, as well as the specific structures of individual sub-modules within the network, are illustrated in <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref>.</p>
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>The network architecture diagram of the improved YOLOv7.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g004.tif"/>
</fig>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Wise-IoU</title>
<p>The bounding box regression function holds a pivotal role in object detection by enhancing object localization accuracy, accommodating objects of varying scales, rectifying object orientations and shapes, and bolstering algorithmic robustness. This collective functionality contributes significantly to the advancement of object detection algorithms.</p>
<p>However, the majority of current research on Intersection over Union (IoU) (<xref ref-type="bibr" rid="B40">Yu et&#xa0;al., 2016</xref>) assumes that the training data consists of high-quality samples, with their primary focus being on enhancing the fitting capability of bounding box regression loss functions, such as Generalized-IoU (GIoU) (<xref ref-type="bibr" rid="B25">Rezatofighi et&#xa0;al., 2019</xref>), Distance-IoU (DIoU) (<xref ref-type="bibr" rid="B46">Zheng et&#xa0;al., 2020</xref>), Complete-IoU (CIoU) (<xref ref-type="bibr" rid="B46">Zheng et&#xa0;al., 2020</xref>), and Efficient-IoU (EIoU) (<xref ref-type="bibr" rid="B41">Zhang et&#xa0;al., 2022b</xref>), as shown in <xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref>, where their advantages and disadvantages are compared. Yet, when dealing with datasets that contain a significant amount of inaccurately annotated low-quality data, blindly intensifying the fitting ability of the bounding box regression loss function can have detrimental effects on the model&#x2019;s learning process.</p>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Comparison of the advantages and shortcomings of different IoU methods.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="left"/>
<th valign="top" align="left">Overlapping</th>
<th valign="top" align="left">Center Point</th>
<th valign="top" align="left">Aspect Ratio</th>
<th valign="top" align="center">Advantage</th>
<th valign="top" align="center">Shortcoming</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">IoU</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">Scale-invariant and non-negative.</td>
<td valign="top" align="center">If two boxes do not intersect, it cannot reflect the distance and cannot accurately reflect the degree of overlap between the two boxes.</td>
</tr>
<tr>
<td valign="top" align="left">GIoU</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">Addresses the issue that IoU is zero, and therefore provides no gradient, when there is no overlap between the detection box and the ground truth box.</td>
<td valign="top" align="center">When one of the two boxes contains the other, GIoU degenerates into IoU, and when the two boxes intersect, convergence is slow in both the horizontal and vertical directions.</td>
</tr>
<tr>
<td valign="top" align="left">DIoU</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">Directly regressing the Euclidean distance between the centers of the two boxes to accelerate convergence.</td>
<td valign="top" align="center">The aspect ratio of the bounding boxes is not considered during regression, so there is still room for improvement in accuracy.</td>
</tr>
<tr>
<td valign="top" align="left">CIoU</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">Introducing loss terms for the scale of the detection box, as well as for its length and width, which makes the predicted box better match the ground truth.</td>
<td valign="top" align="center">The aspect ratio describes relative values, introducing some degree of ambiguity and not considering the balance of difficulty levels among samples.</td>
</tr>
<tr>
<td valign="top" align="left">EIoU</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">Calculating differences in width and height instead of aspect ratio, while also incorporating Focal Loss to tackle the problem of imbalanced difficulty levels among samples.</td>
<td valign="top" align="center">More attention is given to high-quality anchor boxes, with insufficient focus on low-quality anchor boxes.</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In SSS imagery, targets are highly susceptible to noise interference in the generated images, which presents significant challenges for annotation. In the process of manual annotation, inaccuracies inevitably arise, as illustrated in <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref>. If the annotation boxes are flawed, a well-trained detection model may still generate high-quality anchor boxes for these low-quality samples; the loss <italic>L<sub>IoU</sub>
</italic> is then relatively large and yields a substantial gradient gain, which drives the model to learn in an unfavorable direction. This problem is particularly relevant for SSS imagery, where annotation is difficult.</p>
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>Annotated samples of differing quality. <bold>(A&#x2013;D)</bold> are low-quality samples with inaccurate annotations, while <bold>(E&#x2013;H)</bold> are samples with accurate annotations.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g005.tif"/>
</fig>
<p>To address the poor quality of underwater SSS data, we adopt WIoU (<xref ref-type="bibr" rid="B29">Tong et&#xa0;al., 2023</xref>) as the bounding box loss function, with the aim of alleviating the impact of low-quality annotations. WIoU employs a dynamic non-monotonic focusing mechanism that evaluates anchor box quality through an outlier degree instead of IoU. This provides a judicious gradient gain allocation strategy that reduces the competitiveness of high-quality anchor boxes while attenuating the harmful gradients produced by low-quality samples. Consequently, WIoU prioritizes anchor boxes of moderate quality, improving overall detector performance.</p>
<p>The symbols used in WIoU are illustrated in <xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>. In this figure, the blue box is the smallest box enclosing both the predicted box and the ground truth box, and the red line connects the centers of the ground truth box and the predicted box. The union area is denoted as <inline-formula>
<mml:math display="inline" id="im1">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>w</mml:mi>
<mml:mi>h</mml:mi>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>W</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mi>H</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
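</inline-formula>, with <italic>W<sub>i</sub></italic> and <italic>H<sub>i</sub></italic> denoting the width and height of the overlapping region.</p>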
<fig id="f6" position="float">
<label>Figure&#xa0;6</label>
<caption>
<p>The symbol definitions in WIoU.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g006.tif"/>
</fig>
<p>WIoU v1 constructs a two-layer attention mechanism based on a distance metric, as expressed by <xref ref-type="disp-formula" rid="eq3">Equations 3</xref>, <xref ref-type="disp-formula" rid="eq4">4</xref>.</p>
<disp-formula id="eq3">
<label>(3)</label>
<mml:math display="block" id="M3">
<mml:mrow>
<mml:msub>
<mml:mi>&#x2112;</mml:mi>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
<mml:mi>v</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>&#x211b;</mml:mi>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>&#x2112;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="eq4">
<label>(4)</label>
<mml:math display="block" id="M4">
<mml:mrow>
<mml:msub>
<mml:mi>&#x211b;</mml:mi>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mtext>exp&#xa0;</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mi>H</mml:mi>
<mml:mi>g</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>*</mml:mo>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im2">
<mml:mrow>
<mml:msub>
<mml:mi>&#x2112;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <italic>W<sub>g</sub>
</italic> and <italic>H<sub>g</sub>
</italic> are the dimensions of the minimum bounding box, <inline-formula>
<mml:math display="inline" id="im3">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im4">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
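</inline-formula> represent the center coordinates of the predicted box and the ground truth box, respectively. The superscript * indicates that the term is detached from the computational graph, so that it contributes no gradient.</p>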
<p>Building on WIoU v1, the outlier degree <italic>&#x3b2;</italic> of an anchor box is defined as the ratio of its IoU loss to the mean IoU loss, as given in <xref ref-type="disp-formula" rid="eq5">Equation 5</xref>.</p>
<disp-formula id="eq5">
<label>(5)</label>
<mml:math display="block" id="M5">
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mi>&#x2112;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
<mml:mo>*</mml:mo>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi>&#x2112;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>&#x221e;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Finally, a non-monotonic focusing coefficient <italic>r</italic> is constructed from <italic>&#x3b2;</italic> and applied to WIoU v1, which gives <xref ref-type="disp-formula" rid="eq6">Equation 6</xref>, where <italic>&#x3b1;</italic> and <italic>&#x3b4;</italic> are hyperparameters.</p>
<disp-formula id="eq6">
<label>(6)</label>
<mml:math display="block" id="M6">
<mml:mrow>
<mml:msub>
<mml:mi>&#x2112;</mml:mi>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>r</mml:mi>
<mml:msub>
<mml:mi>&#x2112;</mml:mi>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
<mml:mi>v</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>&#x3b2;</mml:mi>
<mml:mrow>
<mml:mi>&#x3b4;</mml:mi>
<mml:msup>
<mml:mi>&#x3b1;</mml:mi>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>A small outlier degree indicates a high-quality anchor box, which is assigned a small gradient gain so that bounding box regression concentrates on anchor boxes of intermediate quality. Anchor boxes with a large outlier degree are likewise assigned a small gradient gain, which effectively prevents low-quality samples from producing large harmful gradients. Notably, as <inline-formula>
<mml:math display="inline" id="im5">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mi>&#x2112;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="true">&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> is dynamic, the quality threshold used to classify anchor boxes is also adaptive. This allows WIoU to allocate gradient gains according to the current state of training at every step.</p>
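<p>As an illustration only, Equations 3&#x2013;6 can be evaluated for a single pair of boxes with the following minimal Python sketch. It assumes that the IoU loss is 1 &#x2212; IoU (consistent with the stated range [0, 1]), that boxes are given as (center x, center y, width, height), and that the mean IoU loss is supplied by the caller. The function names and the default values of <italic>&#x3b1;</italic> and <italic>&#x3b4;</italic> are hypothetical choices for this example and are not taken from the experiments reported here.</p>
<preformat><![CDATA[
```python
import math


def iou_terms(box, gt):
    """Return (IoU, W_g, H_g) for two boxes given as (cx, cy, w, h)."""
    x, y, w, h = box
    xg, yg, wg, hg = gt
    # Width and height of the overlap (W_i, H_i), clipped at zero if disjoint.
    wi = max(0.0, min(x + w / 2, xg + wg / 2) - max(x - w / 2, xg - wg / 2))
    hi = max(0.0, min(y + h / 2, yg + hg / 2) - max(y - h / 2, yg - hg / 2))
    union = w * h + wg * hg - wi * hi  # S_u
    # Width and height of the smallest enclosing box (W_g, H_g).
    we = max(x + w / 2, xg + wg / 2) - min(x - w / 2, xg - wg / 2)
    he = max(y + h / 2, yg + hg / 2) - min(y - h / 2, yg - hg / 2)
    return wi * hi / union, we, he


def focus_gain(beta, alpha=1.9, delta=3.0):
    """Non-monotonic gain r of Equation 6 (alpha, delta: illustrative values)."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_loss(box, gt, mean_l_iou, alpha=1.9, delta=3.0):
    """Return (L_WIoU, beta, r) for one predicted box and one ground truth box."""
    iou, we, he = iou_terms(box, gt)
    l_iou = 1.0 - iou  # assumed form of the IoU loss
    # Equation 4. Plain floats carry no gradient, so no explicit detach is needed.
    dist2 = (box[0] - gt[0]) ** 2 + (box[1] - gt[1]) ** 2
    r_wiou = math.exp(dist2 / (we ** 2 + he ** 2))
    l_v1 = r_wiou * l_iou            # Equation 3
    beta = l_iou / mean_l_iou        # Equation 5
    r = focus_gain(beta, alpha, delta)
    return r * l_v1, beta, r         # Equation 6
```
]]></preformat>
<p>With these illustrative defaults, the gain <italic>r</italic> equals 1 when <italic>&#x3b2;</italic> = <italic>&#x3b4;</italic> and decreases for both very small and very large <italic>&#x3b2;</italic>, which is the non-monotonic behavior described above.</p>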
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments</title>
<sec id="s4_1">
<label>4.1</label>
<title>Experiment platform</title>
<p>The experiments presented in this study were carried out on an Ubuntu 20.04 system, serving to corroborate the efficacy of the proposed enhanced detection algorithm. Detailed configuration parameters of the system are provided in <xref ref-type="table" rid="T2">
<bold>Table&#xa0;2</bold>
</xref>.</p>
<table-wrap id="T2" position="float">
<label>Table&#xa0;2</label>
<caption>
<p>Experimental environment settings.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="center">Component</th>
<th valign="top" align="center">Specification</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">Operating system</td>
<td valign="top" align="center">Ubuntu 20.04 (64-bit)</td>
</tr>
<tr>
<td valign="top" align="center">Deep learning framework</td>
<td valign="top" align="center">PyTorch 1.11</td>
</tr>
<tr>
<td valign="top" align="center">Programming language</td>
<td valign="top" align="center">Python 3.9</td>
</tr>
<tr>
<td valign="top" align="center">GPU accelerated environment</td>
<td valign="top" align="center">CUDA 11.3</td>
</tr>
<tr>
<td valign="top" align="center">Graphics Card (GPU)</td>
<td valign="top" align="left">Nvidia GeForce RTX 3090</td>
</tr>
<tr>
<td valign="top" align="center">Processor (CPU)</td>
<td valign="top" align="left">Platinum 8255C CPU @ 2.50GHz</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Model evaluation metrics</title>
<p>When evaluating the detection performance of the improved YOLOv7, we employed evaluation metrics including Recall (R), Precision (P), Average Precision (AP), and mean Average Precision (mAP). The calculation methods of these four indicators can be expressed by <xref ref-type="disp-formula" rid="eq7">Equations 7</xref>&#x2013;<xref ref-type="disp-formula" rid="eq10">10</xref> respectively.</p>
<disp-formula id="eq7">
<label>(7)</label>
<mml:math display="block" id="M7">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo stretchy="false">/</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="eq8">
<label>(8)</label>
<mml:math display="block" id="M8">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo stretchy="false">/</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="eq9">
<label>(9)</label>
<mml:math display="block" id="M9">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>&#x222b;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mn>1</mml:mn>
</mml:munderover>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mi>d</mml:mi>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="eq10">
<label>(10)</label>
<mml:math display="block" id="M10">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:mi>A</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">/</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
<p>In these equations, True Positive (TP) is the number of positive samples that are correctly detected, False Positive (FP) is the number of negative samples that are incorrectly detected as positive, and False Negative (FN) is the number of positive samples that are missed. The variable <italic>N</italic> is the total number of target categories.</p>
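<p>Equations 7&#x2013;10 can be sketched in Python as follows. This is a minimal example rather than the evaluation code used in this study: the function names are hypothetical, and AP is approximated by trapezoidal integration over sampled points of the PR curve, whereas detection toolkits commonly use interpolated variants that may give slightly different values.</p>
<preformat><![CDATA[
```python
def precision_recall(tp, fp, fn):
    """Equations 7 and 8; returns (precision, recall)."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Equation 9, approximated with the trapezoidal rule.

    The caller is expected to supply points that span recall 0 to 1.
    """
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """Equation 10: mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```
]]></preformat>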
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Dataset preparation</title>
<p>In the data acquisition phase, we deployed objects of two types, namely cylindrical and conical structures, as detection targets in the experimental marine area. We utilized the SS3060 dual-frequency SSS as the detector for data collection. The SSS and detection targets are illustrated in <xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref>. The SSS is 100<italic>mm</italic> in diameter and 1250<italic>mm</italic> in length, and weighs 25<italic>kg</italic> in air and 12<italic>kg</italic> in water. Its performance parameters are presented in <xref ref-type="table" rid="T3">
<bold>Table&#xa0;3</bold>
</xref>. During the experiment, the SSS was mounted beneath an unmanned boat. The GPS signals from the unmanned boat were used to verify that the features visible in the SSS images corresponded to the physically deployed targets, which made it possible to build a high-quality dataset.</p>
<fig id="f7" position="float">
<label>Figure&#xa0;7</label>
<caption>
<p>The SSS and preset targets. <bold>(A)</bold> SSS. <bold>(B)</bold> Preset targets.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g007.tif"/>
</fig>
<table-wrap id="T3" position="float">
<label>Table&#xa0;3</label>
<caption>
<p>Performance parameters of the SSS.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="center">Frequency</th>
<th valign="top" align="center">300kHz</th>
<th valign="top" align="left">600kHz</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">Maximum range</td>
<td valign="top" align="center">150<italic>m</italic>
</td>
<td valign="top" align="center">100<italic>m</italic>
</td>
</tr>
<tr>
<td valign="top" align="center">Maximum slope distance</td>
<td valign="top" align="center">230<italic>m</italic>
</td>
<td valign="top" align="center">200<italic>m</italic>
</td>
</tr>
<tr>
<td valign="top" align="center">Horizontal beam width</td>
<td valign="top" align="center">0.5&#xb0;</td>
<td valign="top" align="center">0.26&#xb0;</td>
</tr>
<tr>
<td valign="top" align="center">Vertical beam width</td>
<td valign="top" align="center">50&#xb0;</td>
<td valign="top" align="center">50&#xb0;</td>
</tr>
<tr>
<td valign="top" align="center">Horizontal resolution</td>
<td valign="top" align="center">1.3<italic>m</italic>
</td>
<td valign="top" align="center">0.45<italic>m</italic>
</td>
</tr>
<tr>
<td valign="top" align="center">Vertical resolution</td>
<td valign="top" align="center">2.5<italic>m</italic>
</td>
<td valign="top" align="center">1.25<italic>m</italic>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>After deploying the targets, to ensure the diversity of the collected dataset, we employed two different survey paths in the target water area to perform a comprehensive scan of underwater targets. The placement of the targets and the scanning paths are illustrated in <xref ref-type="fig" rid="f8">
<bold>Figure&#xa0;8</bold>
</xref>. In the figure, the lateral distance between the targets is approximately 50 meters, and the longitudinal distance is approximately 100 meters. Due to the influence of underwater currents, some degree of deviation in this distance is inevitably present.</p>
<fig id="f8" position="float">
<label>Figure&#xa0;8</label>
<caption>
<p>The target deployment locations and scanning paths. <bold>(A)</bold> Scanning path 1. <bold>(B)</bold> Scanning path 2.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g008.tif"/>
</fig>
<p>Due to the complex and variable underwater environment, as well as the susceptibility of images to noise interference, the images acquired using SSS also exhibit significant variations, as shown in <xref ref-type="fig" rid="f9">
<bold>Figure&#xa0;9</bold>
</xref>.</p>
<fig id="f9" position="float">
<label>Figure&#xa0;9</label>
<caption>
<p>Acquired Sonar Images. <bold>(A)</bold> Background Images. <bold>(B)</bold> Images with Targets. <bold>(C)</bold> Target Images in Complex Environments. <bold>(D)</bold> Interfered Images.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g009.tif"/>
</fig>
<p>Discerning target features in SSS images is challenging, and manual annotation after data collection is arduous, which makes on-site, real-time labeling the preferable strategy. To process SSS images in real time, image segments are extracted from the sonar waterfall plot at intervals of 30 seconds, as illustrated in <xref ref-type="fig" rid="f10">
<bold>Figure&#xa0;10</bold>
</xref>. This approach facilitates the annotation of targets on SSS images in real-time, while accounting for the field environment and GPS coordinates.</p>
<fig id="f10" position="float">
<label>Figure&#xa0;10</label>
<caption>
<p>Preprocessing of SSS images. The images are partitioned into small patches of 200&#xd7;200 pixels, with a 50-pixel overlap between adjacent patches to prevent the loss of target characteristics.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g010.tif"/>
</fig>
<p>Furthermore, the targets occupy only a very small proportion of the complete SSS image. Training the network directly on full-size SSS images would generate an excessive number of negative samples, which could impede training and waste computational resources. Moreover, in practical applications the network must be deployed on resource-constrained autonomous underwater vehicles, so the size of the images fed into the detection network has to be restricted. To address these challenges, we partitioned the images into small patches of 200 &#xd7; 200 pixels, with a 50-pixel overlap between adjacent patches to prevent the loss of target characteristics. Only the patches containing targets were selected for training, which significantly reduces the number of irrelevant negative samples arising from background information. During the detection phase, the complete image is cropped in the same way before being fed into the detection network.</p>
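<p>The partitioning can be sketched in Python as follows. A patch size of 200 pixels with a 50-pixel overlap corresponds to a stride of 150 pixels. The function names are hypothetical, and the handling of the image border (the last patch is shifted back so that it ends exactly at the border) is an assumption of this example, since the border treatment is not specified above.</p>
<preformat><![CDATA[
```python
def patch_origins(length, patch=200, overlap=50):
    """Start offsets of the patches along one image axis."""
    stride = patch - overlap
    if length <= patch:
        # Image not larger than one patch: a single patch (padding may be needed).
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] + patch < length:
        # Assumed border rule: add a final patch that ends at the image border.
        starts.append(length - patch)
    return starts


def patch_boxes(width, height, patch=200, overlap=50):
    """Patch rectangles as (left, top, right, bottom), in row-major order."""
    return [
        (x, y, x + patch, y + patch)
        for y in patch_origins(height, patch, overlap)
        for x in patch_origins(width, patch, overlap)
    ]
```
]]></preformat>
<p>For example, under these assumptions an image 500 pixels wide and 350 pixels high is covered by three patches per row and two per column.</p>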
<p>Finally, we filtered out unusable data and conducted data augmentation using high-quality data, yielding a total of 975 sample images. These images include 293 Cones, 318 Cylinders, and 364 Non-target instances. (&#x201c;Non-target&#x201d; refers to miscellaneous items on the seafloor, such as rocks or accidentally dropped artificial objects, which were not intentionally deployed by us. Despite not being the primary focus of the experiment, these Non-target items share certain similarities with the intentionally deployed targets. Including them in the dataset is essential, as their presence could potentially impact our ability to detect the deployed targets.) These samples were then randomly divided into training, validation, and test sets in a 7:1:2 ratio, with the specific number of samples for each set as shown in <xref ref-type="table" rid="T4">
<bold>Table&#xa0;4</bold>
</xref>.</p>
<table-wrap id="T4" position="float">
<label>Table&#xa0;4</label>
<caption>
<p>The actual dimensions of underwater targets and the final dataset sample size.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="middle" rowspan="2" align="center">Category</th>
<th valign="top" colspan="3" align="center">Target</th>
<th valign="top" align="center" colspan="4">Dataset</th>
</tr>
<tr>
<th valign="top" align="center">Diameter</th>
<th valign="top" align="center">Height</th>
<th valign="top" align="center">Number</th>
<th valign="top" align="center">Train</th>
<th valign="top" align="center">Val</th>
<th valign="top" align="center">Test</th>
<th valign="top" align="center">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">Cone</td>
<td valign="top" align="center">0.30<italic>m</italic>/0.50<italic>m</italic>
</td>
<td valign="top" align="center">0.60<italic>m</italic>
</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">205</td>
<td valign="top" align="center">29</td>
<td valign="top" align="center">59</td>
<td valign="top" align="center">293</td>
</tr>
<tr>
<td valign="top" align="center">Cylinder</td>
<td valign="top" align="center">0.50<italic>m</italic>
</td>
<td valign="top" align="center">1.00<italic>m</italic>
</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">223</td>
<td valign="top" align="center">31</td>
<td valign="top" align="center">64</td>
<td valign="top" align="center">318</td>
</tr>
<tr>
<td valign="top" align="center">Non-target</td>
<td valign="top" align="center">/</td>
<td valign="top" align="center">/</td>
<td valign="top" align="center">/</td>
<td valign="top" align="center">255</td>
<td valign="top" align="center">36</td>
<td valign="top" align="center">73</td>
<td valign="top" align="center">364</td>
</tr>
</tbody>
</table>
</table-wrap>
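<p>The 7:1:2 division can be sketched in Python as follows. The text states only that the samples were divided randomly; splitting each category separately, the fixed random seed, and the rounding rule (training count rounded to the nearest integer, validation count rounded down, remainder used for testing) are assumptions of this example. The rounding rule was chosen because it gives the per-category counts listed in Table&#xa0;4, not because it is known to be the procedure actually used.</p>
<preformat><![CDATA[
```python
import random


def split_counts(n):
    """Return (train, val, test) sizes for n samples in a 7:1:2 ratio."""
    train = (7 * n + 5) // 10  # 70 percent, rounded to the nearest integer
    val = n // 10              # 10 percent, rounded down
    return train, val, n - train - val


def split_indices(n, seed=0):
    """Randomly partition the indices 0..n-1 into train, val and test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical choice
    train, val, _ = split_counts(n)
    return indices[:train], indices[train:train + val], indices[train + val:]
```
]]></preformat>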
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Experiment results</title>
<p>To validate the effectiveness of the proposed algorithm for detecting small targets in SSS imagery, we tested it on a real dataset collected during our sea trials. The evolution of the loss functions and accuracy metrics during training is illustrated in <xref ref-type="fig" rid="f11">
<bold>Figure&#xa0;11</bold>
</xref>.</p>
<fig id="f11" position="float">
<label>Figure&#xa0;11</label>
<caption>
<p>The loss function and relevant metrics during the training process of the improved YOLOv7. The horizontal axis in the figure represents the number of training epochs.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g011.tif"/>
</fig>
<p>To ensure that all the introduced modifications exerted a positive influence on the network, a sequence of ablation experiments was carried out. The results of these experiments are presented in <xref ref-type="table" rid="T5">
<bold>Table&#xa0;5</bold>
</xref>. In the table, <italic>mAP</italic>@0.5 denotes the mean average precision at an IoU threshold of 0.5, while <italic>mAP</italic>@0.5: 0.95 denotes the average of the mAP values over IoU thresholds ranging from 0.5 to 0.95. The results show that integrating k-means++, ODConv, GAM, and WIoU improves the detection performance of the original YOLOv7 model on our SSS dataset. The Precision-Recall (PR) curves of the improved and original YOLOv7 networks on the test set are compared in <xref ref-type="fig" rid="f12">
<bold>Figure&#xa0;12</bold>
</xref>, while the comparison of confusion matrices is shown in <xref ref-type="fig" rid="f13">
<bold>Figure&#xa0;13</bold>
</xref>. From <xref ref-type="fig" rid="f12">
<bold>Figure&#xa0;12</bold>
</xref>, it can be observed that the improved YOLOv7 network raised <italic>mAP</italic>@0.5 on the test set by 5.05 percentage points (from 90.73% to 95.78%).</p>
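<p>As an illustration of how the two metrics relate, the following minimal Python sketch averages per-threshold values into a single <italic>mAP</italic>@0.5: 0.95 score. It assumes the conventional IoU step of 0.05, which is not stated explicitly in this paper. The per-threshold values and the helper names <monospace>iou_thresholds</monospace> and <monospace>mean_over_thresholds</monospace> are hypothetical and are not results or code from this study.</p>
<preformat>
```python
# Illustrative sketch only: averaging mAP over IoU thresholds.
# ASSUMPTION: thresholds run from 0.50 to 0.95 in steps of 0.05.
# ASSUMPTION: the per-threshold values below are made up for illustration.


def iou_thresholds(start=0.5, stop=0.95, step=0.05):
    """Return the IoU thresholds 0.50, 0.55, ..., 0.95."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def mean_over_thresholds(map_by_threshold):
    """Average the per-threshold mAP values into one mAP@0.5:0.95 score."""
    values = list(map_by_threshold.values())
    return sum(values) / len(values)


thresholds = iou_thresholds()

# Hypothetical mAP at each threshold: the score drops as the IoU
# requirement becomes stricter.
hypothetical_map = {t: 0.95 - 1.8 * (t - 0.5) for t in thresholds}

map_50 = hypothetical_map[0.5]
map_50_95 = mean_over_thresholds(hypothetical_map)

print(f"mAP@0.5      = {map_50:.3f}")
print(f"mAP@0.5:0.95 = {map_50_95:.3f}")
```
</preformat>
<p>Because the stricter thresholds require tighter localization, <italic>mAP</italic>@0.5: 0.95 is always at most <italic>mAP</italic>@0.5 when the per-threshold values decrease with the threshold, which is consistent with the gap between the two metric columns reported in this section.</p>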
<table-wrap id="T5" position="float">
<label>Table&#xa0;5</label>
<caption>
<p>Ablation experiment.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="center">Model</th>
<th valign="top" align="center">K-means++</th>
<th valign="top" align="center">ODConv</th>
<th valign="top" align="center">GAM</th>
<th valign="top" align="center">WIoU</th>
<th valign="top" align="center">
<italic>mAP</italic>@0.5(%)</th>
<th valign="top" align="center">
<italic>mAP</italic>@0.5: 0.95(%)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="center" rowspan="5">YOLOv7</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">90.73</td>
<td valign="top" align="center">49.78</td>
</tr>
<tr>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">91.77(1.04&#x2191;)</td>
<td valign="top" align="center">50.39(0.61&#x2191;)</td>
</tr>
<tr>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">93.28(2.55&#x2191;)</td>
<td valign="top" align="center">51.17(1.39&#x2191;)</td>
</tr>
<tr>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#xd7;</td>
<td valign="top" align="center">94.49(3.76&#x2191;)</td>
<td valign="top" align="center">51.79(2.01&#x2191;)</td>
</tr>
<tr>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">&#x2713;</td>
<td valign="top" align="center">95.78(5.05&#x2191;)</td>
<td valign="top" align="center">52.29(2.51&#x2191;)</td>
</tr>
</tbody>
</table>
</table-wrap>
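<p>The gains listed in parentheses in <bold>Table&#xa0;5</bold> can be re-derived from the raw scores. The short Python sketch below does this using the values transcribed from the table. The row labels and the function names <monospace>gains_over_baseline</monospace> and <monospace>stepwise_gains</monospace> are introduced here for illustration only, and the per-step differences are simple subtractions between consecutive rows rather than figures reported by the authors.</p>
<preformat>
```python
# Values transcribed from Table 5: (configuration, mAP@0.5 %, mAP@0.5:0.95 %).
# The row labels are shorthand introduced for this sketch; each row adds one
# component on top of the previous row.
ablation = [
    ("YOLOv7 (baseline)", 90.73, 49.78),
    ("+ k-means++", 91.77, 50.39),
    ("+ ODConv", 93.28, 51.17),
    ("+ GAM", 94.49, 51.79),
    ("+ WIoU", 95.78, 52.29),
]


def gains_over_baseline(rows):
    """Cumulative gains (percentage points) relative to the first row."""
    _, base_50, base_5095 = rows[0]
    return [
        (name, round(m50 - base_50, 2), round(m5095 - base_5095, 2))
        for name, m50, m5095 in rows
    ]


def stepwise_gains(rows):
    """mAP@0.5 gain (percentage points) of each row over the row before it."""
    return [
        (rows[i][0], round(rows[i][1] - rows[i - 1][1], 2))
        for i in range(1, len(rows))
    ]


cumulative = gains_over_baseline(ablation)
steps = stepwise_gains(ablation)

for name, d50, d5095 in cumulative:
    print(f"{name:20s} mAP@0.5 +{d50:.2f}  mAP@0.5:0.95 +{d5095:.2f}")
for name, d50 in steps:
    print(f"step {name:15s} mAP@0.5 +{d50:.2f}")
```
</preformat>
<p>The cumulative values match the parenthesized entries of the table, and the stepwise differences sum to the overall <italic>mAP</italic>@0.5 gain of 5.05 percentage points.</p>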
<fig id="f12" position="float">
<label>Figure&#xa0;12</label>
<caption>
<p>The PR curve on the test set. <bold>(A)</bold> initial YOLOv7 network. <bold>(B)</bold> improved YOLOv7 network.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g012.tif"/>
</fig>
<fig id="f13" position="float">
<label>Figure&#xa0;13</label>
<caption>
<p>The Confusion Matrix on the test set. <bold>(A)</bold> initial YOLOv7 network. <bold>(B)</bold> improved YOLOv7 network.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g013.tif"/>
</fig>
<p>From <xref ref-type="fig" rid="f12">
<bold>Figures&#xa0;12</bold>
</xref> and <xref ref-type="fig" rid="f13">
<bold>13</bold>
</xref>, it can be observed that the improved YOLOv7 network detects Non-target objects noticeably better. In <xref ref-type="fig" rid="f12">
<bold>Figure&#xa0;12</bold>
</xref>, the PR curve of the improved YOLOv7 network yields a value of 0.909 for the Non-target category, an improvement of 0.095 over the original network&#x2019;s 0.814. In <xref ref-type="fig" rid="f13">
<bold>Figure&#xa0;13</bold>
</xref>, the Non-target entry in the confusion matrix of the improved YOLOv7 is 0.94, compared with 0.92 for the original network, an improvement of 0.02.</p>
<p>In addition, our experimental results provide evidence of the enhanced network&#x2019;s superior performance in detecting Non-target objects, as depicted in <xref ref-type="fig" rid="f14">
<bold>Figure&#xa0;14</bold>
</xref>. The original YOLOv7 network misclassified Non-target objects as Cylinder and Cone, whereas the improved YOLOv7 network identifies them correctly as Non-target. This reduces the false detection rate for the Non-target category, which is of considerable practical value in engineering applications: during a search, less time is wasted investigating Non-target objects.</p>
<fig id="f14" position="float">
<label>Figure&#xa0;14</label>
<caption>
<p>Comparison of Non-target category detection results between the improved YOLOv7 and the original YOLOv7 networks. <bold>(A)</bold> Labels. <bold>(B)</bold> Initial YOLOv7 network. <bold>(C)</bold> Improved YOLOv7 network.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g014.tif"/>
</fig>
<p>Furthermore, we compared our improved detection algorithm with several mainstream detection networks to validate the proposed method. The detection results are compared visually in <xref ref-type="fig" rid="f15">
<bold>Figure&#xa0;15</bold>
</xref>. Detailed detection metrics are presented in <xref ref-type="table" rid="T6">
<bold>Table&#xa0;6</bold>
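</xref>. The proposed approach achieves the highest <italic>mAP</italic>@0.5 and <italic>mAP</italic>@0.5: 0.95 among the compared networks, although its precision and recall are slightly below the best individual values. These results support its effectiveness for small target detection in SSS imagery.</p>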
<fig id="f15" position="float">
<label>Figure&#xa0;15</label>
<caption>
<p>Comparison of detection results between our method and other detection networks. The first row in the figure represents the ground truth labels for different target categories, while the second to fifth rows depict the detection results of various algorithms.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-11-1348883-g015.tif"/>
</fig>
<table-wrap id="T6" position="float">
<label>Table&#xa0;6</label>
<caption>
<p>Comparison of detection metrics between our method and other detection networks.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="center">Method</th>
<th valign="top" align="center">Precision(%)</th>
<th valign="top" align="center">Recall(%)</th>
<th valign="top" align="center">
<italic>mAP</italic>@0.5(%)</th>
<th valign="top" align="center">
<italic>mAP</italic>@0.5: 0.95(%)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">SSD</td>
<td valign="top" align="center">88.31</td>
<td valign="top" align="center">89.76</td>
<td valign="top" align="center">89.28</td>
<td valign="top" align="center">48.24</td>
</tr>
<tr>
<td valign="top" align="center">Faster-RCNN</td>
<td valign="top" align="center">85.33</td>
<td valign="top" align="center">83.91</td>
<td valign="top" align="center">87.19</td>
<td valign="top" align="center">46.73</td>
</tr>
<tr>
<td valign="top" align="center">YOLOv5</td>
<td valign="top" align="center">88.72</td>
<td valign="top" align="center">90.46</td>
<td valign="top" align="center">89.98</td>
<td valign="top" align="center">49.80</td>
</tr>
<tr>
<td valign="top" align="center">YOLOv7</td>
<td valign="top" align="center">93.56</td>
<td valign="top" align="center">89.12</td>
<td valign="top" align="center">90.73</td>
<td valign="top" align="center">49.78</td>
</tr>
<tr>
<td valign="top" align="center">Our method</td>
<td valign="top" align="center">92.99</td>
<td valign="top" align="center">89.10</td>
<td valign="top" align="center">95.78</td>
<td valign="top" align="center">52.29</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The results illustrated in <xref ref-type="fig" rid="f15">
<bold>Figure&#xa0;15</bold>
</xref> provide empirical validation of the efficacy of the approach introduced in this research. As demonstrated in columns (2), (3), and (4) of <xref ref-type="fig" rid="f15">
<bold>Figure&#xa0;15</bold>
</xref>, some mainstream detection networks often confuse cylindrical and conical objects. In contrast, the proposed method correctly detects these hard-to-distinguish objects and assigns them higher confidence scores, which highlights the advantage of the algorithm presented in this paper.</p>
<p>Nonetheless, it is important to note that the enhanced network in this study does exhibit certain limitations. For example, as depicted in column (6) of <xref ref-type="fig" rid="f15">
<bold>Figure&#xa0;15</bold>
</xref>, all networks misclassify a Non-target as a Cone. The misclassification arises because the Non-target is surrounded by a distinct shadow and its bright spot is similar in size to that of the Cone category, which produces a false positive. At present, no definitive solution exists for cases in which the acoustic image features are extremely similar but the actual objects belong to different categories. Acquiring higher-resolution images with a higher-precision device may help address this issue.</p>
</sec>
</sec>
<sec id="s5" sec-type="conclusions">
<label>5</label>
<title>Conclusions</title>
<p>This study collected a dataset of small-target SSS images during sea trials and proposed an improved detection method based on the YOLOv7 model for small targets in SSS images. The method uses the k-means++ algorithm to obtain more accurate initial anchor box sizes. It then replaces the static convolution modules in the YOLOv7 backbone network with ODConv and integrates a GAM attention mechanism into the YOLOv7 neck network, thereby enhancing the feature extraction capability of the detection network. For the loss function, WIoU is introduced to balance the influence of high-quality and low-quality anchor boxes on the gradients, strengthening the network&#x2019;s focus on average-quality anchor boxes. Experimental results demonstrate the effectiveness of the proposed algorithm: <italic>mAP</italic>@0.5 and <italic>mAP</italic>@0.5: 0.95 reach 95.78% and 52.29%, respectively, improvements of 5.05 and 2.51 percentage points over the original YOLOv7 network. Furthermore, comparisons with mainstream detection networks show that the proposed method achieves the highest mAP for small target detection in SSS images.</p>
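<p>To make the anchor-selection step concrete, the following Python sketch clusters bounding-box sizes with k-means++ seeding. It is a minimal illustration under stated assumptions, not the implementation used in this study: the distance 1&#xa0;&#x2212;&#xa0;IoU between width-height pairs is a common choice for YOLO-style anchor clustering and is assumed here, the box sizes are invented, and the function names <monospace>iou_wh</monospace>, <monospace>kmeanspp_seeds</monospace>, and <monospace>kmeans_anchors</monospace> are hypothetical.</p>
<preformat>
```python
import random


def iou_wh(box, centre):
    """IoU of two (width, height) boxes that share the same top-left corner."""
    inter = min(box[0], centre[0]) * min(box[1], centre[1])
    union = box[0] * box[1] + centre[0] * centre[1] - inter
    return inter / union


def kmeanspp_seeds(boxes, k, rng):
    """k-means++ seeding: sample each new centre with probability
    proportional to the squared distance (1 - IoU) to the nearest centre."""
    centres = [rng.choice(boxes)]
    for _ in range(k - 1):
        d2 = [min(1.0 - iou_wh(b, c) for c in centres) ** 2 for b in boxes]
        centres.append(rng.choices(boxes, weights=d2, k=1)[0])
    return centres


def kmeans_anchors(boxes, k, iterations=50, seed=0):
    """Cluster (width, height) pairs and return k anchors sorted by area."""
    rng = random.Random(seed)
    centres = kmeanspp_seeds(boxes, k, rng)
    for _ in range(iterations):
        # Assignment step: each box joins the centre with the highest IoU.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            nearest = max(range(k), key=lambda i: iou_wh(b, centres[i]))
            clusters[nearest].append(b)
        # Update step: mean width and height; an empty cluster keeps its centre.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centres[i])
        if updated == centres:
            break
        centres = updated
    return sorted(centres, key=lambda wh: wh[0] * wh[1])


# Hypothetical box sizes in pixels (not taken from the SSS dataset).
boxes = [(12, 10), (14, 11), (13, 12), (30, 28), (33, 26),
         (31, 30), (60, 20), (64, 22), (58, 19)]
anchors = kmeans_anchors(boxes, k=3)
print(anchors)
```
</preformat>
<p>With real data, <monospace>boxes</monospace> would hold the labelled widths and heights of the training targets, and <monospace>k</monospace> would equal the number of anchors expected by the detector (nine in standard YOLOv7 configurations, which is an assumption about the configuration used here).</p>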
<p>The proposed method can be applied to autonomous target detection on Unmanned Underwater Vehicles (UUVs) and Unmanned Surface Vehicles (USVs), enhancing the autonomous operational capabilities of unmanned ocean observation platforms. In the future, we plan to collect more diverse small-target data and to continue researching SSS-based small target detection methods to further contribute to underwater exploration.</p>
</sec>
<sec id="s6" sec-type="data-availability">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary materials; further inquiries can be directed to the corresponding author/s.</p>
</sec>
<sec id="s7" sec-type="author-contributions">
<title>Author contributions</title>
<p>CC: Conceptualization, Methodology, Software, Writing &#x2013; original draft, Writing &#x2013; review &amp; editing. CW: Methodology, Writing &#x2013; review &amp; editing. DY: Software, Writing &#x2013; review &amp; editing. XW: Software, Writing &#x2013; review &amp; editing. WL: Project administration, Writing &#x2013; review &amp; editing. FZ: Conceptualization, Project administration, Writing &#x2013; review &amp; editing.</p>
</sec>
</body>
<back>
<sec id="s8" sec-type="funding-information">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This study was supported by the National Key Research and Development Program (2023YFC2808400).</p>
</sec>
<sec id="s9" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
<p>The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.</p>
</sec>
<sec id="s10" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors&#xa0;and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Arthur</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Vassilvitskii</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2007</year>). &#x201c;<article-title>K-means++: the advantages of careful seeding</article-title>,&#x201d; in <source>Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA 2007.</source> (<publisher-loc>New Orleans, Louisiana, USA</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1027</fpage>&#x2013;<lpage>1035</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1145/1283383.1283494</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bhattacharya</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Maddikunta</surname> <given-names>P. K. R.</given-names>
</name>
<name>
<surname>Pham</surname> <given-names>Q.-V.</given-names>
</name>
<name>
<surname>Gadekallu</surname> <given-names>T. R.</given-names>
</name>
<name>
<surname>Chowdhary</surname> <given-names>C. L.</given-names>
</name>
<name>
<surname>Alazab</surname> <given-names>M.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>Deep learning and medical image processing for coronavirus (COVID-19) pandemic: A survey</article-title>. <source>Sustain. Cities Soc.</source> <volume>65</volume>, <fpage>102589</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1016/j.scs.2020.102589</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Kong</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>H.</given-names>
</name>
<etal/>
</person-group>. (<year>2023</year>). <article-title>Semantic attention and relative scene depth-guided network for underwater image enhancement</article-title>. <source>Eng. Appl. Artif. Intell.</source> <volume>123</volume>, <fpage>106532</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.engappai.2023.106532</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Shen</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Dong</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Underwater object detection by combining the spectral residual and three-frame algorithm</article-title>,&#x201d; in <source>Lecture Notes in Electrical Engineering</source> (<publisher-loc>Berlin, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>). <volume>279</volume>, <fpage>1109</fpage>&#x2013;<lpage>1114</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1007/978-3-642-41674-3_154</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>G.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>One-stage CNN detector-based benthonic organisms detection with limited training dataset</article-title>. <source>Neural Networks</source> <volume>144</volume>, <fpage>247</fpage>&#x2013;<lpage>259</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.neunet.2021.08.014</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fan</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Xia</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Detection and segmentation of underwater objects from forward-looking sonar based on a modified Mask RCNN</article-title>. <source>Signal Image Video Process.</source> <volume>15</volume>, <fpage>1135</fpage>&#x2013;<lpage>1143</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s11760-020-01841-x</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ho&#x17c;y&#x144;</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A review of underwater mine detection and classification in sonar imagery</article-title>. <source>Electronics</source> <volume>10</volume>, <fpage>2943</fpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/electronics10232943</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hou</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Lei</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Xi</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Deep learning-based subsurface target detection from GPR scans</article-title>. <source>IEEE Sensors J.</source> <volume>21</volume>, <fpage>8161</fpage>&#x2013;<lpage>8171</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JSEN.2021.3050262</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Shen</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Squeeze-and-Excitation Networks</article-title>,&#x201d; in <source>IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>. (<publisher-loc>Salt Lake City, UT, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7132</fpage>&#x2013;<lpage>7141</lpage>. doi: <pub-id pub-id-type="doi">10.1109/CVPR.2018.00745</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>B.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>An attention-based cascade R-CNN model for sternum fracture detection in X-ray images</article-title>. <source>CAAI Trans. Intell. Technol.</source> <volume>7</volume>, <fpage>658</fpage>&#x2013;<lpage>670</lpage>. doi: <pub-id pub-id-type="doi">10.1049/cit2.12072</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jin</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Liang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>C.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Accurate underwater ATR in forward-looking sonar imagery using deep convolutional neural networks</article-title>. <source>IEEE Access</source> <volume>7</volume>, <fpage>125522</fpage>&#x2013;<lpage>125531</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2939005</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname> <given-names>W.-K.</given-names>
</name>
<name>
<surname>Bae</surname> <given-names>H. S.</given-names>
</name>
<name>
<surname>Son</surname> <given-names>S.-U.</given-names>
</name>
<name>
<surname>Park</surname> <given-names>J.-S.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Neural network-based underwater object detection off the coast of the Korean Peninsula</article-title>. <source>J. Mar. Sci. Eng.</source> <volume>10</volume>, <fpage>1436</fpage>. doi: <pub-id pub-id-type="doi">10.3390/jmse10101436</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Le</surname> <given-names>H. T.</given-names>
</name>
<name>
<surname>Phung</surname> <given-names>S. L.</given-names>
</name>
<name>
<surname>Chapple</surname> <given-names>P. B.</given-names>
</name>
<name>
<surname>Bouzerdoum</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Ritz</surname> <given-names>C. H.</given-names>
</name>
<name>
<surname>Tran</surname> <given-names>L. C.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Deep Gabor neural network for automatic detection of mine-like objects in sonar imagery</article-title>. <source>IEEE Access</source> <volume>8</volume>, <fpage>94126</fpage>&#x2013;<lpage>94139</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2020.2995390</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Park</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Kim</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Deep learning from shallow dives: Sonar image generation and training for underwater object detection</article-title>. <source>arXiv preprint arXiv:1810.07990</source>. doi:&#xa0;<pub-id pub-id-type="doi">10.48550/arXiv.1810.07990</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Shen</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Xiao</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>X.</given-names>
</name>
<etal/>
</person-group>. (<year>2023</year>b). <article-title>Improved neural network with spatial pyramid pooling and online datasets preprocessing for underwater target detection based on side scan sonar imagery</article-title>. <source>Remote Sens.</source> <volume>15</volume>, <fpage>440</fpage>. doi: <pub-id pub-id-type="doi">10.3390/rs15020440</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Yue</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Feng</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2023</year>c). <article-title>Real-time underwater target detection for AUV using side scan sonar images based on deep learning</article-title>. <source>Appl. Ocean Res.</source> <volume>138</volume>, <fpage>103630</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.apor.2023.103630</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>X. F.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Q. J.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Target detection in color sonar image based on YOLOv5 network</article-title>,&#x201d; in <source>2021 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)</source>. (<publisher-loc>Xi'an, China</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Ye</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Xi</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2023</year>a). <article-title>A texture feature removal network for sonar image classification and detection</article-title>. <source>Remote Sens.</source> <volume>15</volume>, <fpage>616</fpage>. doi: <pub-id pub-id-type="doi">10.3390/rs15030616</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Zhou</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Yao</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Omni-dimensional dynamic convolution</article-title>. <source>arXiv preprint arXiv:2209.07947</source>. doi:&#xa0;<pub-id pub-id-type="doi">10.48550/arXiv.2209.07947</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Shao</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Hoffmann</surname> <given-names>N.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Global attention mechanism: Retain information to enhance channel-spatial interactions</article-title>. <source>arXiv preprint arXiv:2112.05561</source>. doi:&#xa0;<pub-id pub-id-type="doi">10.48550/arXiv.2112.05561</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mukherjee</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Gupta</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Ray</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Phoha</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Symbolic analysis of sonar data for underwater target detection</article-title>. <source>IEEE J. Oceanic Eng.</source> <volume>36</volume>, <fpage>219</fpage>&#x2013;<lpage>230</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JOE.2011.2122590</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Neupane</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Seok</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A review on deep learning-based approaches for automatic sonar target recognition</article-title>. <source>Electronics</source> <volume>9</volume>, <elocation-id>1972</elocation-id>. doi: <pub-id pub-id-type="doi">10.3390/electronics9111972</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Park</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Woo</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Lee</surname> <given-names>J.-Y.</given-names>
</name>
<name>
<surname>Kweon</surname> <given-names>I. S.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>BAM: Bottleneck attention module</article-title>. <source>arXiv preprint arXiv:1807.06514</source>. doi:&#xa0;<pub-id pub-id-type="doi">10.48550/arXiv.1807.06514</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Raghuvanshi</surname> <given-names>D. S.</given-names>
</name>
<name>
<surname>Dutta</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Vaidya</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Design and analysis of a novel sonar-based obstacle-avoidance system for the visually impaired and unmanned systems</article-title>,&#x201d; in <source>2014 International Conference on Embedded Systems (ICES)</source>. (<publisher-loc>Coimbatore, India</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>238</fpage>&#x2013;<lpage>243</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rezatofighi</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Tsoi</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Gwak</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Sadeghian</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Reid</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Savarese</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Generalized intersection over union: A metric and a loss for bounding box regression</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.</source> (<publisher-loc>Long Beach, CA, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>658</fpage>&#x2013;<lpage>666</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Singh</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Valdenegro-Toro</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>The marine debris dataset for forward-looking sonar semantic segmentation</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>. (<publisher-loc>Montreal, QC, Canada</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3741</fpage>&#x2013;<lpage>3749</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Soeb</surname> <given-names>M. J. A.</given-names>
</name>
<name>
<surname>Jubayer</surname> <given-names>M. F.</given-names>
</name>
<name>
<surname>Tarin</surname> <given-names>T. A.</given-names>
</name>
<name>
<surname>Al Mamun</surname> <given-names>M. R.</given-names>
</name>
<name>
<surname>Ruhad</surname> <given-names>F. M.</given-names>
</name>
<name>
<surname>Parven</surname> <given-names>A.</given-names>
</name>
<etal/>
</person-group>. (<year>2023</year>). <article-title>Tea leaf disease detection and identification based on YOLOv7 (YOLO-T)</article-title>. <source>Sci. Rep.</source> <volume>13</volume>, <fpage>6078</fpage>. doi: <pub-id pub-id-type="doi">10.1038/s41598-023-33270-4</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Jin</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Yu</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>AUV-based side-scan sonar real-time method for underwater-target detection</article-title>. <source>J. Mar. Sci. Eng.</source> <volume>11</volume>, <fpage>690</fpage>. doi: <pub-id pub-id-type="doi">10.3390/jmse11040690</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tong</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Yu</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Wise-IoU: Bounding box regression loss with dynamic focusing mechanism</article-title>. <source>arXiv preprint arXiv:2301.10051</source>. doi:&#xa0;<pub-id pub-id-type="doi">10.48550/arXiv.2301.10051</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>C.-Y.</given-names>
</name>
<name>
<surname>Bochkovskiy</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Liao</surname> <given-names>H.-Y. M.</given-names>
</name>
</person-group> (<year>2023</year>a). &#x201c;<article-title>YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>. (<publisher-loc>Vancouver, BC, Canada</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7464</fpage>&#x2013;<lpage>7475</lpage>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Kong</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Gong</surname> <given-names>Y.</given-names>
</name>
<etal/>
</person-group>. (<year>2023</year>b). <article-title>Underwater attentional generative adversarial networks for image enhancement</article-title>. <source>IEEE Trans. Human-Machine Syst.</source> <volume>53</volume> (<issue>3</issue>), <fpage>490</fpage>&#x2013;<lpage>500</lpage>. doi: <pub-id pub-id-type="doi">10.1109/THMS.2023.3261341</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Karimi</surname> <given-names>H. R.</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2023</year>c). <article-title>Deep learning-based visual detection of marine organisms: A survey</article-title>. <source>Neurocomputing</source> <volume>532</volume>, <fpage>1</fpage>&#x2013;<lpage>32</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.neucom.2023.02.018</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Feng</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>G.</given-names>
</name>
<name>
<surname>He</surname> <given-names>B.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Detection of weak and small targets in forward-looking sonar image using multi-branch shuttle neural network</article-title>. <source>IEEE Sensors J.</source> <volume>22</volume>, <fpage>6772</fpage>&#x2013;<lpage>6783</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JSEN.2022.3147234</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Zhu</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Zuo</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Hu</surname> <given-names>Q.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>ECA-Net: Efficient channel attention for deep convolutional neural networks</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>. (<publisher-loc>Seattle, WA, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>11534</fpage>&#x2013;<lpage>11542</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Guo</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zeng</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Sonar image target detection based on adaptive global feature enhancement network</article-title>. <source>IEEE Sensors J.</source> <volume>22</volume>, <fpage>1509</fpage>&#x2013;<lpage>1530</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JSEN.2021.3131645</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Woo</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Park</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Lee</surname> <given-names>J.-Y.</given-names>
</name>
<name>
<surname>Kweon</surname> <given-names>I. S.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>CBAM: Convolutional block attention module</article-title>,&#x201d; in <source>Proceedings of the European conference on computer vision (ECCV)</source>. (<publisher-loc>Munich, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>3</fpage>&#x2013;<lpage>19</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xiao</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Cai</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Q.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A shadow capture deep neural network for underwater forward-looking sonar image detection</article-title>. <source>Mobile Inf. Syst.</source> <volume>2021</volume>, <fpage>1</fpage>&#x2013;<lpage>10</lpage>. doi: <pub-id pub-id-type="doi">10.1155/2021/3168464</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Cui</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Yu</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Yuan</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Deep learning based steel pipe weld defect detection</article-title>. <source>Appl. Artif. Intell.</source> <volume>35</volume>, <fpage>1237</fpage>&#x2013;<lpage>1249</lpage>. doi: <pub-id pub-id-type="doi">10.1080/08839514.2021.1975391</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Qu</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>J.</given-names>
</name>
<etal/>
</person-group>. (<year>2023</year>). <article-title>Improved apple fruit target recognition method based on YOLOv7 model</article-title>. <source>Agriculture</source> <volume>13</volume>, <elocation-id>1278</elocation-id>. doi: <pub-id pub-id-type="doi">10.3390/agriculture13071278</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yu</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Jiang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Cao</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>T.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>UnitBox: An advanced object detection network</article-title>,&#x201d; in <source>Proceedings of the 24th ACM international conference on Multimedia</source>. (<publisher-loc>Amsterdam, The Netherlands</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>516</fpage>&#x2013;<lpage>520</lpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>Y.-F.</given-names>
</name>
<name>
<surname>Ren</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Jia</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Tan</surname> <given-names>T.</given-names>
</name>
</person-group> (<year>2022</year>b). <article-title>Focal and efficient IoU loss for accurate bounding box regression</article-title>. <source>Neurocomputing</source> <volume>506</volume>, <fpage>146</fpage>&#x2013;<lpage>157</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.neucom.2022.07.042</pub-id>
</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Tang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zhong</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Ning</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>K.</given-names>
</name>
</person-group> (<year>2021</year>a). <article-title>Self-trained target detection of radar and sonar images using automatic deep learning</article-title>. <source>IEEE Trans. Geosci. Remote Sens.</source> <volume>60</volume>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TGRS.2021.3096011</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Tang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zhong</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Ning</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>K.</given-names>
</name>
</person-group> (<year>2021</year>b). <article-title>Self-trained target detection of radar and sonar images using automatic deep learning</article-title>. <source>IEEE Trans. Geosci. Remote Sens.</source> <volume>60</volume>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TGRS.2021.3096011</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Tian</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Shao</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Cheng</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2022</year>a). <article-title>Target detection of forward-looking sonar image based on improved YOLOv5</article-title>. <source>IEEE Access</source> <volume>10</volume>, <fpage>18023</fpage>&#x2013;<lpage>18034</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2022.3150339</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Cheng</surname> <given-names>W.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A new steel defect detection algorithm based on deep learning</article-title>. <source>Comput. Intell. Neurosci.</source> <volume>2021</volume>, <fpage>1</fpage>&#x2013;<lpage>13</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1155/2021/5592878</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zheng</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Ye</surname> <given-names>R.</given-names>
</name>
<name>
<surname>Ren</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Distance-IoU loss: Faster and better learning for bounding box regression</article-title>,&#x201d; in <source>Proceedings of the AAAI conference on artificial intelligence</source>. (<publisher-loc>New York, USA</publisher-loc>: <publisher-name>AAAI</publisher-name>), Vol. <volume>34</volume>, <fpage>12993</fpage>&#x2013;<lpage>13000</lpage>.</citation>
</ref>
</ref-list>
</back>
</article>