<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2023.1220443</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>E-YOLOv4-tiny: a traffic sign detection algorithm for urban road scenarios</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Xiao</surname> <given-names>Yanqiu</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Yin</surname> <given-names>Shiao</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Cui</surname> <given-names>Guangzhen</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2306013/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Weili</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Yao</surname> <given-names>Lei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Fang</surname> <given-names>Zhanpeng</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>College of Mechanical and Electrical Engineering, Zhengzhou University of Light Industry</institution>, <addr-line>Zhengzhou</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Collaborative Innovation Center of Intelligent Tunnel Boring Machine</institution>, <addr-line>Zhengzhou</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Hong Qiao, University of Chinese Academy of Sciences, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Yahia Said, Northern Border University, Saudi Arabia; Jiong Wu, University of Pennsylvania, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Guangzhen Cui <email>15225115031&#x00040;163.com</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>07</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>17</volume>
<elocation-id>1220443</elocation-id>
<history>
<date date-type="received">
<day>10</day>
<month>05</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>06</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Xiao, Yin, Cui, Zhang, Yao and Fang.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Xiao, Yin, Cui, Zhang, Yao and Fang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>In urban road scenes, traffic signs are small and surrounded by a large amount of interfering information, so current methods struggle to achieve good detection results in unmanned driving applications.</p></sec>
<sec>
<title>Methods</title>
<p>To address these challenges, this paper proposes E-YOLOv4-tiny, an improved algorithm based on YOLOv4-tiny. Firstly, we construct an efficient layer aggregation lightweight block with depthwise separable convolutions to enhance the feature extraction ability of the backbone. Secondly, we present a feature fusion refinement module that aims to fully integrate multi-scale features. This module incorporates our proposed efficient coordinate attention to suppress interference information during feature transfer. Finally, we propose an improved S-RFB to add contextual feature information to the network, further enhancing the accuracy of traffic sign detection.</p></sec>
<sec>
<title>Results and discussion</title>
<p>The proposed method is evaluated on the CCTSDB dataset and the Tsinghua-Tencent 100K dataset. The experimental results show that it outperforms the original YOLOv4-tiny in traffic sign detection, improving mAP by 3.76% and 7.37%, respectively, while reducing the number of parameters by 21%. Compared with other advanced methods, the proposed method achieves a better balance among accuracy, real-time performance, and the number of model parameters, which makes it better suited to practical application.</p></sec></abstract>
<kwd-group>
<kwd>traffic sign detection</kwd>
<kwd>unmanned driving</kwd>
<kwd>small object</kwd>
<kwd>feature fusion</kwd>
<kwd>convolutional neural network</kwd>
<kwd>YOLOv4-tiny</kwd>
</kwd-group>
<contract-sponsor id="cn001">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor>
<contract-sponsor id="cn002">Henan Provincial Science and Technology Research Project<named-content content-type="fundref-id">10.13039/501100017700</named-content></contract-sponsor>
<contract-sponsor id="cn003">Key Scientific Research Project of Colleges and Universities in Henan Province<named-content content-type="fundref-id">10.13039/501100013066</named-content></contract-sponsor>
<counts>
<fig-count count="8"/>
<table-count count="5"/>
<equation-count count="15"/>
<ref-count count="45"/>
<page-count count="13"/>
<word-count count="7974"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>The semantic information conveyed by traffic signs provides accurate details on the road conditions ahead, which in-vehicle intelligent systems can use to help driverless vehicles make informed decisions. Traffic sign detection technology plays a vital role in reducing the incidence of traffic accidents and ensuring safe driving. As a result, it has become a key component of current assisted driving systems and is of considerable research importance within the urban transportation field (Badue et al., <xref ref-type="bibr" rid="B1">2021</xref>).</p>
<p>Traffic sign detection techniques are divided into traditional methods and deep learning-based methods (Sharma and Mir, <xref ref-type="bibr" rid="B30">2020</xref>). Traditional methods mainly rely on manually designed features, combining several of them to extract and identify targets. However, their weak generalization ability makes them poorly robust when detecting signs in complex scenes. The algorithms based on deep learning include two-stage detection methods represented by R-CNN (Girshick et al., <xref ref-type="bibr" rid="B5">2014</xref>), Fast R-CNN (Girshick, <xref ref-type="bibr" rid="B4">2015</xref>), and Faster R-CNN (Ren et al., <xref ref-type="bibr" rid="B28">2015</xref>). This type of method first classifies the extracted candidate regions and then performs position correction, which results in good detection accuracy. However, the two-stage detection method sacrifices detection speed to a certain extent and requires significant storage space. The YOLO series (Redmon et al., <xref ref-type="bibr" rid="B25">2016</xref>; Redmon and Farhadi, <xref ref-type="bibr" rid="B26">2017</xref>, <xref ref-type="bibr" rid="B27">2018</xref>; Bochkovskiy et al., <xref ref-type="bibr" rid="B2">2020</xref>) and SSD (Liu et al., <xref ref-type="bibr" rid="B21">2016</xref>) are representatives of single-stage detection methods. By omitting the step of generating candidate regions, they directly regress the feature map to obtain the target category and bounding box coordinates. While this approach is faster than the two-stage detection algorithms, it still falls short in terms of accuracy.</p>
<p>The ability to quickly detect distant traffic signs is critical for autonomous driving decision systems to provide sufficient response time. However, traffic signs in images typically occupy an absolute area of no more than 32 &#x000D7; 32 pixels, rendering the detection task a classic example of small target detection (Lin et al., <xref ref-type="bibr" rid="B18">2014</xref>). Many researchers have recently improved small target detection algorithms based on deep learning. Yang and Tong (<xref ref-type="bibr" rid="B39">2022</xref>) proposed a visual multi-scale attention module based on the YOLOv3 algorithm, which integrated feature maps of different scales with attention weights to suppress interference information in traffic sign features. Pei et al. (<xref ref-type="bibr" rid="B23">2023</xref>) proposed an LCB-YOLOv5 algorithm to detect small targets in remote sensing images. This method improves the accuracy of small target detection by enlarging the receptive field and adopting the EIOU loss function. Prasetyo et al. (<xref ref-type="bibr" rid="B24">2022</xref>) improved the diversity of network feature extraction by incorporating a wing convolution layer into the YOLOv4-tiny&#x00027;s backbone network. They also added extra detection heads to enhance the accuracy of the detector for small targets. Wei et al. (<xref ref-type="bibr" rid="B36">2023</xref>) proposed an approach to improve the detection ability of small targets by adding a transformer attention mechanism and deformable convolution to the backbone network. They also utilized deformable ROI pooling to process multi-scale semantic information extracted from the network, effectively addressing the problem of multi-scale traffic sign detection. Huang et al. 
(<xref ref-type="bibr" rid="B15">2022</xref>) effectively improved the detection accuracy of the SSD algorithm for small targets by fusing the target detection layer with adjacent features, and validated the approach on their own indoor small target dataset. Wu and Liao (<xref ref-type="bibr" rid="B38">2022</xref>) proposed an SSD traffic sign detection algorithm combining the receptive field block (RFB) (Liu and Huang, <xref ref-type="bibr" rid="B19">2018</xref>) and the path aggregation network (Liu et al., <xref ref-type="bibr" rid="B20">2018</xref>) to improve target localization and classification accuracy. However, this method was only suitable for detection when there was little interference information around the target. While the above-mentioned methods have succeeded in enhancing the detection accuracy of traffic sign models, they have also led to an increase in model size. Since traffic sign detection tasks are usually deployed on devices with limited storage space, the pursuit of lightweight models is of significant practical value.</p>
<p>The backbone serves as the primary feature extractor in a convolutional neural network (CNN) model, and its performance plays a critical role in determining the strength of the model&#x00027;s feature extraction capability. Currently, classic lightweight backbone networks such as the MobileNet series (Howard et al., <xref ref-type="bibr" rid="B12">2017</xref>, <xref ref-type="bibr" rid="B11">2019</xref>), ShuffleNet series (Ma et al., <xref ref-type="bibr" rid="B22">2018</xref>; Zhang et al., <xref ref-type="bibr" rid="B43">2018</xref>), and GhostNet (Han et al., <xref ref-type="bibr" rid="B8">2020</xref>) are widely used. However, although these networks are known for their fast inference speeds, their feature extraction ability for small targets is suboptimal. On the other hand, more complex backbone networks like ResNet (He et al., <xref ref-type="bibr" rid="B9">2016</xref>), DenseNet (Huang et al., <xref ref-type="bibr" rid="B14">2017</xref>), and DLA (Yu et al., <xref ref-type="bibr" rid="B40">2018</xref>) have achieved higher detection accuracy but at the expense of increased parameter counts and computational complexity. As a result, these networks may not meet real-time requirements in terms of inference speed. Multi-scale feature fusion is another important approach to address the model&#x00027;s insufficient ability to extract small target features. The feature pyramid network (FPN) (Lin et al., <xref ref-type="bibr" rid="B17">2017</xref>) fuses features of multiple scales through a top-down pathway with lateral connections to obtain fused features with stronger expression capability, which are more beneficial for small target detection. 
Additionally, other methods such as PANet, NAS-FPN (Ghiasi et al., <xref ref-type="bibr" rid="B3">2019</xref>), and BiFPN (Tan et al., <xref ref-type="bibr" rid="B31">2020</xref>) explore more diverse information fusion paths and adaptive weighting methods to enhance the expression ability of different scale features, further improving the accuracy of small target detection. Although all these methods improve the performance of small target detection to different degrees, they often fail to take into account the significant amount of redundant information present during feature transfer, which can impede the network&#x00027;s ability to effectively fuse multi-scale features.</p>
<p>In urban road scenes, the background of traffic signs often contains many objects with similar characteristics, which introduce significant interference information during the feature extraction process of the model. This interference may cause the detector to produce false detections. In recent years, attention mechanisms have emerged as an effective method for enhancing features in the field of deep learning image processing (Guo et al., <xref ref-type="bibr" rid="B7">2022</xref>). By imitating the way human vision extracts external information, the attention mechanism can identify key feature regions of the target in the image and suppress distracting information to enhance representation. Several attention mechanisms, such as SE (Hu et al., <xref ref-type="bibr" rid="B13">2018</xref>), CBAM (Woo et al., <xref ref-type="bibr" rid="B37">2018</xref>), and coordinate attention (Hou et al., <xref ref-type="bibr" rid="B10">2021</xref>), have been proposed to enhance feature expression by suppressing interference information and capturing key feature areas. However, these methods have limitations. For instance, SE only considers channel weights and ignores location information, while CBAM focuses on local feature information without capturing long-range dependencies. Coordinate attention, in contrast, combines both channel and location information and captures long-range dependencies, but at the cost of increased computation, which reduces the real-time performance of the algorithm. Despite these limitations, attention mechanisms have been shown to improve the detection of small targets in complex backgrounds.</p>
<p>In summary, existing deep learning-based methods primarily aim to improve the detection capability of models by enhancing feature extraction, utilizing multi-scale feature fusion, and incorporating attention mechanisms. However, in practical applications, traffic sign detection tasks have strict requirements for accuracy, model size, and real-time performance. Existing methods typically focus only on improving detection accuracy while neglecting model size and real-time performance, which makes them difficult to apply in practical scenarios. In this paper, we propose E-YOLOv4-tiny, an algorithm for urban road traffic sign detection based on the lightweight YOLOv4-tiny (Wang C. Y. et al., <xref ref-type="bibr" rid="B32">2021</xref>) algorithm. It is designed to balance model accuracy, parameter count, and real-time performance, and it takes into account the influence of interference information in the feature fusion process on multi-scale feature representation. The proposed method further improves detection accuracy while reducing the number of model parameters and maintaining real-time performance, making it better suited to practical scenarios. The main contributions of this paper are as follows.</p>
<p>(1) To address the poor feature extraction performance of the YOLOv4-tiny&#x00027;s backbone network, we construct a lightweight E-DSC block to optimize it. Drawing inspiration from ELAN&#x00027;s gradient structure and employing depthwise separable convolutions to reduce the network parameters while maintaining performance, we aim to improve the module with minimal parameter costs.</p>
<p>(2) In order to solve the problem of redundant information interference during FPN feature fusion at different levels, a feature fusion refinement module (FFRM) is proposed in this paper. Our method suppresses redundant and interfering information in the multi-scale feature fusion process by constructing a semantic information refinement module and a texture information refinement module that combine efficient coordinate attention (ECA). Additionally, we utilize residual connections to ensure that the output feature maps integrate high-level semantics and detailed information.</p>
<p>(3) We improve the coordinate attention mechanism to further focus and enhance small object features. We use both global max pooling and global average pooling to compress feature maps along the spatial dimension, allowing for a more accurate reflection of channel responses to small objects. Additionally, we employ group convolution and channel shuffling operations to improve the computational efficiency of the model.</p>
<p>(4) To address the issue of limited receptive fields in the YOLOv4-tiny, we propose an S-RFB module in this paper. We simplify the structure of the original RFB module and reduce the number of convolution operations in each branch. The aim is to integrate contextual information into the network to enhance the ability to detect small objects without introducing an excessive number of parameters.</p>
<p>(5) The proposed method is trained and evaluated on two benchmark datasets, the CSUST Chinese Traffic Sign Detection Benchmark (CCTSDB) and Tsinghua-Tencent 100K (TT100K). The experimental results demonstrate that our method outperforms several state-of-the-art methods in terms of small target detection performance for urban road traffic signs.</p>
<p>The remainder of this article is structured as follows. Section 2 provides a brief overview of the YOLOv4-tiny algorithm. Section 3 provides a detailed introduction to the E-YOLOv4-tiny algorithm proposed in this paper. Section 4 presents the results of ablation experiments conducted on our proposed algorithm, as well as a performance comparison with other state-of-the-art algorithms on CCTSDB and TT100K datasets. Finally, in Section 5, we summarize our article and discuss our future research directions.</p></sec>
<sec id="s2">
<title>2. YOLOv4-tiny algorithm</title>
<p>YOLOv4-tiny is a simplified model based on YOLOv4 and is currently a popular lightweight detection network. The detection process of YOLOv4-tiny is the same as that of YOLOv4. Firstly, YOLOv4-tiny resizes the input image to a fixed size. Next, the input image is divided into a grid of <italic>S</italic>&#x000D7;<italic>S</italic> cells. Within each cell, the model predicts <italic>B</italic> bounding boxes and identifies objects of <italic>D</italic> categories. Each predicted bounding box contains the category name, the center coordinates (<italic>x</italic>,<italic>y</italic>) of the bounding box, the width and height (<italic>w</italic>,<italic>h</italic>) of the bounding box, and confidence information. Finally, a non-maximum suppression algorithm is applied to remove redundant candidate boxes and obtain the final detection boxes for the model.</p>
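<p>The final non-maximum suppression step can be illustrated with a minimal Python sketch. It is not the YOLOv4-tiny implementation: the corner-format boxes (x1, y1, x2, y2), the function names iou and nms, and the overlap threshold of 0.5 are assumptions chosen for illustration, and a real detector would typically apply the procedure separately for each category.</p>
<preformat><![CDATA[
```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the kept boxes, best score first."""
    # Visit candidates from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical candidates: the first two boxes cover the same sign.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```
]]></preformat>
<p>In this example the second box overlaps the first with an IoU of about 0.68, so it is treated as a redundant candidate and removed, while the third box is kept because it does not overlap any retained box.</p>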
<p>The YOLOv4-tiny backbone consists of basic components such as CBL and cross stage partial (CSP) (Wang et al., <xref ref-type="bibr" rid="B33">2020</xref>) blocks. The CBL structure consists of a 3 &#x000D7; 3 2D convolution, a BN layer, and a LeakyReLU activation function. In this structure, the convolution stride is set to 2 to down-sample the feature map. The CSP structure is made up of CBL blocks and the Concat operation; its cross-layer concatenation helps to combine feature information from different layers. YOLOv4-tiny uses the FPN structure for feature fusion and obtains two fused feature maps with sizes of 38 &#x000D7; 38 and 19 &#x000D7; 19, respectively. Finally, the fused feature maps are sent to the detection head for processing, and the confidence and position information of the target are obtained. The network structure of YOLOv4-tiny is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>YOLOv4-tiny framework diagram.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1220443-g0001.tif"/>
</fig></sec>
<sec id="s3">
<title>3. E-YOLOv4-tiny algorithm</title>
<p>The YOLOv4-tiny algorithm has weak feature extraction ability for small targets and does not take into account the influence of interference information during feature fusion on multi-scale feature representation, which leads to low accuracy when detecting traffic signs on urban roads. Therefore, this paper proposes an improved E-YOLOv4-tiny algorithm, which feeds feature maps with down-sampling factors of 4 and 8 to the prediction heads in order to exploit low-level feature maps that carry more detailed information. Furthermore, the backbone and feature fusion parts are optimized to achieve improved detection performance. <xref ref-type="fig" rid="F2">Figure 2</xref> illustrates the structure of the E-YOLOv4-tiny.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>E-YOLOv4-tiny framework diagram.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1220443-g0002.tif"/>
</fig>
<sec>
<title>3.1. Backbone based on E-DSC block</title>
<p>The low resolution of traffic signs in images collected from urban roads often presents a challenge for detectors to extract reliable features, which may result in missed targets. ELAN (Wang et al., <xref ref-type="bibr" rid="B34">2022</xref>) addresses the issue of &#x0201C;how to design an efficient network&#x0201D; by analyzing the gradient path of the network. By designing gradient paths in a reasonable manner, ELAN can lengthen the shortest gradient path of the entire network with fewer transition layers, leading to improved efficiency. Furthermore, ELAN combines the weights of different feature layers, enabling the network to learn more diverse features. Compared to CSP structures, ELAN can improve the model&#x00027;s learning capabilities further through better combinations of gradient paths. However, convolution operations in multi-branch paths can significantly increase the network&#x00027;s parameters and consume more memory on the device.</p>
<p>Depthwise separable convolution (DSC) (Howard et al., <xref ref-type="bibr" rid="B12">2017</xref>) can significantly reduce the number of network parameters and the computational cost with only a small loss of accuracy compared with ordinary convolution, and its structure is shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. DSC splits the input feature map into individual channels and applies depthwise convolution (Dwise) to process spatial information along the height and width directions. Each channel is associated with a dedicated convolution kernel, so the number of feature maps generated equals the number of channels in the input layer. Subsequently, pointwise convolution (Pwise) is applied to supplement the missing cross-channel information in the feature map, leading to the final feature map. Compared with regular convolution, the combination of depthwise convolution and pointwise convolution reduces the number of parameters while preserving the feature extraction capability.</p>
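<p>The parameter saving can be made concrete with a short calculation. The Python sketch below counts convolution weights only (bias and BN parameters are ignored) and uses a hypothetical layer with a 3 &#x000D7; 3 kernel, 64 input channels, and 128 output channels; the function names and layer sizes are illustrative assumptions rather than values reported in this paper.</p>
<preformat><![CDATA[
```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def dsc_params(k, c_in, c_out):
    """Weights of a depthwise separable convolution (no bias)."""
    depthwise = k * k * c_in   # one k x k kernel per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


k, c_in, c_out = 3, 64, 128    # hypothetical layer sizes
standard = conv_params(k, c_in, c_out)
separable = dsc_params(k, c_in, c_out)

print(standard)                        # 73728
print(separable)                       # 8768
print(round(separable / standard, 4))  # 0.1189
```
]]></preformat>
<p>The ratio equals 1/c_out + 1/k&#x000B2;, so for a 3 &#x000D7; 3 kernel the separable form needs roughly one ninth of the weights of a standard convolution when the number of output channels is large.</p>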
<p>Based on the aforementioned research, this paper proposes a lightweight E-DSC structure that takes the gradient path design of the ELAN structure into account. The goal of this structure is to enhance the network&#x00027;s learning ability without introducing too many parameters, which is achieved by optimizing the stacking of computing modules and incorporating depthwise separable convolutions. The E-DSC structure is illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<p>This article replaces the CSP with E-DSC in the backbone to enhance the feature extraction ability of the YOLOv4-tiny. The structural details of the improved backbone are shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>The structural details of the improved backbone.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Steps</bold></th>
<th valign="top" align="left"><bold>Operation</bold></th>
<th valign="top" align="left"><bold>Resolution</bold></th>
<th valign="top" align="left"><bold>Output channels</bold></th>
<th valign="top" align="left"><bold>Number of times</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Input</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">608</td>
<td valign="top" align="left">3</td>
<td valign="top" align="left">-</td>
</tr> <tr>
<td valign="top" align="left">1</td>
<td valign="top" align="left">CBL</td>
<td valign="top" align="left">304</td>
<td valign="top" align="left">32</td>
<td valign="top" align="left">1</td>
</tr> <tr>
<td valign="top" align="left">2</td>
<td valign="top" align="left">E-DSC</td>
<td valign="top" align="left">152</td>
<td valign="top" align="left">64</td>
<td valign="top" align="left">1</td>
</tr> <tr>
<td valign="top" align="left">3</td>
<td valign="top" align="left">E-DSC</td>
<td valign="top" align="left">76</td>
<td valign="top" align="left">128</td>
<td valign="top" align="left">1</td>
</tr> <tr>
<td valign="top" align="left">4</td>
<td valign="top" align="left">E-DSC</td>
<td valign="top" align="left">76</td>
<td valign="top" align="left">256</td>
<td valign="top" align="left">1</td>
</tr> <tr>
<td valign="top" align="left">5</td>
<td valign="top" align="left">Maxpool</td>
<td valign="top" align="left">38</td>
<td valign="top" align="left">256</td>
<td valign="top" align="left">1</td>
</tr> <tr>
<td valign="top" align="left">6</td>
<td valign="top" align="left">E-DSC</td>
<td valign="top" align="left">38</td>
<td valign="top" align="left">512</td>
<td valign="top" align="left">1</td>
</tr> <tr>
<td valign="top" align="left">7</td>
<td valign="top" align="left">Maxpool</td>
<td valign="top" align="left">19</td>
<td valign="top" align="left">512</td>
<td valign="top" align="left">1</td>
</tr>
<tr>
<td valign="top" align="left">8</td>
<td valign="top" align="left">S-RFB</td>
<td valign="top" align="left">19</td>
<td valign="top" align="left">512</td>
<td valign="top" align="left">1</td>
</tr>
</tbody>
</table>
</table-wrap></sec>
<sec>
<title>3.2. Feature fusion refinement module</title>
<p>YOLOv4-tiny uses FPN to fuse feature maps of different scales to predict objects of different sizes, which can improve the overall detection accuracy of the network. In practice, the fusion of feature maps with different scales through up-sampling operations often fails to accurately represent the fused multiscale features due to semantic information differences and interference information. This paper proposes an FFRM to refine and enhance the fused features. The FFRM structure is shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The structure diagram of FFRM.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1220443-g0003.tif"/>
</fig>
<p>In this paper, we leverage the inverted residual structure in MobileNetV2 (Sandler et al., <xref ref-type="bibr" rid="B29">2018</xref>) and the ECA mechanism proposed herein to construct the semantic information refinement module and the texture information refinement module. These modules are designed to extract semantic and texture information from feature maps of varying scales without introducing too many parameters. This enables the network to learn the significance of feature maps in different channels and spatial dimensions, allowing it to highlight important features while suppressing interference information.</p>
<p>The FFRM takes a low-level feature map <italic><bold>M</bold></italic><sub>1</sub> and a high-level feature map <italic><bold>M</bold></italic><sub>2</sub> as inputs. Firstly, the semantic information refinement module extracts semantic features from <italic><bold>M</bold></italic><sub>2</sub>. Secondly, the refined semantic features are upsampled using bilinear interpolation and concatenated with <italic><bold>M</bold></italic><sub>1</sub> to obtain the fused feature map <italic><bold>M</bold></italic><sub>3</sub>. Then, the texture information refinement module filters out interference information in <italic><bold>M</bold></italic><sub>3</sub>. Finally, an element-wise addition with the upsampled semantic features integrates high-level semantic information and low-level texture information, resulting in the output feature map <italic><bold>M</bold></italic>&#x00027;. The output feature map <italic><bold>M</bold></italic>&#x00027; can be represented as:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02297;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x02191;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x000D7;</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02295;</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>M</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x02191;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x000D7;</mml:mo></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>R</italic><sub>C</sub> represents the semantic information refinement module, <italic>R</italic><sub>T</sub> represents the texture information refinement module, &#x02297; represents the concatenation operation, &#x02295; represents the element-wise summation operation, and &#x02191;<sub>2&#x000D7;</sub> represents 2&#x000D7; bilinear interpolation up-sampling.</p>
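<p>The shape bookkeeping implied by Equation (1) can be sketched as follows. This is only an illustrative sketch: the function names and channel counts are hypothetical, and it assumes that <italic>R</italic><sub>C</sub> preserves the shape of its input and that <italic>R</italic><sub>T</sub> maps the fused map back to the channel count of the up-sampled semantic branch, which the element-wise summation requires but the text does not state explicitly.</p>
<preformat><![CDATA[
```python
# Shape bookkeeping for Eq. (1). Shapes are (channels, height, width).
# Assumptions (not stated in the text): R_C keeps the shape of M2, and R_T
# maps the fused map M3 to the channel count of the up-sampled semantic branch.


def upsample2x(shape):
    """Shape after 2x bilinear up-sampling (channel count unchanged)."""
    c, h, w = shape
    return (c, 2 * h, 2 * w)


def concat_channels(a, b):
    """Shape after channel-wise concatenation of two feature maps."""
    if a[1:] != b[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    return (a[0] + b[0], a[1], a[2])


def ffrm_shapes(m1, m2):
    """Return the shapes of M3 and M' for low-level m1 and high-level m2."""
    semantic_up = upsample2x(m2)              # R_C(M2), up-sampled by 2x
    m3 = concat_channels(m1, semantic_up)     # M3: M1 concatenated with the semantic branch
    texture = (semantic_up[0], m3[1], m3[2])  # R_T(M3), channel count assumed as above
    return m3, texture                        # M' shares the shape of both summands


# Hypothetical channel counts and illustrative spatial sizes.
print(ffrm_shapes((128, 38, 38), (256, 19, 19)))
```
]]></preformat>
<p>Under these assumptions, the low-level map must have exactly twice the spatial size of the high-level map; otherwise the concatenation step fails.</p>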
<sec>
<title>3.2.1. Efficient coordinate attention mechanism</title>
<p>Traffic signs on urban roads are small and are often surrounded by a large amount of background interference. The coordinate attention mechanism aggregates information from the input feature map using 1D global average pooling only. Average pooling preserves the overall response of a feature map, so the response of a small target can be diluted by a complex background. To address this problem, this paper presents an efficient coordinate attention (ECA) mechanism that uses both global average pooling and global maximum pooling. The maximum pooling branch extracts the extreme responses of each channel, allowing the network to better focus on small-target features during down-sampling and to highlight the most salient features of the input even in the presence of complex backgrounds and other sources of interference. In addition, embedding the coordinate attention module in the network increases the number of parameters, which can reduce detection speed. To address this, this paper introduces group convolution (Krizhevsky et al., <xref ref-type="bibr" rid="B16">2017</xref>) and channel shuffle (Zhang et al., <xref ref-type="bibr" rid="B43">2018</xref>) into the structure. These techniques reduce the number of module parameters and the computational complexity while maintaining high accuracy. The structure of ECA is shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Diagram of efficient coordinate attention.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1220443-g0004.tif"/>
</fig>
<p>To begin with, each channel of the input feature map <bold>X</bold> &#x02208; <bold>R</bold><sup><italic>C</italic>&#x000D7;<italic>H</italic>&#x000D7;<italic>W</italic></sup> is encoded along the horizontal and vertical coordinate directions using global average pooling and global maximum pooling with kernel sizes of (<italic>H</italic>,1) and (1,<italic>W</italic>), respectively. The resulting features in the horizontal and vertical directions form four direction-aware feature maps. Thus, the outputs of the <italic>c</italic>-th channel at height <italic>h</italic> can be formulated as:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>x</italic><sub><italic>c</italic></sub>(<italic>h</italic>,<italic>i</italic>) represents the <italic>c</italic>-th channel component with coordinates (<italic>h</italic>,<italic>i</italic>) in the input feature map <italic><bold>X</bold></italic>. <italic>Avg</italic> represents the global average pooling. <italic>Max</italic> represents the global maximum pooling. <inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> represent the <italic>c</italic>-th channel output at height <italic>h</italic> after passing through the global average pooling and the global maximum pooling, respectively.</p>
<p>Similarly, the outputs of the <italic>c</italic>-th channel at width <italic>w</italic> can be formulated as:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>H</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E5"><label>(5)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mi>M</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>H</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>x</italic><sub><italic>c</italic></sub>(<italic>j</italic>,<italic>w</italic>) represents the <italic>c</italic>-th channel component with coordinates (<italic>j</italic>,<italic>w</italic>) in the input feature map <italic><bold>X</bold></italic>. <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M9"><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> represent the <italic>c</italic>-th channel output at width <italic>w</italic> after passing through the global average pooling and the global maximum pooling, respectively.</p>
<p>Then, the average-pooled and max-pooled components in each direction, <inline-formula><mml:math id="M10"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M11"><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M12"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M13"><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, are merged pairwise by element-wise addition, as follows:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E7"><label>(7)</label><mml:math id="M15"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
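<p>For a single channel, the directional pooling and merging steps of Equations (2)&#x02013;(7) can be sketched in plain Python as below. This is a minimal illustration rather than the authors&#x00027; implementation: the function name <italic>directional_pool</italic> and the toy input are hypothetical, and <italic>Max</italic> is read as the maximum over the pooled direction.</p>
<preformat><![CDATA[
```python
def directional_pool(x):
    """Pool one channel along both directions and merge avg and max responses.

    x is a list of H rows, each a list of W numbers (one channel of X).
    Returns (z_h, z_w): z_h has one value per row, z_w one value per column.
    """
    height, width = len(x), len(x[0])
    columns = list(zip(*x))

    a_h = [sum(row) / width for row in x]          # Eq. (2): average over the width
    m_h = [max(row) for row in x]                  # Eq. (3): maximum over the width
    a_w = [sum(col) / height for col in columns]   # Eq. (4): average over the height
    m_w = [max(col) for col in columns]            # Eq. (5): maximum over the height

    z_h = [a + m for a, m in zip(a_h, m_h)]        # Eq. (6)
    z_w = [a + m for a, m in zip(a_w, m_w)]        # Eq. (7)
    return z_h, z_w


# Toy 3 x 4 channel with one strong response, standing in for a small target.
channel = [
    [0, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 0, 0],
]
z_h, z_w = directional_pool(channel)
print(z_h)  # [0.0, 10.0, 0.0]
print(z_w)
```
]]></preformat>
<p>In the toy input, the average over the row containing the response is only 2, whereas the maximum is 8, which illustrates why adding the max-pooled branch keeps an isolated small-target response prominent.</p>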
<p>Then, the two output feature tensors are concatenated along the spatial dimension to generate the feature map <bold>Z</bold> &#x02208; <bold>R</bold><sup><italic>C</italic>&#x000D7;1 &#x000D7; (<italic>W</italic>&#x0002B;<italic>H</italic>)</sup>. The feature map <bold>Z</bold> is divided into <italic>G</italic> groups along the channel direction, i.e., <bold>Z</bold> &#x0003D; [<bold>Z</bold><sub>1</sub>, ..., <bold>Z</bold><sub><italic>G</italic></sub>], <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>R</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>W</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>H</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>G</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. A shared 1 &#x000D7; 1 convolutional transformation function <italic>F</italic> is used to reduce the dimension of each group of feature maps. The process can be formulated as:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M17"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>f</mml:mtext></mml:mstyle><mml:mo>=</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>F</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B4; represents the H-swish activation function. <bold>f</bold>&#x02208;<bold>R</bold><sup><italic>C</italic>&#x000D7;1 &#x000D7; (<italic>W</italic>&#x0002B;<italic>H</italic>)/<italic>G</italic>&#x000D7;<italic>r</italic></sup> is the intermediate feature map of the <italic>K</italic>-th group, and <italic>r</italic> is the reduction ratio that controls the module size.</p>
<p>Stacking group convolutions consecutively causes a boundary effect: each output channel is computed from only a small subset of the input channels, so no information is exchanged between groups, which weakens the network&#x00027;s ability to extract global information. Therefore, after obtaining the intermediate feature map, we apply the channel shuffle operation to rearrange the channel order across groups, which allows information to flow between groups across multiple group convolution layers. In addition, we conducted experiments on the CCTSDB dataset to compare the performance of models with and without channel shuffle, as shown in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Performance comparison of models with/without channel shuffle on the CCTSDB dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Models</bold></th>
<th valign="top" align="left"><bold>mAP&#x00040;0.5/%</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Baseline</td>
<td valign="top" align="left">92.44</td>
</tr> <tr>
<td valign="top" align="left">FFRM (no shuffle)</td>
<td valign="top" align="left">93.58</td>
</tr>
<tr>
<td valign="top" align="left">FFRM (shuffle)</td>
<td valign="top" align="left">94.28</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The results in <xref ref-type="table" rid="T2">Table 2</xref> show that using the channel shuffle operation in the FFRM raises mAP&#x00040;0.5 by 0.70 percentage points (94.28% vs. 93.58%) compared with omitting it. This result supports the use of channel shuffle after group convolution, which allows the network to learn more diverse features.</p>
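<p>The channel shuffle operation itself can be illustrated with a short sketch. The text does not spell out the exact permutation, so the code below assumes the commonly used formulation (reshape the channels into groups, transpose, and flatten); the function name <italic>channel_shuffle</italic> is hypothetical, and channels are represented by their indices only.</p>
<preformat><![CDATA[
```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each new group draws from every old group.

    channels is a list with one entry per channel, ordered group by group.
    Equivalent to reshaping to (groups, per_group), transposing, and flattening.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = len(channels) // groups
    return [
        channels[g * per_group + k]
        for k in range(per_group)
        for g in range(groups)
    ]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```
]]></preformat>
<p>After the shuffle, any block of consecutive channels mixes channels from both original groups, so a subsequent group convolution receives information from every group.</p>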
<p>Then, the intermediate feature map is split along the spatial dimension into two separate feature tensors, <bold>f</bold><sup><italic>h</italic></sup>&#x02208;<bold>R</bold><sup><italic>C</italic>&#x000D7;<italic>H</italic>&#x000D7;1/<italic>r</italic></sup> and <bold>f</bold><sup><italic>w</italic></sup>&#x02208;<bold>R</bold><sup><italic>C</italic>&#x000D7;1 &#x000D7; <italic>W</italic>/<italic>r</italic></sup>. Next, two convolutional transformations <italic>F</italic><sub><italic>h</italic></sub> and <italic>F</italic><sub><italic>w</italic></sub> restore the channel numbers of the two tensors so that they match the channel number of the input feature map. The process can be expressed by the following formulas:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>f</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E10"><label>(10)</label><mml:math id="M19"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>f</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003C3; is the sigmoid activation function.</p>
<p>Finally, the two output tensors are used as attention weights: they are expanded through the broadcast mechanism and multiplied element-wise with the input feature map <bold>X</bold> to obtain the final output feature map <bold>Y</bold>. The process can be formulated as:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M20"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></sec></sec>
<sec>
<title>3.3. S-RFB module</title>
<p>The YOLOv4-tiny network extracts features using only fixed-size convolutional kernels, which results in a single receptive field in each layer of the network and makes it difficult to capture multiscale information. To address this, this study presents an improved version of the receptive field block called S-RFB. S-RFB integrates dilated convolutions with different dilation rates, which enriches the extracted features with contextual information and diverse receptive fields. This improves the detection of small traffic sign targets, as the network becomes better equipped to capture and distinguish fine details.</p>
<p>The structure of the S-RFB module is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. Firstly, features are extracted from the input feature map of size (<italic>C</italic>,<italic>H</italic>,<italic>W</italic>) using dilated convolutions. The dilation rates are set to 1, 3, and 5, respectively, to obtain three different sizes of receptive fields. To extract more detailed features from the small input feature map of this module, a small 3 &#x000D7; 3 kernel is selected in this paper. Meanwhile, the number of convolution kernels in each branch is set to <italic>C</italic>/4 to prevent excessive parameters from being introduced. Secondly, a 1 &#x000D7; 1 convolution with <italic>C</italic>/4 kernels is applied to the input feature map to form a shortcut branch that acts as an identity-like mapping to the output. Finally, the generated feature maps are fused by the Concat operation to aggregate network context information, further enhancing the network&#x00027;s capability to detect small targets.</p>
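<p>The receptive fields of the three dilated branches follow from the standard relation for a dilated kernel, <italic>k</italic> &#x0002B; (<italic>k</italic> &#x02212; 1)(<italic>d</italic> &#x02212; 1), where <italic>k</italic> is the kernel size and <italic>d</italic> the dilation rate. The sketch below evaluates this relation and the concatenated channel count; the function names are hypothetical, and it assumes that the three dilated branches and the 1 &#x000D7; 1 shortcut branch are all concatenated along the channel dimension.</p>
<preformat><![CDATA[
```python
def effective_kernel(kernel, dilation):
    """Side length of the region covered by one dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def srfb_summary(channels, kernel=3, dilations=(1, 3, 5)):
    """Return per-branch receptive fields and the concatenated channel count.

    Assumption: three dilated branches plus one 1x1 shortcut branch, each
    producing channels / 4 feature maps, are concatenated along the channels.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    branch_channels = channels // 4
    fields = [effective_kernel(kernel, d) for d in dilations]
    out_channels = branch_channels * (len(dilations) + 1)
    return fields, out_channels


# Hypothetical input with C = 256 channels.
print(srfb_summary(256))  # ([3, 7, 11], 256)
```
]]></preformat>
<p>Under these assumptions, the 3 &#x000D7; 3 kernels cover 3 &#x000D7; 3, 7 &#x000D7; 7, and 11 &#x000D7; 11 regions without adding weights, and the four <italic>C</italic>/4 branches concatenate back to <italic>C</italic> channels.</p>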
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Architecture of S-RFB module.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1220443-g0005.tif"/>
</fig></sec></sec>
<sec id="s4">
<title>4. Experimental results and analysis</title>
<sec>
<title>4.1. Dataset preparation</title>
<p>This paper first performs ablation experiments on the CCTSDB dataset (Zhang et al., <xref ref-type="bibr" rid="B41">2017</xref>) to validate the contribution of each module to the model&#x00027;s performance, and compares the proposed method with other advanced object detection methods. The method is also trained and tested on the TT100K dataset (Zhu et al., <xref ref-type="bibr" rid="B45">2016</xref>), which has richer traffic sign categories and smaller target areas, to further verify its generalization ability.</p>
<p>The CCTSDB dataset consists of 13,826 images with nearly 60,000 traffic signs, divided into three categories: mandatory, prohibitory, and warning. Compared to other public traffic sign datasets, this dataset contains mainly urban road scenes with more interference around the targets. This paper divides the dataset into a training set and a test set at a ratio of 9:1.</p>
<p>The TT100K dataset consists of images with a resolution of 2,048 &#x000D7; 2,048, containing 221 categories of traffic signs with a total of approximately 26,349 targets. Around 40.5% of the traffic signs have an area of &#x0003C; 32 &#x000D7; 32 pixels, so the algorithm must be highly capable of detecting small targets. To maintain a balanced distribution of target categories, this paper only selects the 45 types of traffic signs with more than 100 images for training. A total of 7,968 images are used as the dataset, with 5,289 images used for training and 2,679 images used for testing.</p></sec>
<sec>
<title>4.2. Experimental details</title>
<p>The experimental platform in this article is equipped with an Intel Xeon Silver 4110 processor with 32 GB of memory and two NVIDIA Tesla P4 GPUs with 8 GB of video memory. The operating system is Ubuntu 16.04, and the deep learning framework is PyTorch 1.2.0.</p>
<p>To ensure experimental fairness, this paper adopts the same training parameter settings for all network models. The input image size is set to 608 &#x000D7; 608. The initial learning rate is 0.001. The batch size is 16. The number of training epochs is set to 500. Adam is selected as the optimizer. The cosine annealing algorithm is used during training to decay the learning rate.</p></sec>
<sec>
<title>4.3. Evaluation indicators</title>
<p>In the experiment, precision (P), recall (R), mean average precision (mAP), frames per second (FPS), and the number of parameters (Params) are selected to evaluate the performance of the algorithms. Precision and recall measure the classification ability and detection ability of the algorithm for targets, and the mAP comprehensively evaluates the detection performance of the algorithm. The formulas for calculating precision, recall, and mean average precision are as follows:</p>
<disp-formula id="E12"><label>(12)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E13"><label>(13)</label><mml:math id="M22"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E14"><label>(14)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>m</mml:mi><mml:mi>A</mml:mi><mml:mi>P</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x0222B;</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo class="qopname">d</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>TP</italic> is the number of detections that are reported as positive samples and are correct. <italic>FP</italic> is the number of detections that are reported as positive samples and are incorrect. <italic>FN</italic> is the number of targets that are incorrectly treated as negative samples, i.e., missed targets. <italic>C</italic> represents the number of target categories.</p>
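<p>Equations (12)&#x02013;(14) can be illustrated with a small numerical sketch. The function names and the sample counts and curve points are hypothetical, and the integral in Equation (14) is approximated here with the trapezoidal rule over sampled (recall, precision) points; detection toolkits often use an interpolated variant, so the numbers are for illustration only.</p>
<preformat><![CDATA[
```python
def precision(tp, fp):
    """Eq. (12): fraction of reported detections that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Eq. (13): fraction of ground-truth targets that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(points):
    """Area under one class's precision-recall curve (trapezoidal rule).

    points is a list of (recall, precision) pairs sorted by increasing recall.
    """
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


def mean_average_precision(curves):
    """Eq. (14): mean of the per-class average precision values."""
    return sum(average_precision(c) for c in curves) / len(curves)


# Hypothetical counts and curves for two classes.
print(precision(tp=90, fp=10), recall(tp=90, fn=30))
curves = [
    [(0.0, 1.0), (1.0, 1.0)],
    [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)],
]
print(mean_average_precision(curves))
```
]]></preformat>
<p>In this example, the first class has an average precision of 1.0 and the second of 0.875, so the mean over the two classes is 0.9375.</p>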
<p>FPS represents the number of image frames that the network processes per second and is used to evaluate the real-time performance of the model. Params refers to the total number of model parameters; for each convolutional layer, the number of parameters is calculated as follows:</p>
<disp-formula id="E15"><label>(15)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>m</mml:mi><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>K</italic><sub><italic>h</italic></sub> and <italic>K</italic><sub><italic>w</italic></sub> represent the height and width of the convolution kernel, respectively. <italic>C</italic><sub><italic>in</italic></sub> and <italic>C</italic><sub><italic>out</italic></sub> represent the number of input and output channels of the convolutional layer, respectively.</p></sec>
<sec>
<title>4.4. Experimental results and analysis</title>
<sec>
<title>4.4.1. Comparison and analysis of experimental results based on CCTSDB dataset</title>
<p>To assess the effectiveness of the proposed method, we conduct a comparative analysis with six state-of-the-art object detection algorithms on the CCTSDB dataset: Faster R-CNN, Centernet (Zhou et al., <xref ref-type="bibr" rid="B44">2019</xref>), SSD, YOLOv5-s (Glenn, <xref ref-type="bibr" rid="B6">2020</xref>), YOLOv4-tiny, and an improved version of YOLOv4 based on an attention mechanism (Zhang et al., <xref ref-type="bibr" rid="B42">2022</xref>). Additionally, we report the results of YOLOv4-tiny combined with each of the three proposed modules individually, so that the contribution of each module is more apparent. Specifically, E-DSC, FFRM, and S-RFB denote YOLOv4-tiny improved with the corresponding module. The evaluation results are tabulated in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Performance comparison results of different models on CCTSDB dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Models</bold></th>
<th valign="top" align="center"><bold>R/%</bold></th>
<th valign="top" align="center"><bold>P/%</bold></th>
<th valign="top" align="left"><bold>mAP&#x00040; 0.5/%</bold></th>
<th valign="top" align="left"><bold>Params/ MB</bold></th>
<th valign="top" align="left"><bold>FPS/(frame/ s)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Faster R-CNN</td>
<td valign="top" align="center">78.84</td>
<td valign="top" align="center">82.06</td>
<td valign="top" align="left">84.74</td>
<td valign="top" align="left">137.09</td>
<td valign="top" align="left">6</td>
</tr> <tr>
<td valign="top" align="left">Centernet</td>
<td valign="top" align="center">65.89</td>
<td valign="top" align="center">96.94</td>
<td valign="top" align="left">91.14</td>
<td valign="top" align="left">32.66</td>
<td valign="top" align="left">32</td>
</tr> <tr>
<td valign="top" align="left">SSD</td>
<td valign="top" align="center">68.15</td>
<td valign="top" align="center">90.27</td>
<td valign="top" align="left">76.92</td>
<td valign="top" align="left">26.28</td>
<td valign="top" align="left">35</td>
</tr> <tr>
<td valign="top" align="left">Improved YOLOv4</td>
<td valign="top" align="center">-</td>
<td valign="top" align="center">-</td>
<td valign="top" align="left">96.88</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">40</td>
</tr> <tr>
<td valign="top" align="left">YOLOv5-s</td>
<td valign="top" align="center">89.05</td>
<td valign="top" align="center">92.32</td>
<td valign="top" align="left">95.11</td>
<td valign="top" align="left">27.6</td>
<td valign="top" align="left">42</td>
</tr> <tr>
<td valign="top" align="left">YOLOv4-tiny</td>
<td valign="top" align="center">83.81</td>
<td valign="top" align="center">93.14</td>
<td valign="top" align="left">92.44</td>
<td valign="top" align="left">23.1</td>
<td valign="top" align="left">100</td>
</tr> <tr>
<td valign="top" align="left">E-DSC</td>
<td valign="top" align="center">87.05</td>
<td valign="top" align="center">96.34</td>
<td valign="top" align="left">94.31</td>
<td valign="top" align="left">17.6</td>
<td valign="top" align="left">87</td>
</tr> <tr>
<td valign="top" align="left">FFRM</td>
<td valign="top" align="center">88.34</td>
<td valign="top" align="center">95.08</td>
<td valign="top" align="left">94.28</td>
<td valign="top" align="left">25.3</td>
<td valign="top" align="left">85</td>
</tr> <tr>
<td valign="top" align="left">S-RFB</td>
<td valign="top" align="center">84.85</td>
<td valign="top" align="center">94.49</td>
<td valign="top" align="left">93.86</td>
<td valign="top" align="left">23.4</td>
<td valign="top" align="left">91</td>
</tr>
<tr>
<td valign="top" align="left">E-YOLOv4-tiny</td>
<td valign="top" align="center">90.14</td>
<td valign="top" align="center">98.32</td>
<td valign="top" align="left">96.2</td>
<td valign="top" align="left">18.2</td>
<td valign="top" align="left">62</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The results reported in <xref ref-type="table" rid="T3">Table 3</xref> indicate that the E-DSC backbone, as the primary feature extractor of the model, contributes the largest single-module gain, raising mAP by 1.87 percentage points over YOLOv4-tiny. The proposed algorithm outperforms the compared two-stage and one-stage algorithms in both accuracy and parameter efficiency. Compared to Faster R-CNN, SSD, Centernet, and YOLOv5-s, the proposed method achieves mAP gains of 11.46, 19.28, 5.06, and 1.09 percentage points, respectively. Moreover, compared to the original YOLOv4-tiny, the proposed method improves mAP by 3.76 percentage points while reducing the parameter size by 21%. The improved YOLOv4 algorithm based on an attention mechanism achieves a mAP of 96.88% at 40 FPS, meeting real-time requirements. The proposed method reaches a comparable mAP (96.20%) at a higher detection speed (62 FPS), demonstrating a good balance between model accuracy and speed.</p>
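<p>The differences quoted above can be re-derived directly from the Table 3 entries. The short Python sketch below is only an arithmetic check: the values are copied from the table, and the helper names are our own illustrative choices.</p>
<preformat>
```python
# mAP@0.5 (%) and parameter size (MB) copied from Table 3.
MAP = {
    "Faster R-CNN": 84.74,
    "SSD": 76.92,
    "Centernet": 91.14,
    "YOLOv5-s": 95.11,
    "YOLOv4-tiny": 92.44,
    "E-YOLOv4-tiny": 96.20,
}
PARAMS_MB = {"YOLOv4-tiny": 23.1, "E-YOLOv4-tiny": 18.2}


def map_gain(model, reference="E-YOLOv4-tiny"):
    """Absolute mAP difference in percentage points, rounded to two decimals."""
    return round(MAP[reference] - MAP[model], 2)


def relative_reduction(before, after):
    """Relative reduction in percent, rounded to one decimal."""
    return round(100.0 * (before - after) / before, 1)


gains = {name: map_gain(name) for name in MAP if name != "E-YOLOv4-tiny"}
param_cut = relative_reduction(PARAMS_MB["YOLOv4-tiny"], PARAMS_MB["E-YOLOv4-tiny"])
print(gains)      # gains of 11.46, 19.28, 5.06, 1.09, and 3.76 points
print(param_cut)  # 21.2
```
</preformat>
<p>Note that the mAP gains are absolute differences in percentage points, whereas the parameter reduction is a relative change with respect to the 23.1 MB baseline.</p>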
<p><xref ref-type="fig" rid="F6">Figure 6</xref> illustrates the detection performance of our proposed E-YOLOv4-tiny model and the YOLOv4-tiny model on the CCTSDB dataset. The first set of images indicates that the E-YOLOv4-tiny model achieves higher confidence levels than the baseline model and detects three &#x0201C;mandatory&#x0201D; signs that the baseline misses. The second set of images demonstrates that our model still achieves good detection accuracy even in the presence of numerous interfering objects around small targets. In the third set of images, the YOLOv4-tiny model misses two objects, whereas our model detects all of them. These qualitative results indicate that our proposed E-YOLOv4-tiny model outperforms the original YOLOv4-tiny model in detecting small objects.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Detection results of CCTSDB dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1220443-g0006.tif"/>
</fig></sec>
<sec>
<title>4.4.2. Comparison and analysis of experimental results based on TT100K dataset</title>
<p>To further validate the generalization ability of our proposed method for detecting traffic signs, we conduct experiments on the TT100K dataset. We compare the performance of our method against several object detection algorithms, including Faster R-CNN, Centernet, SSD, YOLOv5-s, YOLOv4-tiny, and an improved YOLOv4-tiny (Wang L. et al., <xref ref-type="bibr" rid="B35">2021</xref>). We also report the results of YOLOv4-tiny combined with each of the three proposed modules (E-DSC, FFRM, and S-RFB) individually. The results of the experiments are presented in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Performance comparison results of different models on TT100K dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Models</bold></th>
<th valign="top" align="center"><bold>mAP&#x00040;0.5/%</bold></th>
<th valign="top" align="left"><bold>Params/ MB</bold></th>
<th valign="top" align="left"><bold>FPS/(frame/ s)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Faster R-CNN</td>
<td valign="top" align="center">46.23</td>
<td valign="top" align="left">137.09</td>
<td valign="top" align="left">6</td>
</tr> <tr>
<td valign="top" align="left">Centernet</td>
<td valign="top" align="center">44.09</td>
<td valign="top" align="left">32.66</td>
<td valign="top" align="left">32</td>
</tr> <tr>
<td valign="top" align="left">SSD</td>
<td valign="top" align="center">40.17</td>
<td valign="top" align="left">26.28</td>
<td valign="top" align="left">35</td>
</tr> <tr>
<td valign="top" align="left">Improved YOLOv4-tiny</td>
<td valign="top" align="center">52.07</td>
<td valign="top" align="left">24.7</td>
<td valign="top" align="left">-</td>
</tr> <tr>
<td valign="top" align="left">YOLOv5-s</td>
<td valign="top" align="center">53.28</td>
<td valign="top" align="left">27.6</td>
<td valign="top" align="left">42</td>
</tr> <tr>
<td valign="top" align="left">YOLOv4-tiny</td>
<td valign="top" align="center">47</td>
<td valign="top" align="left">23.1</td>
<td valign="top" align="left">100</td>
</tr> <tr>
<td valign="top" align="left">E-DSC</td>
<td valign="top" align="center">50.09</td>
<td valign="top" align="left">17.6</td>
<td valign="top" align="left">87</td>
</tr> <tr>
<td valign="top" align="left">FFRM</td>
<td valign="top" align="center">49.65</td>
<td valign="top" align="left">25.3</td>
<td valign="top" align="left">85</td>
</tr> <tr>
<td valign="top" align="left">S-RFB</td>
<td valign="top" align="center">49.27</td>
<td valign="top" align="left">23.4</td>
<td valign="top" align="left">91</td>
</tr>
<tr>
<td valign="top" align="left">E-YOLOv4-tiny</td>
<td valign="top" align="center">54.37</td>
<td valign="top" align="left">18.2</td>
<td valign="top" align="left">62</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Based on the results presented in <xref ref-type="table" rid="T4">Table 4</xref>, the mAP of our model on the TT100K dataset is notably lower than on the CCTSDB dataset (54.37% versus 96.20%). This difference can be attributed to the smaller absolute area of the targets in the TT100K dataset. Among the compared object detection algorithms, our model achieves the highest mAP and the smallest parameter size; only the single-module E-DSC variant is smaller (17.6 MB versus 18.2 MB). Specifically, our proposed method achieves a mAP of 54.37%, which is 7.37 percentage points higher than the original YOLOv4-tiny and 2.30 percentage points higher than the improved YOLOv4-tiny.</p>
<p>In <xref ref-type="fig" rid="F7">Figure 7</xref>, we compare the detection performance of our proposed model with that of the YOLOv4-tiny model on the TT100K dataset. The first set of images demonstrates that our model outperforms YOLOv4-tiny in detecting small, distant traffic signs, which indicates that the backbone structure of our model effectively captures the features associated with small targets and reduces the rate of missed detections. The second set of images shows that our model accurately detects three urban road traffic signs despite the additional interference around them. Furthermore, our proposed method yields higher confidence scores than the original model.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Detection results of TT100K dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1220443-g0007.tif"/>
</fig></sec></sec>
<sec>
<title>4.5. Ablation experiments</title>
<p>To assess the contribution of each proposed improvement, we conduct ablation experiments on the CCTSDB dataset. We use the YOLOv4-tiny model as the baseline and add the proposed modules to it, first individually and then together. The modified models are trained and tested on the CCTSDB dataset, and the results are presented in <xref ref-type="table" rid="T5">Table 5</xref>.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Results of ablation experiments based on CCTSDB dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Models</bold></th>
<th valign="top" align="center"><bold>mAP&#x00040;0.5/%</bold></th>
<th valign="top" align="center"><bold>Params/MB</bold></th>
<th valign="top" align="center"><bold>FPS/(frame/s)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Baseline</td>
<td valign="top" align="center">92.44</td>
<td valign="top" align="center">23.1</td>
<td valign="top" align="center">100</td>
</tr> <tr>
<td valign="top" align="left">E-DSC</td>
<td valign="top" align="center">94.31</td>
<td valign="top" align="center">17.6</td>
<td valign="top" align="center">87</td>
</tr> <tr>
<td valign="top" align="left">FFRM</td>
<td valign="top" align="center">94.28</td>
<td valign="top" align="center">25.3</td>
<td valign="top" align="center">85</td>
</tr> <tr>
<td valign="top" align="left">S-RFB</td>
<td valign="top" align="center">93.86</td>
<td valign="top" align="center">23.4</td>
<td valign="top" align="center">91</td>
</tr>
<tr>
<td valign="top" align="left">E-YOLOv4-tiny</td>
<td valign="top" align="center">96.20</td>
<td valign="top" align="center">18.2</td>
<td valign="top" align="center">62</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Comparing the performance indicators of the baseline model and our proposed improvements in <xref ref-type="table" rid="T5">Table 5</xref>, we observe that E-DSC, FFRM, and S-RFB each improve mAP, by 1.87, 1.84, and 1.42 percentage points, respectively. The E-YOLOv4-tiny model, which integrates all three improvements, achieves the highest mAP, 3.76 percentage points above the baseline. The E-YOLOv4-tiny model also reduces the parameter size by 21% (from 23.1 to 18.2 MB), which makes it more efficient for practical applications, although its detection speed drops from 100 to 62 FPS.</p>
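<p>To make the accuracy, size, and speed trade-off in Table 5 easier to read, the following Python sketch tabulates each variant against the baseline. The numbers are copied from Table 5; converting FPS to per-frame latency as 1000/FPS milliseconds is our own illustrative addition and assumes the reported FPS reflects single-image inference.</p>
<preformat>
```python
# (mAP@0.5 in %, parameter size in MB, FPS) copied from Table 5.
ABLATION = {
    "Baseline": (92.44, 23.1, 100),
    "E-DSC": (94.31, 17.6, 87),
    "FFRM": (94.28, 25.3, 85),
    "S-RFB": (93.86, 23.4, 91),
    "E-YOLOv4-tiny": (96.20, 18.2, 62),
}


def summarize(table, baseline="Baseline"):
    """Return per-model deltas against the baseline and per-frame latency in ms."""
    base_map, base_params, _ = table[baseline]
    rows = {}
    for name, (m_ap, params, fps) in table.items():
        rows[name] = {
            "map_gain_pp": round(m_ap - base_map, 2),       # percentage points
            "params_delta_mb": round(params - base_params, 1),
            "latency_ms": round(1000.0 / fps, 1),           # assumed: 1000 / FPS
        }
    return rows


for name, row in summarize(ABLATION).items():
    print(name, row)
```
</preformat>
<p>Under these figures, the full model takes about 16.1 ms per frame versus 10.0 ms for the baseline, that is, roughly 6 ms of added latency in exchange for 3.76 percentage points of mAP and 4.9 MB fewer parameters.</p>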
<p>We visualize the activation responses of each of the above models as heat maps for visual comparison, as shown in <xref ref-type="fig" rid="F8">Figure 8</xref>. In each heat map, blue indicates the minimum activation value in the target region and red indicates the maximum activation value.</p>
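<p>The colour convention described above can be sketched as follows. This is a simplified two-colour ramp written for illustration only: the text does not specify how the activation maps were obtained or which colour map was used, so the min-max normalization, the linear blue-to-red interpolation, and the sample values are all assumptions.</p>
<preformat>
```python
def normalize(activation):
    """Min-max scale a 2-D activation map (list of rows) to the range [0, 1]."""
    flat = [v for row in activation for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no contrast; treat every cell as the minimum.
        return [[0.0 for _ in row] for row in activation]
    return [[(v - lo) / span for v in row] for row in activation]


def to_blue_red(activation):
    """Map each value to an (R, G, B) tuple: the minimum is pure blue, the maximum pure red."""
    return [
        [(round(255 * t), 0, round(255 * (1 - t))) for t in row]
        for row in normalize(activation)
    ]


# Hypothetical 2 x 3 activation map, for illustration only.
demo = [[0.1, 0.4, 0.9], [0.2, 0.9, 0.1]]
colors = to_blue_red(demo)
print(colors[0][0], colors[0][2])  # (0, 0, 255) (255, 0, 0)
```
</preformat>
<p>Because the scaling is relative to each map, the same colour in two different heat maps does not necessarily correspond to the same absolute activation value.</p>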
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Visualization of heat maps.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1220443-g0008.tif"/>
</fig>
<p>The original detection image with four traffic signs, including three &#x0201C;prohibitory&#x0201D; signs and one &#x0201C;mandatory&#x0201D; sign, is shown in <xref ref-type="fig" rid="F8">Figure 8A</xref>. The heat map of the YOLOv4-tiny model is illustrated in <xref ref-type="fig" rid="F8">Figure 8B</xref>, where the activation responses of the &#x0201C;prohibitory&#x0201D; and &#x0201C;mandatory&#x0201D; signs are confused, leading to a higher risk of false detection. <xref ref-type="fig" rid="F8">Figure 8C</xref> shows the heat map of the model based on the E-DSC backbone. The result indicates that the model&#x00027;s ability to extract target features has improved, but there is still confusion in the activation response of similar types of targets. <xref ref-type="fig" rid="F8">Figure 8D</xref> shows the heat map of the model using FFRM in the feature fusion section. It can be seen that the FFRM can enable the network to focus on the main features of the target and effectively distinguish the features of each target. <xref ref-type="fig" rid="F8">Figure 8E</xref> shows the heat map of the model with the addition of the S-RFB context enhancement module, effectively enhancing the model&#x00027;s ability to detect targets. <xref ref-type="fig" rid="F8">Figure 8F</xref> showcases the heat map of the E-YOLOv4-tiny model. The model exhibits distinct activation responses for each target, allowing it to effectively focus on the target feature area and enhance the target area feature activation response.</p></sec></sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusion</title>
<p>In this paper, an E-YOLOv4-tiny traffic sign detection algorithm is proposed to address the difficulties faced by autonomous vehicles in recognizing small traffic signs in complex urban road environments. We address these challenges through three main contributions. First, we propose a lightweight E-DSC block to optimize the backbone and enhance the network&#x00027;s ability to extract small-target features. Second, we propose an FFRM that fully fuses multi-scale features while efficiently filtering interference information through the ECA. Finally, we introduce an S-RFB module with a multi-branch structure and dilated convolutional layers, which introduces context information into the network and increases the diversity of the network&#x00027;s receptive field. The experimental results on the CCTSDB and TT100K datasets show that, compared to the YOLOv4-tiny algorithm, our proposed method improves mAP by 3.76 and 7.37 percentage points, respectively, while reducing the parameter size by 21%. Moreover, our method achieves real-time performance at 62 FPS. The advantage of our method therefore lies in balancing model accuracy, parameter efficiency, and real-time performance, which makes it more suitable for practical deployment on edge devices for real-time traffic sign detection. Additionally, our proposed method enhances the ability of the model to detect extremely small objects against the complex backgrounds found on urban roads. The current drawback of our approach is that, although it achieves real-time performance, its detection speed is lower than that of the original algorithm (62 versus 100 FPS). Furthermore, our method does not consider traffic sign detection in extreme weather conditions. Therefore, our next step will be to study traffic sign detection on urban roads in extreme scenarios, such as rain, snow, and extreme lighting conditions, while minimizing the loss of real-time performance as much as possible.</p></sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. These data can be found at: CCTSDB: <ext-link ext-link-type="uri" xlink:href="https://github.com/csust7zhangjm/CCTSDB">https://github.com/csust7zhangjm/CCTSDB</ext-link>; TT100K: <ext-link ext-link-type="uri" xlink:href="https://cg.cs.tsinghua.edu.cn/traffic-sign/">https://cg.cs.tsinghua.edu.cn/traffic-sign/</ext-link>.</p></sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>Conceptualization: YX. Methodology: YX and SY. Formal analysis and investigation: GC. Writing&#x02013;original draft preparation: SY. Writing&#x02013;review and editing: WZ. Resources: LY. Supervision: ZF. All authors contributed to the article and approved the submitted version.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This research was funded by the National Natural Science Foundation of China (Grant 51805490), the Henan Provincial Science and Technology Research Project (22210220013), the Key Scientific Research Project of Colleges and Universities in Henan Province (23ZX013), and the Major Science and Technology Projects of Longmen Laboratory (LMZDXM202204).</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Badue</surname> <given-names>C.</given-names></name> <name><surname>Guidolini</surname> <given-names>R.</given-names></name> <name><surname>Carneiro</surname> <given-names>R. V.</given-names></name></person-group> (<year>2021</year>). <article-title>Self-driving cars: a survey</article-title>. <source>Expert Syst. Appl.</source> <volume>165</volume>, <fpage>113816</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2020.113816</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bochkovskiy</surname> <given-names>A.</given-names></name> <name><surname>Wang</surname> <given-names>C. Y.</given-names></name> <name><surname>Liao</surname> <given-names>H. Y. M.</given-names></name></person-group> (<year>2020</year>). <source>Yolov4: Optimal Speed and Accuracy of Object Detection</source>. <italic>arXiv [preprint]</italic>. <pub-id pub-id-type="doi">10.48550/arXiv.2004.10934</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ghiasi</surname> <given-names>G.</given-names></name> <name><surname>Lin</surname> <given-names>T. Y.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Nas-fpn: Learning scalable feature pyramid architecture for object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>7036</fpage>&#x02013;<lpage>7045</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00720</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Girshick</surname> <given-names>R.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Fast r-cnn,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>), <fpage>1440</fpage>&#x02013;<lpage>1448</lpage>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Donahue</surname> <given-names>J.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name> <name><surname>Malik</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Rich feature hierarchies for accurate object detection and semantic segmentation,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>580</fpage>&#x02013;<lpage>587</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Glenn</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <source>yolov5</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://github.com/ultralytics/yolov5">https://github.com/ultralytics/yolov5</ext-link> (accessed June 10, 2020).</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>M. H.</given-names></name> <name><surname>Xu</surname> <given-names>T. X.</given-names></name> <name><surname>Liu</surname> <given-names>J. J.</given-names></name> <name><surname>Liu</surname> <given-names>Z. N.</given-names></name></person-group> (<year>2022</year>). <article-title>Attention mechanisms in computer vision: a survey</article-title>. <source>Computat. Vis. Media</source> <volume>8</volume>, <fpage>331</fpage>&#x02013;<lpage>368</lpage>. <pub-id pub-id-type="doi">10.1007/s41095-022-0271-y</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>K.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Tian</surname> <given-names>Q.</given-names></name> <name><surname>Guo</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Xu</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Ghostnet: more features from cheap operations,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>), <fpage>1580</fpage>&#x02013;<lpage>1589</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00165</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep residual learning for image recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>770</fpage>&#x02013;<lpage>778</lpage>.<pub-id pub-id-type="pmid">32166560</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hou</surname> <given-names>Q.</given-names></name> <name><surname>Zhou</surname> <given-names>D.</given-names></name> <name><surname>Feng</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Coordinate attention for efficient mobile network design,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Nashville, TN</publisher-loc>), <fpage>13713</fpage>&#x02013;<lpage>13722</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR46437.2021.01350</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Howard</surname> <given-names>A.</given-names></name> <name><surname>Sandler</surname> <given-names>M.</given-names></name> <name><surname>Chu</surname> <given-names>G.</given-names></name> <name><surname>Chen</surname> <given-names>L. C.</given-names></name> <name><surname>Chen</surname> <given-names>B.</given-names></name> <name><surname>Tan</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;Searching for mobilenetv3,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, <fpage>1314</fpage>&#x02013;<lpage>1324</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Howard</surname> <given-names>A. G.</given-names></name> <name><surname>Zhu</surname> <given-names>M.</given-names></name> <name><surname>Chen</surname> <given-names>B.</given-names></name> <name><surname>Kalenichenko</surname> <given-names>D.</given-names></name> <name><surname>Wang</surname> <given-names>W.</given-names></name> <name><surname>Weyand</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2017</year>). <source>Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv [preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1704.04861</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>J.</given-names></name> <name><surname>Shen</surname> <given-names>L.</given-names></name> <name><surname>Sun</surname> <given-names>G.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Squeeze-and-excitation networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>7132</fpage>&#x02013;<lpage>7141</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.01079</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>G.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Van Der Maaten</surname> <given-names>L.</given-names></name> <name><surname>Weinberger</surname> <given-names>K. Q.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Densely connected convolutional networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>4700</fpage>&#x02013;<lpage>4708</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>L.</given-names></name> <name><surname>Chen</surname> <given-names>C.</given-names></name> <name><surname>Yun</surname> <given-names>J.</given-names></name> <name><surname>Sun</surname> <given-names>Y.</given-names></name> <name><surname>Tian</surname> <given-names>J.</given-names></name> <name><surname>Hao</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Multi-scale feature fusion convolutional neural network for indoor small target detection</article-title>. <source>Front. Neurorobot.</source> <volume>16</volume>, <fpage>881021</fpage>. <pub-id pub-id-type="doi">10.3389/fnbot.2022.881021</pub-id><pub-id pub-id-type="pmid">35663726</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2017</year>). <article-title>Imagenet classification with deep convolutional neural networks</article-title>. <source>Commun. ACM.</source> <volume>60</volume>, <fpage>84</fpage>&#x02013;<lpage>90</lpage>. <pub-id pub-id-type="doi">10.1145/3065386</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T. Y.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Hariharan</surname> <given-names>B.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Feature pyramid networks for object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>), <fpage>2117</fpage>&#x02013;<lpage>2125</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.106</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T. Y.</given-names></name> <name><surname>Maire</surname> <given-names>M.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name> <name><surname>Hays</surname> <given-names>J.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name> <name><surname>Ramanan</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>&#x0201C;Microsoft coco: Common objects in context,&#x0201D;</article-title> in <source>Computer Vision&#x02013;ECCV 2014, 13th European Conference</source>, Zurich, Switzerland, September 6&#x02013;12, 2014.</citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S.</given-names></name> <name><surname>Huang</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Receptive field block net for accurate and fast object detection,&#x0201D;</article-title> in <source>Proceedings of the European Conference on Computer Vision (ECCV)</source>, <fpage>385</fpage>&#x02013;<lpage>400</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S.</given-names></name> <name><surname>Qi</surname> <given-names>L.</given-names></name> <name><surname>Qin</surname> <given-names>H.</given-names></name> <name><surname>Shi</surname> <given-names>J.</given-names></name> <name><surname>Jia</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Path aggregation network for instance segmentation,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>8759</fpage>&#x02013;<lpage>8768</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Anguelov</surname> <given-names>D.</given-names></name> <name><surname>Erhan</surname> <given-names>D.</given-names></name> <name><surname>Szegedy</surname> <given-names>C.</given-names></name> <name><surname>Reed</surname> <given-names>S.</given-names></name> <name><surname>Fu</surname> <given-names>C. Y.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;SSD: Single shot multibox detector,&#x0201D;</article-title> in <source>Computer Vision&#x02013;ECCV 2016, 14th European Conference</source>, Amsterdam, The Netherlands, October 11&#x02013;14, 2016.</citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>N.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Zheng</surname> <given-names>H. T.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;ShuffleNet V2: Practical guidelines for efficient CNN architecture design,&#x0201D;</article-title> in <source>Computer Vision - ECCV 2018. Lecture Notes in Computer Science, Vol. 11218</source>, eds <person-group person-group-type="editor"><name><surname>Ferrari</surname> <given-names>V.</given-names></name> <name><surname>Hebert</surname> <given-names>M.</given-names></name> <name><surname>Sminchisescu</surname> <given-names>C.</given-names></name> <name><surname>Weiss</surname> <given-names>Y.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>). <pub-id pub-id-type="doi">10.1007/978-3-030-01264-9_8</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pei</surname> <given-names>W.</given-names></name> <name><surname>Shi</surname> <given-names>Z.</given-names></name> <name><surname>Gong</surname> <given-names>K.</given-names></name></person-group> (<year>2023</year>). <article-title>Small target detection with remote sensing images based on an improved YOLOv5 algorithm</article-title>. <source>Front. Neurorobot.</source> <volume>16</volume>, <fpage>1074862</fpage>. <pub-id pub-id-type="doi">10.3389/fnbot.2022.1074862</pub-id><pub-id pub-id-type="pmid">36925626</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Prasetyo</surname> <given-names>E.</given-names></name> <name><surname>Suciati</surname> <given-names>N.</given-names></name> <name><surname>Fatichah</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>YOLOv4-tiny with wing convolution layer for detecting fish body part</article-title>. <source>Comput. Electron. Agric.</source> <volume>198</volume>, <fpage>107023</fpage>. <pub-id pub-id-type="doi">10.1016/j.compag.2022.107023</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Redmon</surname> <given-names>J.</given-names></name> <name><surname>Divvala</surname> <given-names>S.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;You only look once: unified, real-time object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>779</fpage>&#x02013;<lpage>788</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Redmon</surname> <given-names>J.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;YOLO9000: better, faster, stronger,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>7263</fpage>&#x02013;<lpage>7271</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Redmon</surname> <given-names>J.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>YOLOv3: An incremental improvement</article-title>. <source>arXiv [Preprint]</source>. arXiv: 1804.02767. Available online at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/pdf/1804.02767.pdf">https://arxiv.org/pdf/1804.02767.pdf</ext-link></citation>
</ref>
<ref id="B28">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Faster r-cnn: Towards real-time object detection with region proposal networks</article-title>. <source>arXiv [Preprint]</source>. arXiv: 1506.01497. Available online at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/pdf/1506.01497.pdf">https://arxiv.org/pdf/1506.01497.pdf</ext-link><pub-id pub-id-type="pmid">27295650</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sandler</surname> <given-names>M.</given-names></name> <name><surname>Howard</surname> <given-names>A.</given-names></name> <name><surname>Zhu</surname> <given-names>M.</given-names></name> <name><surname>Zhmoginov</surname> <given-names>A.</given-names></name> <name><surname>Chen</surname> <given-names>L. C.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Mobilenetv2: Inverted residuals and linear bottlenecks,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>4510</fpage>&#x02013;<lpage>4520</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>V.</given-names></name> <name><surname>Mir</surname> <given-names>R. N. A.</given-names></name></person-group> (<year>2020</year>). <article-title>Comprehensive and systematic look up into deep learning based object detection techniques: a review</article-title>. <source>Comput. Sci. Rev.</source> <volume>38</volume>, <fpage>100301</fpage>. <pub-id pub-id-type="doi">10.1016/j.cosrev.2020.100301</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>M.</given-names></name> <name><surname>Pang</surname> <given-names>R.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;EfficientDet: scalable and efficient object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>), <fpage>10781</fpage>&#x02013;<lpage>10790</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.01079</pub-id><pub-id pub-id-type="pmid">36146369</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C. Y.</given-names></name> <name><surname>Bochkovskiy</surname> <given-names>A.</given-names></name> <name><surname>Liao</surname> <given-names>H. Y. M.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Scaled-YOLOv4: scaling cross stage partial network,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>13029</fpage>&#x02013;<lpage>13038</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C. Y.</given-names></name> <name><surname>Liao</surname> <given-names>H. Y. M.</given-names></name> <name><surname>Wu</surname> <given-names>Y. H.</given-names></name> <name><surname>Chen</surname> <given-names>P. Y.</given-names></name> <name><surname>Hsieh</surname> <given-names>J. W.</given-names></name> <name><surname>Yeh</surname> <given-names>I. H.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;CSPNet: a new backbone that can enhance learning capability of CNN,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</source>, <fpage>390</fpage>&#x02013;<lpage>391</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>C. Y.</given-names></name> <name><surname>Liao</surname> <given-names>H. Y. M.</given-names></name> <name><surname>Yeh</surname> <given-names>I. H.</given-names></name></person-group> (<year>2022</year>). <source>Designing Network Design Strategies Through Gradient Path Analysis</source>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Zhou</surname> <given-names>K.</given-names></name> <name><surname>Chu</surname> <given-names>A.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). <article-title>An improved light-weight traffic sign recognition algorithm based on YOLOv4-tiny</article-title>. <source>IEEE Access</source> <volume>9</volume>, <fpage>124963</fpage>&#x02013;<lpage>124971</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3109798</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Qian</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>MTSDet: multi-scale traffic sign detection with attention and path aggregation</article-title>. <source>Appl. Intell.</source> <volume>53</volume>, <fpage>238</fpage>&#x02013;<lpage>250</lpage>. <pub-id pub-id-type="doi">10.1007/s10489-022-03459-7</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Woo</surname> <given-names>S.</given-names></name> <name><surname>Park</surname> <given-names>J.</given-names></name> <name><surname>Lee</surname> <given-names>J. Y.</given-names></name> <name><surname>Kweon</surname> <given-names>I. S.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;CBAM: Convolutional block attention module,&#x0201D;</article-title> in <source>Computer Vision - ECCV 2018. Lecture Notes in Computer Science, Vol. 11211</source>, eds <person-group person-group-type="editor"><name><surname>Ferrari</surname> <given-names>V.</given-names></name> <name><surname>Hebert</surname> <given-names>M.</given-names></name> <name><surname>Sminchisescu</surname> <given-names>C.</given-names></name> <name><surname>Weiss</surname> <given-names>Y.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>). <pub-id pub-id-type="doi">10.1007/978-3-030-01234-2_1</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Liao</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>Traffic sign detection based on SSD combined with receptive field module and path aggregation network</article-title>. <source>Comput. Intell. Neurosci.</source> <volume>2022</volume>, <fpage>4285436</fpage>. <pub-id pub-id-type="doi">10.1155/2022/4285436</pub-id><pub-id pub-id-type="pmid">35676967</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>T. T.</given-names></name> <name><surname>Tong</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>Real-time detection network for tiny traffic sign using multi-scale attention module</article-title>. <source>Sci. China Technol. Sci.</source> <volume>65</volume>, <fpage>396</fpage>&#x02013;<lpage>406</lpage>. <pub-id pub-id-type="doi">10.1007/s11431-021-1950-9</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>F.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Shelhamer</surname> <given-names>E.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Deep layer aggregation,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>Computer Vision and Pattern Recognition (CVPR)</publisher-name>), <fpage>2403</fpage>&#x02013;<lpage>2412</lpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Huang</surname> <given-names>M.</given-names></name> <name><surname>Jin</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name></person-group> (<year>2017</year>). <article-title>A real-time Chinese traffic sign detection algorithm based on modified YOLOv2</article-title>. <source>Algorithms</source> <volume>10</volume>, <fpage>127</fpage>. <pub-id pub-id-type="doi">10.3390/a10040127</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ding</surname> <given-names>S.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name></person-group> (<year>2022</year>). <article-title>Traffic sign detection algorithm based on improved attention mechanism</article-title>. <source>J. Comput. Appl.</source> <volume>42</volume>, <fpage>2378</fpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Zhou</surname> <given-names>X.</given-names></name> <name><surname>Lin</surname> <given-names>M.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Shufflenet: An extremely efficient convolutional neural network for mobile devices,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>6848</fpage>&#x02013;<lpage>6856</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00716</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Kr&#x000E4;henb&#x000FC;hl</surname> <given-names>P.</given-names></name></person-group> (<year>2019</year>). <source>Objects as points</source>.</citation>
</ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>Z.</given-names></name> <name><surname>Liang</surname> <given-names>D.</given-names></name> <name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Huang</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Hu</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;Traffic-sign detection and classification in the wild,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), 2110&#x02013;2118. <pub-id pub-id-type="doi">10.1109/CVPR.2016.232</pub-id><pub-id pub-id-type="pmid">37050452</pub-id></citation></ref>
</ref-list> 
</back>
</article> 