<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2022.1082346</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Siamese hierarchical feature fusion transformer for efficient tracking</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Dai</surname> <given-names>Jiahai</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2068111/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Fu</surname> <given-names>Yunhao</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Songxin</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Chang</surname> <given-names>Yuchun</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Electronic Information Engineering, College of Electronic Science and Engineering, Jilin University</institution>, <addr-line>Changchun</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Computer Science and Technology, College Science and Technology, Shanghai University of Finance and Economics</institution>, <addr-line>Shanghai</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Electronic Science and Technology, School of Microelectronics, Dalian University of Technology</institution>, <addr-line>Dalian</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Xin Ning, Institute of Semiconductors (CAS), China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Sahraoui Dhelim, University College Dublin, Ireland; Nadeem Javaid, COMSATS University, Islamabad Campus, Pakistan; Achyut Shankar, Amity University, India</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Yuchun Chang <email>cyc&#x00040;dlut.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>01</day>
<month>12</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>1082346</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>10</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>14</day>
<month>11</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Dai, Fu, Wang and Chang.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Dai, Fu, Wang and Chang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Object tracking is a fundamental task in computer vision. In recent years, most tracking algorithms have been based on deep networks. Trackers with deeper backbones are computationally expensive and can hardly meet real-time requirements on edge platforms. Lightweight networks are widely used to tackle this issue, but the features extracted by a lightweight backbone are inadequate for discriminating the object from the background in complex scenarios, especially for small object tracking tasks. In this paper, we adopt a lightweight backbone and extract features from multiple levels. A hierarchical feature fusion transformer (HFFT) was designed to mine the interdependencies of multi-level features in a novel model&#x02014;SiamHFFT. Consequently, our tracker can exploit comprehensive feature representations in an end-to-end manner, and the proposed model is capable of handling small-target tracking in complex scenarios on a CPU at a rate of 29 FPS. Comprehensive experimental results on the UAV123, UAV123&#x00040;10fps, LaSOT, VOT2020, and GOT-10k benchmarks demonstrate the effectiveness and efficiency of SiamHFFT. In particular, SiamHFFT achieves good performance in both accuracy and speed, which has practical implications for improving small object tracking in real-world settings.</p></abstract>
<kwd-group>
<kwd>visual tracking</kwd>
<kwd>hierarchical feature</kwd>
<kwd>transformer</kwd>
<kwd>lightweight backbone</kwd>
<kwd>real-time</kwd>
</kwd-group>
<contract-sponsor id="cn001">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor>
<counts>
<fig-count count="8"/>
<table-count count="5"/>
<equation-count count="20"/>
<ref-count count="70"/>
<page-count count="15"/>
<word-count count="9138"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Visual tracking is an important task in computer vision that provides underlying technical support for more complex tasks and is an essential procedure in advanced computer vision applications. It has been widely used in various fields such as unmanned aerial vehicles (UAVs) (Cao et al., <xref ref-type="bibr" rid="B4">2021</xref>), autonomous driving (Zhang and Processing, <xref ref-type="bibr" rid="B66">2021</xref>), and video surveillance (Zhang G. et al., <xref ref-type="bibr" rid="B64">2021</xref>). However, several challenges still hamper tracking performance, including the limited computing power of edge devices and difficult external environments with occlusion, illumination variation, and background clutter.</p>
<p>Over the past few years, visual object tracking has made significant advances on the basis of convolutional neural networks, driven by breakthroughs in more powerful backbones, such as deeper networks (He et al., <xref ref-type="bibr" rid="B23">2016</xref>; Chen B. et al., <xref ref-type="bibr" rid="B6">2022</xref>), efficient network structures (Howard et al., <xref ref-type="bibr" rid="B26">2017</xref>), and attention mechanisms (Hu et al., <xref ref-type="bibr" rid="B27">2018</xref>). Inspired by the way the human brain processes overloaded information (Wolfe and Horowitz, <xref ref-type="bibr" rid="B57">2004</xref>), attention mechanisms are used to enhance the vital features of the input and suppress unnecessary information. Owing to this powerful feature representation ability, attention has become an important means of enhancing input features, in forms such as channel attention (Hu et al., <xref ref-type="bibr" rid="B27">2018</xref>), spatial attention (Wang F. et al., <xref ref-type="bibr" rid="B54">2017</xref>; Wang N. et al., <xref ref-type="bibr" rid="B55">2018</xref>), temporal attention (Hou et al., <xref ref-type="bibr" rid="B25">2020</xref>), global attention (Zhang et al., <xref ref-type="bibr" rid="B67">2020a</xref>), and self-attention (Wang et al., <xref ref-type="bibr" rid="B56">2018</xref>). Among self-attention-based models, the transformer was initially designed for natural language processing (NLP) (Vaswani et al., <xref ref-type="bibr" rid="B53">2017</xref>), where the attention mechanism was used to perform machine translation and achieved great improvements. Later, the pre-trained model BERT (Devlin et al., <xref ref-type="bibr" rid="B16">2018</xref>) achieved breakthrough progress on NLP tasks, further advancing the development of the transformer. 
Since then, both academia and industry have seen a boom in the research and application of transformer-based pre-trained models, which have gradually extended from NLP to computer vision. For example, Vision Transformer (ViT) (Dosovitskiy et al., <xref ref-type="bibr" rid="B17">2020</xref>) and DETR (Carion et al., <xref ref-type="bibr" rid="B5">2020</xref>) have surpassed the previous state of the art in image classification and object detection, respectively. Numerous variants of the transformer structure have been proposed, benchmark results in various fields have been continuously refreshed, and the deep learning community has entered a new era. Meanwhile, multi-level feature fusion can effectively alleviate the deficiency of the transformer in tracking small objects.</p>
<p>Although transformer models enhance feature representation and thereby improve accuracy and robustness, transformer-based trackers have high computational costs that prevent them from meeting the real-time demands of tracking on edge hardware, hindering practical deployment. Therefore, balancing the efficiency and efficacy of object trackers remains a significant challenge. Generally, discriminative feature representation is essential for tracking, so deeper backbones and online updaters are utilized in tracking frameworks; however, these methods are computationally expensive, increasing run time and cost. Conversely, a lightweight backbone typically provides inadequate feature extraction, rendering the tracking model less robust for small objects or complex scenarios.</p>
<p>In this study, we employed a lightweight backbone network to avoid the efficiency loss caused by the computations of deep networks. To address the insufficient feature representations extracted by shallow networks, we extracted features from multiple levels of the backbone to enrich the feature representations. Furthermore, to leverage the advantages of transformers in global relationship modeling, we designed a hierarchical feature fusion module that integrates multi-level features comprehensively using multi-head attention mechanisms. The proposed Siamese hierarchical feature fusion transformer (SiamHFFT) tracker achieves robust performance in complex scenarios while maintaining real-time tracking speed on a CPU, so it can be deployed on consumer hardware. The main contributions of this study can be summarized as follows:</p>
<list list-type="simple">
<list-item><p>(1) We propose a novel tracking network based on a Siamese architecture, which consists of feature extraction, reshaping, transformer-like feature fusion, and prediction head modules.</p></list-item>
<list-item><p>(2) We design a feature fusion transformer that exploits hierarchical features in the Siamese tracking framework in an end-to-end manner and improves discriminability for small object tracking tasks.</p></list-item>
<list-item><p>(3) Comprehensive evaluations on five challenging benchmarks demonstrate that the proposed tracker achieves promising results among state-of-the-art trackers. Moreover, our tracker runs at real-time speed, so this efficient method can be deployed on resource-limited platforms.</p></list-item>
</list>
<p>The remainder of this paper is organized as follows. Section Related work describes related work on tracking networks and transformers. Section Method introduces the methodology used for implementing the proposed HFFT and network model. Section Experiments presents the results of experiments conducted to verify the proposed model. Finally, Section Conclusion contains our concluding remarks.</p>
</sec>
<sec id="s2">
<title>Related work</title>
<sec>
<title>Siamese tracking</title>
<p>In recent years, Siamese networks have become a ubiquitous framework in the visual tracking field (Javed et al., <xref ref-type="bibr" rid="B29">2021</xref>). Tracking an arbitrary object can be formulated as the problem of learning a similarity measure function. SiamFC (Bertinetto et al., <xref ref-type="bibr" rid="B2">2016</xref>) introduced a correlation layer as a fusion tensor into the tracking framework for the first time, pioneering the Siamese tracking procedure. Instead of directly estimating the target position from the response map, SiamRPN (Li B. et al., <xref ref-type="bibr" rid="B32">2018</xref>) attaches a region proposal network (RPN) to the Siamese network and formulates tracking as a one-shot detection task. Based on the results of its classification and regression branches, SiamRPN achieves enhanced tracking accuracy. DaSiamRPN (Zhu et al., <xref ref-type="bibr" rid="B70">2018</xref>) uses a distractor-aware module to solve the problem of inaccurate tracking caused by the imbalance between positive and negative samples in the training set. C-RPN (Fan and Ling, <xref ref-type="bibr" rid="B19">2019</xref>) and Cract (Fan and Ling, <xref ref-type="bibr" rid="B20">2020</xref>) incorporate multiple stages into the Siamese tracking architecture to improve tracking accuracy. To address the unreliable fixed-ratio bounding boxes predicted when a tracker drifts rapidly, anchor-free mechanisms were also introduced into the tracking task. To rectify the inaccurate bounding box estimation of anchor-based mechanisms, Ocean (Zhang et al., <xref ref-type="bibr" rid="B68">2020b</xref>) directly regresses the location of each point inside the ground truth. SiamBAN (Chen et al., <xref ref-type="bibr" rid="B12">2020</xref>) adopts box-adaptive heads to handle classification and regression in parallel. 
SiamFC&#x0002B;&#x0002B; (Xu et al., <xref ref-type="bibr" rid="B58">2020</xref>) and SiamCAR (Guo et al., <xref ref-type="bibr" rid="B21">2020</xref>) draw on the FCOS architecture and add a branch that measures the accuracy of the classification results. Compared with anchor-based trackers, anchor-free trackers use fewer parameters and need no prior information about the bounding box, so they can achieve real-time speed.</p>
<p>As feature representation plays a vital role in the tracking process (Marvasti-Zadeh et al., <xref ref-type="bibr" rid="B43">2021</xref>), several works are dedicated to obtaining discriminative features from different perspectives, such as adopting deeper or wider backbones or using attention mechanisms to advance feature representation. In the last 3 years, the transformer has proven capable of using global context information and preserving more semantic information, and its introduction into the tracking community has boosted tracking accuracy to a great extent (Chen X. et al., <xref ref-type="bibr" rid="B11">2021</xref>; Lin et al., <xref ref-type="bibr" rid="B36">2021</xref>; Liu et al., <xref ref-type="bibr" rid="B41">2021</xref>; Chen et al., <xref ref-type="bibr" rid="B9">2022b</xref>; Mayer et al., <xref ref-type="bibr" rid="B44">2022</xref>). However, the accuracy gains of these increasingly complex models rely heavily on powerful GPUs, making it impossible to deploy them on edge devices and hindering their further practical application.</p>
<p>In this study, to optimize the trade-off between tracking accuracy and speed, we designed an efficient algorithm with a concise model consisting of a lightweight backbone network, a feature reshaping module, a feature fusion module, and a prediction head. Our model is capable of handling complex scenarios, and the proposed tracker achieves real-time speed on a CPU.</p>
</sec>
<sec>
<title>Transformer in vision tasks</title>
<p>As a new type of neural network, the transformer shows superior performance in the field of AI applications (Han et al., <xref ref-type="bibr" rid="B22">2022</xref>). Unlike CNNs and RNNs, the transformer adopts the self-attention mechanism, which has been proven to have strong feature representation ability and better parallel computing capability, making it more advantageous in several tasks.</p>
<p>The transformer model was first proposed by Vaswani et al. (<xref ref-type="bibr" rid="B53">2017</xref>) for natural language processing (NLP) tasks. In contrast to convolutional neural networks (CNNs) and recurrent neural networks (RNNs), self-attention facilitates both parallel computation and short maximum path lengths. Unlike earlier self-attention models that used RNNs for input representations (Lin Z. et al., <xref ref-type="bibr" rid="B39">2017</xref>; Paulus et al., <xref ref-type="bibr" rid="B50">2017</xref>), the attention mechanisms in the transformer model are implemented with attention-based encoders and decoders instead of convolutional or recurrent layers.</p>
<p>Because transformers were originally designed for sequence-to-sequence learning on textual data and exhibited good performance there, their ability to integrate global information has gradually been unveiled, and they have been extended to other modern deep learning applications such as image classification (Liu et al., <xref ref-type="bibr" rid="B40">2020</xref>; Chen C. -F. R. et al., <xref ref-type="bibr" rid="B7">2021</xref>; He et al., <xref ref-type="bibr" rid="B24">2021</xref>), reinforcement learning (Parisotto et al., <xref ref-type="bibr" rid="B49">2020</xref>; Chen L. et al., <xref ref-type="bibr" rid="B8">2021</xref>), face alignment (Ning et al., <xref ref-type="bibr" rid="B48">2020</xref>), object detection (Beal et al., <xref ref-type="bibr" rid="B1">2020</xref>; Carion et al., <xref ref-type="bibr" rid="B5">2020</xref>), image recognition (Dosovitskiy et al., <xref ref-type="bibr" rid="B17">2020</xref>) and object tracking (Yan et al., <xref ref-type="bibr" rid="B61">2019</xref>, <xref ref-type="bibr" rid="B59">2021a</xref>; Cao et al., <xref ref-type="bibr" rid="B4">2021</xref>; Lin et al., <xref ref-type="bibr" rid="B36">2021</xref>; Zhang J. et al., <xref ref-type="bibr" rid="B65">2021</xref>; Chen B. et al., <xref ref-type="bibr" rid="B6">2022</xref>; Chen et al., <xref ref-type="bibr" rid="B9">2022b</xref>; Mayer et al., <xref ref-type="bibr" rid="B44">2022</xref>). Building on CNNs and transformers, DETR (Carion et al., <xref ref-type="bibr" rid="B5">2020</xref>) applies a transformer to object detection tasks. To improve upon previous CNN models, DETR eliminates post-processing steps that rely on manual priors, such as non-maximum suppression (NMS) and anchor generators, and constructs a complete end-to-end detection framework. 
ViT (Dosovitskiy et al., <xref ref-type="bibr" rid="B17">2020</xref>) converts images into serialized data through token processing and introduces the concept of patches: input images are divided into smaller patches, and each patch is fed into a BERT-like (bidirectional encoder representations from transformers) structure. Similar to the patches in ViT, the Swin Transformer (Liu et al., <xref ref-type="bibr" rid="B41">2021</xref>) uses the concept of windows; because the computations of different windows do not interfere with each other, the computational complexity of the Swin Transformer is significantly reduced.</p>
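The patch mechanism described above can be illustrated with a minimal sketch; the patch size and image size here are arbitrary toy values, not ViT's actual configuration, and the linear embedding step is omitted.

```python
import numpy as np

def patchify(img, p):
    """Split an image (H, W, C) into non-overlapping p x p patches and
    flatten each patch into one token vector, as in ViT."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "image must divide evenly into patches"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    return (img.reshape(H // p, p, W // p, p, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, p * p * C))

img = np.zeros((32, 32, 3))      # toy 32x32 RGB image
tokens = patchify(img, 8)        # 4x4 = 16 patches, each 8*8*3 = 192 values
print(tokens.shape)              # (16, 192)
```

Each row of `tokens` would then be linearly projected to the model dimension and prepended with a class token before entering the transformer encoder.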
<p>In the tracking community, transformers have achieved remarkable performance. STARK (Yan et al., <xref ref-type="bibr" rid="B59">2021a</xref>) utilizes an end-to-end transformer tracking architecture based on spatiotemporal information. SwinTrack (Lin et al., <xref ref-type="bibr" rid="B36">2021</xref>) incorporates a general position-encoding solution for feature extraction and feature fusion, enabling full interaction between the target object and the search region during tracking. TrTr (Zhao et al., <xref ref-type="bibr" rid="B69">2021</xref>) uses the transformer architecture to perform target classification and bounding box regression, and designs a plug-in online update module for classification to further improve tracking performance. DTT (Yu et al., <xref ref-type="bibr" rid="B62">2021</xref>) likewise adopts a transformer architecture to predict the location and bounding box of the target. Cao et al. (<xref ref-type="bibr" rid="B4">2021</xref>) proposed an efficient and effective hierarchical feature transformer (HiFT) for aerial tracking. HCAT (Chen et al., <xref ref-type="bibr" rid="B9">2022b</xref>) utilizes a novel feature sparsification module to reduce computational complexity and a hierarchical cross-attention transformer with a full cross-attention structure to improve efficiency and enhance representation ability. As hierarchical methods, both HiFT and HCAT show good tracking performance. However, transformer-based trackers lack robustness on small objects. In this paper, we propose a novel hierarchical feature fusion module based on a transformer that enables the tracker to achieve real-time speed while maintaining good accuracy.</p>
</sec>
<sec>
<title>Feature aggregation network</title>
<p>Feature aggregation plays a vital role in processing multi-level features; it improves cross-scale feature interaction and multi-scale feature fusion, thereby enhancing feature representation and network performance. Zhang G. et al. (<xref ref-type="bibr" rid="B64">2021</xref>) proposed a hierarchical aggregation transformer (HAT) framework consisting of transformer-based feature calibration (TFC) and deeply supervised aggregation (DSA) modules. The TFC module merges and preserves semantic and detail information at multiple levels, and the DSA module aggregates the hierarchical features of the backbone with multi-granularity supervision. Feature pyramid networks (FPN) (Lin T.-Y. et al., <xref ref-type="bibr" rid="B37">2017</xref>) introduce cross-scale feature interactions and achieve good results through the fusion of multiple layers. Qingyun et al. (<xref ref-type="bibr" rid="B51">2021</xref>) introduced a cross-modality fusion transformer, which makes full use of the complementarity between different modalities to improve feature performance. However, the main challenge for a simple feature fusion strategy is how to fuse high-level semantic information with low-level detailed features. To address this issue, we propose an aggregation structure based on hierarchical transformers, which fully mines the coherence among multi-level features at different scales and achieves discriminative feature representation.</p>
</sec>
</sec>
<sec id="s3">
<title>Method</title>
<sec>
<title>Overview</title>
<p>In this section, we describe the proposed SiamHFFT model. As can be seen in <xref ref-type="fig" rid="F1">Figure 1</xref>, our model follows a Siamese tracking framework. There are four key components in our model, namely the feature extraction module, reshaping module, feature fusion module, and prediction head. During tracking, the feature extraction module extracts features from the template and search region. The features of the two branches from the last three layers of the backbone are correlated separately, and the outputs are denoted as <italic>M</italic><sub>2</sub>, <italic>M</italic><sub>3</sub>, and <italic>M</italic><sub>4</sub> in order. We then feed the correlated features into the reshaping module, which transforms the channel dimensions of the backbone features and flattens the features in the spatial dimension. The feature fusion module fuses features using our hierarchical feature fusion transformer (HFFT) and a self-attention module. Finally, the prediction head module performs bounding box regression and binary classification on the enhanced features to generate tracking results.</p>
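The data flow just described can be summarized in a short structural sketch. Every function body below is a placeholder stub of our own that merely tags its input; only the order of operations between the four components is taken from the text, not any actual computation.

```python
# Placeholder stubs: each one tags its input so the data flow can be traced.
def backbone(img, stage):  return ("feat", stage, img)
def xcorr(z, x):           return ("corr", z, x)
def reshape_mod(m):        return ("reshaped", m)
def hfft(levels):          return ("hfft", tuple(levels))
def self_attention(t):     return ("sam", t)
def cls_head(t):           return ("cls", t)
def reg_head(t):           return ("reg", t)

def forward(template, search):
    # 1. Shared backbone: hierarchical features from the last three stages.
    Fz = [backbone(template, s) for s in (2, 3, 4)]
    Fx = [backbone(search, s) for s in (2, 3, 4)]
    # 2. Per-stage correlation (M2, M3, M4), then the reshaping module.
    M = [reshape_mod(xcorr(z, x)) for z, x in zip(Fz, Fx)]
    # 3. Feature fusion: HFFT over the three levels, then self-attention.
    fused = self_attention(hfft(M))
    # 4. Prediction head: binary classification and bounding-box regression.
    return cls_head(fused), reg_head(fused)

cls, reg = forward("Z", "X")
```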
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Architecture of the proposed SiamHFFT tracking framework. This framework contains four fundamental components: a feature extraction network, reshaping module, feature fusion module, and prediction head. The backbone network is used to extract hierarchical features. The reshaping module is designed to perform convolution operations and flatten features. The feature fusion transformer consists of the proposed HFFT module and a self-attention module (SAM). Finally, bounding boxes are estimated based on the regression and classification results.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082346-g0001.tif"/>
</fig>
</sec>
<sec>
<title>Feature extraction and reshaping</title>
<p>Similar to most Siamese tracking networks, the proposed method takes a template patch (<italic>Z</italic> &#x02208; &#x0211D;<sup>3&#x000D7;80&#x000D7;80</sup>) and a search patch (<italic>X</italic> &#x02208; &#x0211D;<sup>3&#x000D7;320&#x000D7;320</sup>) as inputs. The backbone can be an arbitrary deep CNN such as ResNet, MobileNet (Sandler et al., <xref ref-type="bibr" rid="B52">2018</xref>), AlexNet, or ShuffleNet V2 (Ma et al., <xref ref-type="bibr" rid="B42">2018</xref>). In this study, because a deeper network is unsuitable for deployment with limited computing resources, we adopted ShuffleNetV2 as the backbone network. This network is shared by the template and search branches for feature extraction.</p>
<p>To obtain robust and discriminative feature representations, we incorporate detailed structural information into our visual representations by extracting hierarchical features with different scales and semantic information from stages two, three, and four of the backbone. We denote features from the template branch as <italic>F</italic><sub><italic>i</italic></sub>(<italic>Z</italic>) and those from the search branch as <italic>F</italic><sub><italic>i</italic></sub>(<italic>X</italic>), where <italic>i</italic> represents the stage number of feature extraction and <italic>i</italic> &#x02208; {2, 3, 4}.</p>
<p>Next, a cross-correlation operation is performed on the feature maps from the multiple stages, which is defined as:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002A;</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>,</mml:mo><mml:mn>4</mml:mn><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, and <italic>C</italic>, <italic>H</italic>, and <italic>W</italic> denote the channel, height, and width of the feature map, respectively. Additionally, <italic>C</italic><sub><italic>i</italic></sub> &#x02208; {116, 232, 464}, and &#x0002A; denotes the cross-correlation operator. Next, we use the reshaping module, which consists of 1 &#x000D7; 1 convolutional kernels, to unify the channel dimensions of the features from Equation (1), and then flatten the features in the spatial dimension; a unified channel dimension not only effectively reduces computing resource requirements but is also essential for effective feature fusion. 
After these operations, we obtain a reshaped feature map <inline-formula><mml:math id="M3"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, where <italic>C</italic> &#x0003D; 192.</p>
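Equation (1) and the reshaping module can be sketched as follows. The feature sizes are illustrative toy values, the per-channel (depthwise) form of the correlation is an assumption consistent with the stated shape of <italic>M</italic><sub><italic>i</italic></sub>, and a real implementation would use optimized convolution routines rather than explicit loops.

```python
import numpy as np

def depthwise_xcorr(z, x):
    """Per-channel cross-correlation (the * in Equation 1): the template
    feature z (C, hz, wz) is slid over the search feature x (C, hx, wx)."""
    C, hz, wz = z.shape
    _, hx, wx = x.shape
    H, W = hx - hz + 1, wx - wz + 1
    out = np.zeros((C, H, W))
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(x[c, i:i + hz, j:j + wz] * z[c])
    return out

def reshape_module(m, w_proj):
    """1x1 convolution (a per-pixel linear map) to unify channels to C = 192,
    then flatten spatially to shape (H*W, C)."""
    C_in, H, W = m.shape
    flat = m.reshape(C_in, H * W).T   # (H*W, C_in)
    return flat @ w_proj              # (H*W, 192)

# Illustrative sizes only: a stage-2-like feature pair with C2 = 116 channels.
rng = np.random.default_rng(0)
Fz = rng.normal(size=(116, 5, 5))
Fx = rng.normal(size=(116, 12, 12))
M2 = depthwise_xcorr(Fz, Fx)                        # (116, 8, 8)
M2_reshaped = reshape_module(M2, rng.normal(size=(116, 192)))
print(M2.shape, M2_reshaped.shape)                  # (116, 8, 8) (64, 192)
```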
</sec>
<sec>
<title>Feature fusion and prediction head</title>
<p>As illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>, following the convolution and flattening operations in the reshaping module, the correlation features from different stages are unified in the channel dimension. To explore the interdependencies among multi-level features fully, we designed the HFFT, which is detailed in this section.</p>
<p><bold>Multi-Head Attention (Vaswani et al.</bold>, <xref ref-type="bibr" rid="B53"><bold>2017</bold></xref><bold>):</bold> Generally, transformers have been successfully applied to enhance feature representations in various bi-modal vision tasks. In the proposed feature fusion module, the attention mechanism is also a fundamental component. It is implemented using an attention function that operates on queries <italic>Q</italic>, keys <italic>K</italic>, and values <italic>V</italic> via scaled dot-product attention, which is defined as:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>Q</mml:mi><mml:msup><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x022A4;</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>V</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>C</italic> is the key dimensionality, and <inline-formula><mml:math id="M5"><mml:msqrt><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula> is a scaling factor that keeps the softmax from saturating and producing vanishing gradients. Specifically, <inline-formula><mml:math id="M6"><mml:mi>Q</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the <italic>q</italic> input in <xref ref-type="fig" rid="F2">Figure 2B</xref>, which denotes a collection of <italic>N</italic> features; similarly, <italic>K</italic> and <italic>V</italic> are the <italic>k</italic> and <italic>v</italic> inputs, respectively, which represent a collection of <italic>M</italic> features (i.e., <italic>K, V</italic> &#x02208; &#x0211D;<sup><italic>M</italic>&#x000D7;<italic>C</italic></sup>). Notably, <italic>Q, K</italic>, and <italic>V</italic> are mathematical abstractions of the attention inputs and have no physical meaning by themselves.</p>
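For concreteness, Equation (2) can be sketched in PyTorch as follows; this is an illustrative, single-head, unbatched sketch, and the sizes N = 4, M = 6, and C = 8 are assumptions for the example rather than values used in the paper.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention of Equation (2).

    q: (N, C) queries; k, v: (M, C) keys and values. The 1/sqrt(C)
    scaling keeps the softmax logits in a range where gradients
    do not vanish.
    """
    c = q.size(-1)
    scores = q @ k.transpose(-2, -1) / c ** 0.5   # (N, M) similarity logits
    return F.softmax(scores, dim=-1) @ v          # (N, C) attended values

# N = 4 query features, M = 6 key/value features, C = 8 channels
q, k, v = torch.randn(4, 8), torch.randn(6, 8), torch.randn(6, 8)
out = attention(q, k, v)
```

Each output row is a convex combination of the value rows, weighted by the softmax-normalized similarities.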
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p><bold>(A)</bold> Structure of a dual-input task; <bold>(B)</bold> structure of a multi-input task. Unlike the original dual-input tasks, multi-input tasks can be used to learn the interdependencies of multi-level features and enhance the feature representation of the model in an end-to-end manner.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082346-g0002.tif"/>
</fig>
<p>According to Vaswani et al. (<xref ref-type="bibr" rid="B53">2017</xref>), extending the attention function in Equation (2) to multiple heads is beneficial for enabling the mechanism to learn various attention distributions and enhancing its feature representation ability. This extension can be formulated as follows:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>u</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>H</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E4"><label>(4)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mi>h</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M9"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula><mml:math id="M10"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M11"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msup></mml:math></inline-formula>, and <italic>W</italic><sup><italic>o</italic></sup> &#x02208; &#x0211D;<sup><italic>C</italic>&#x000D7;<italic>C</italic></sup>. Here, <italic>h</italic> is the number of attention heads, which is defined as <inline-formula><mml:math id="M12"><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>. In this study, we adopted and <italic>h</italic> &#x0003D; 6 as default values.</p>
<p><bold>Application to Dual-Input Tasks:</bold> The structure of a dual-input task is presented in <xref ref-type="fig" rid="F2">Figure 2A</xref>, where <italic>Q, K</italic>, and <italic>V</italic> for normal NLP/vision tasks (Nguyen et al., <xref ref-type="bibr" rid="B47">2020</xref>) share the same modality. In recent years, this mechanism has been extended to dual inputs and applied to vision tasks (Chen X. et al., <xref ref-type="bibr" rid="B11">2021</xref>; Chen et al., <xref ref-type="bibr" rid="B10">2022a</xref>,<xref ref-type="bibr" rid="B9">b</xref>). However, the original attention mechanism cannot distinguish the positional information of different input feature sequences; it considers only absolute positions and adds absolute positional encodings to the inputs. The attention from a source feature &#x003D5; to a target feature &#x003B8; is then defined as:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>u</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>H</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x003D5;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>P</italic><sub>&#x003B8;</sub> and <italic>P</italic><sub>&#x003D5;</sub> are the spatial positional encodings of features &#x003B8; and &#x003D5;, respectively. Spatial positional encoding is generated using a sine function. Equation (5) can be used not only as a single-direction attention enhancement, but also as a co-attention mechanism in which both directions are considered. Furthermore, self-attention from a feature to itself is also defined as a special case:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>M</mml:mi><mml:mi>u</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>H</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2A</xref>, following Equations (5) and (6), the designed transformer blocks are processed independently. Therefore, the two modules can be used sequentially or in parallel. Additionally, a multilayer perceptron (MLP) module is used to enhance the fitting ability of the model. The MLP module is a fully connected network consisting of two linear projections with a Gaussian error linear unit (GELU) activation function between them, which can be denoted as:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M15"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>M</mml:mi><mml:mi>L</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>F</mml:mi><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>G</mml:mi><mml:mi>E</mml:mi><mml:mi>L</mml:mi><mml:mi>U</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>F</mml:mi><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
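The MLP module of Equation (7) can be sketched as below; the hidden width of the first projection is an assumption for illustration, since the text does not state it.

```python
import torch

class MLP(torch.nn.Module):
    """Equation (7): two linear projections with GELU in between."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, hidden_dim)
        self.fc2 = torch.nn.Linear(hidden_dim, dim)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

mlp = MLP(dim=48, hidden_dim=192)   # dim and hidden_dim are illustrative
y = mlp(torch.randn(4, 48))         # output keeps the input feature size
```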
<p><bold>Application to Multi-Input Tasks</bold>: To extend the attention mechanism to multiple inputs, enabling it to handle multimodal vision tasks, pyramid structures, and similar settings, we denote the total number of inputs as <italic>S</italic>. The structure of a multi-input task is presented in <xref ref-type="fig" rid="F2">Figure 2B</xref>. If we consider each possibility, there are a total of <italic>S</italic>(<italic>S</italic> &#x02212; 1) source-target cases and <italic>S</italic> self-attention cases. We denote the multiple inputs as {&#x003B8;, &#x003D5;<sub>1</sub>, &#x02026;, &#x003D5;<sub><italic>S</italic>&#x02212;1</sub>}, where the target &#x003B8; &#x02208; &#x0211D;<sup><italic>N</italic>&#x000D7;<italic>C</italic></sup> and source <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Notably, &#x003B8; and &#x003D5;<sub><italic>i</italic></sub> must share the same channel dimension <italic>C</italic>. We then compute all the source-target cases as {<italic>A</italic><sub>&#x003D5;<sub>1</sub></sub>(&#x003B8;), &#x02026;, <italic>A</italic><sub>&#x003D5;<sub><italic>S</italic>&#x02212;1</sub></sub>(&#x003B8;)}. Next, we concatenate all source-to-target attention cases with the self-attention <italic>A</italic><sub>&#x003B8;</sub>(&#x003B8;), which can be formulated as:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M17"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M18"><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>S</mml:mi><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. After concatenation, the channel dimension of the enhanced features becomes <italic>SC</italic>, i.e., <italic>S</italic> times that of the original features. To keep subsequent calculations efficient, we apply a fully connected layer to reduce the channel dimension:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M19"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M20"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Through this process, we can obtain more discriminative features efficiently by aggregating features from different attention mechanisms.</p>
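Equations (8) and (9) amount to a channel-wise concatenation followed by a linear projection back to the original width. A minimal sketch, where S = 3 and the tensor sizes are illustrative assumptions:

```python
import torch

N, C, S = 4, 48, 3   # illustrative sizes
# one self-attention output plus S-1 cross-attention outputs, all (N, C)
attended = [torch.randn(N, C) for _ in range(S)]

theta_concat = torch.cat(attended, dim=-1)   # Equation (8): (N, S*C)
linear = torch.nn.Linear(S * C, C)           # Equation (9): reduce channels
theta_reduced = linear(theta_concat)         # back to (N, C)
```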
<p><bold>HFFT</bold>: As is shown in <xref ref-type="fig" rid="F2">Figure 2B</xref>, in our model, we make full use of the hierarchical features <inline-formula><mml:math id="M21"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> (<italic>i</italic> &#x02208; {2, 3, 4}) and generate tracking-tailored features. To integrate low-level spatial information with high-level semantic information, we feed the reshaped features from the output of Equation (1), namely <inline-formula><mml:math id="M22"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula><mml:math id="M23"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, and <inline-formula><mml:math id="M24"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, into the HFFT module, where <inline-formula><mml:math 
id="M25"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is used as the target feature, and <inline-formula><mml:math id="M26"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula><mml:math id="M27"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> serve as the source features. The importance of different aspects of the feature information is weighted by applying the cross-attention operator to <inline-formula><mml:math id="M28"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and <inline-formula><mml:math id="M29"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, which is beneficial for obtaining more discriminative features. We apply self-attention to <inline-formula><mml:math id="M30"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, which preserves the details of the target information during tracking. Furthermore, positional information is encoded during the calculation to enhance spatial awareness during tracking. 
The attention mechanisms are implemented through operations on <italic>Q, K</italic>, and <italic>V</italic>. Comprehensive features can then be obtained by concatenating the outputs. Because the complexity of a model increases with its input size, a fully connected layer is utilized to resize the outputs. We also adopt residual connections around each sub-layer. Additionally, we use an MLP module to enhance the fitting ability of the model, and layer normalization (LN) is performed before the MLP and final output steps. The entire process of the HFFT can be expressed as:</p>
<disp-formula id="E10"><mml:math id="M31"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>4</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E11"><mml:math id="M32"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E12"><mml:math id="M33"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E13"><label>(10)</label><mml:math id="M34"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>M</mml:mi><mml:mi>L</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
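The full HFFT computation of Equation (10) can be sketched as the module below. This is a simplified illustration rather than the authors' implementation: the positional encodings are omitted, and the channel width, sequence lengths, and MLP hidden width are assumptions (only h = 6 heads comes from the text).

```python
import torch
from torch import nn

class HFFT(nn.Module):
    """Sketch of Equation (10): fuse hierarchical features with M3'
    as the target and M2', M4' as sources."""
    def __init__(self, dim=48, heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross4 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reduce = nn.Linear(3 * dim, dim)   # resize after concatenation
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, m2, m3, m4):
        a_self, _ = self.self_attn(m3, m3, m3)   # A_{M3'}(M3')
        a2, _ = self.cross2(m3, m2, m2)          # A_{M2'}(M3')
        a4, _ = self.cross4(m3, m4, m4)          # A_{M4'}(M3')
        concat = torch.cat([a_self, a2, a4], dim=-1)    # M_concat
        m_out = self.norm1(self.reduce(concat) + m3)    # residual + LN
        return self.norm2(m_out + self.mlp(m_out))      # X_out

hfft = HFFT()
m2, m3, m4 = (torch.randn(1, 16, 48) for _ in range(3))
x_out = hfft(m2, m3, m4)
```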
<p><bold>SAM</bold>: The SAM is a feature enhancement module, whose structure is presented in <xref ref-type="fig" rid="F3">Figure 3</xref>. The SAM adaptively integrates information from different feature maps using multi-head self-attention in residual form. In the proposed model, the SAM takes the output of Equation (10), <italic>X</italic><sub><italic>out</italic></sub>, as input. The mathematical process of the SAM can be summarized as:</p>
<disp-formula id="E14"><mml:math id="M35"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>M</mml:mi><mml:mi>u</mml:mi><mml:mi>l</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>H</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>X</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>X</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E15"><label>(11)</label><mml:math id="M36"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mi>A</mml:mi><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>M</mml:mi><mml:mi>L</mml:mi><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
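Equation (11) can likewise be sketched as below; the sine positional encoding P_X is replaced here by a random stand-in tensor, and the layer sizes are illustrative assumptions.

```python
import torch
from torch import nn

class SAM(nn.Module):
    """Sketch of Equation (11): residual multi-head self-attention with
    a positional encoding P_X added to the queries and keys."""
    def __init__(self, dim=48, heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, pos):
        a, _ = self.attn(x + pos, x + pos, x)   # q and k carry P_X
        x2 = self.norm1(a + x)                  # X_out2
        return self.norm2(self.mlp(x2) + x2)    # X_SAM

sam = SAM()
x = torch.randn(1, 16, 48)
pos = torch.randn(1, 16, 48)   # stand-in for the sine encoding
x_sam = sam(x, pos)
```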
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Architecture of the proposed SAM.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082346-g0003.tif"/>
</fig>
<p><bold>Prediction Head</bold>: The enhanced features are reshaped back to the original feature size before being fed into the prediction head. The head network consists of two branches: a classification branch and a bounding box regression branch, each implemented as a three-layer perceptron. The former distinguishes the target from the background, and the latter estimates the location of the target as a bounding box. Overall, the model is trained using a combined loss function formulated as:</p>
<disp-formula id="E16"><label>(12)</label><mml:math id="M37"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>L</italic><sub><italic>cls</italic></sub>, <italic>L</italic><sub><italic>giou</italic></sub>, and <italic>L</italic><sub><italic>loc</italic></sub> represent the binary cross-entropy loss, GIoU loss, and L1-norm loss, respectively. &#x003BB;<sub><italic>cls</italic></sub>, &#x003BB;<sub><italic>giou</italic></sub>, and &#x003BB;<sub><italic>loc</italic></sub> are coefficients that balance the contributions of the respective loss terms.</p>
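A sketch of the combined loss in Equation (12), using the coefficient values reported in the implementation details (&#x003BB;<sub>cls</sub> = 2, &#x003BB;<sub>giou</sub> = 2, &#x003BB;<sub>loc</sub> = 5); the minimal GIoU implementation is our own illustration and assumes (x1, y1, x2, y2) box coordinates.

```python
import torch
import torch.nn.functional as F

def giou_loss(pred, target):
    """Mean (1 - GIoU) for (x1, y1, x2, y2) boxes; minimal illustration."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    # smallest box enclosing both pred and target
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = inter / union - (enclose - union) / enclose
    return (1 - giou).mean()

def total_loss(cls_logits, cls_labels, boxes_pred, boxes_gt,
               l_cls=2.0, l_giou=2.0, l_loc=5.0):
    """Equation (12) with the coefficients reported in the paper."""
    return (l_cls * F.binary_cross_entropy_with_logits(cls_logits, cls_labels)
            + l_giou * giou_loss(boxes_pred, boxes_gt)
            + l_loc * F.l1_loss(boxes_pred, boxes_gt))
```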
</sec>
</sec>
<sec id="s4">
<title>Experiments</title>
<p>This section presents the details of the experimental implementation of the proposed model. To validate the performance of the proposed tracker, we compared our method with state-of-the-art methods on four popular benchmarks. Additionally, ablation studies were conducted to analyze the effectiveness of the key modules.</p>
<sec>
<title>Implementation details</title>
<p>The tracking algorithm was implemented in Python based on PyTorch. The proposed model was trained on a PC with an Intel i7-11700K 3.6 GHz CPU, 64 GB of RAM, and an NVIDIA RTX 3080 Ti GPU. The training splits of LaSOT (Fan et al., <xref ref-type="bibr" rid="B18">2019</xref>), GOT-10k (Huang et al., <xref ref-type="bibr" rid="B28">2019</xref>), COCO (Lin et al., <xref ref-type="bibr" rid="B38">2014</xref>), and TrackingNet (Muller et al., <xref ref-type="bibr" rid="B46">2018</xref>) were used to train the model. We randomly selected image pairs from the same video sequence with a maximum gap of 100 frames to generate the search and template patches. Search patches were resized to 320 &#x000D7; 320 &#x000D7; 3 and template patches to 80 &#x000D7; 80 &#x000D7; 3. The parameters of the backbone network were initialized from ShuffleNetV2 pretrained on ImageNet. All models were trained for 150 epochs with a batch size of 32, and each epoch contained 60,000 sampling pairs. The coefficients in Equation (12) were set to &#x003BB;<sub><italic>cls</italic></sub> &#x0003D; 2, &#x003BB;<sub><italic>giou</italic></sub> &#x0003D; 2, and &#x003BB;<sub><italic>loc</italic></sub> &#x0003D; 5. In the offline training phase, the model parameters were optimized with the AdamW optimizer. The learning rate was set to 1e-5 for the backbone network and 1e-4 for the remaining parts.</p>
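<p>The pair-sampling rule described above (two frames drawn from the same video sequence, at most 100 frames apart) can be sketched as follows; the function name and the clamping behavior at sequence boundaries are our illustration, not the released training code:</p>

```python
import random

def sample_training_pair(num_frames, max_gap=100, rng=None):
    """Pick template/search frame indices from one video, at most max_gap apart."""
    rng = rng or random.Random()
    t = rng.randrange(num_frames)            # template frame index
    lo = max(0, t - max_gap)                 # clamp window to the sequence
    hi = min(num_frames - 1, t + max_gap)
    s = rng.randrange(lo, hi + 1)            # search frame within the gap
    return t, s
```

Each sampled pair is then cropped and resized to the 80 &#x000D7; 80 template and 320 &#x000D7; 320 search sizes stated above.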
</sec>
<sec>
<title>Comparisons with state-of-the-art methods</title>
<p>We compared SiamHFFT to state-of-the-art trackers on four benchmarks: LaSOT, UAV123 (Mueller et al., <xref ref-type="bibr" rid="B45">2016</xref>), UAV123&#x00040;10fps, and VOT2020 (Kristan et al., <xref ref-type="bibr" rid="B30">2020</xref>). The evaluation results are presented in the following paragraphs. It is worth noting that the performance (precision and success scores) of the compared methods on these benchmarks was obtained from the publicly released tracking result files provided by their authors.</p>
<p><bold>Evaluation on LaSOT:</bold> LaSOT is a large-scale long-term tracking benchmark consisting of 1,400 sequences. We used the test split and the one-pass evaluation (OPE) protocol to evaluate the compared trackers: the tracker is initialized with the target position given in the first frame of a video sequence and then predicts the target position and size over the whole video, from which the precision and success rate are computed.</p>
<p><xref ref-type="fig" rid="F4">Figures 4</xref>, <xref ref-type="fig" rid="F5">5</xref> report the precision and success plots of the compared trackers, respectively. The precision score is based on the center location error (CLE), the average Euclidean distance between the centers of the estimated bounding box and the ground truth. The CLE is calculated as follows:</p>
<disp-formula id="E17"><label>(13)</label><mml:math id="M38"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>C</mml:mi><mml:mi>L</mml:mi><mml:mi>E</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mtext class="textrm" mathvariant="normal">=</mml:mtext><mml:mtext>&#x000A0;</mml:mtext><mml:msqrt><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msqrt></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Precision scores of compared trackers on LaSOT.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082346-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Success scores of compared trackers on LaSOT.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082346-g0005.tif"/>
</fig>
<p>Once the CLE of each frame is obtained, the precision plot (<xref ref-type="fig" rid="F4">Figure 4</xref>) shows the percentage of frames, out of the total frames in the video sequence, in which the estimated CLE is lower than a given threshold (usually set to 20 pixels).</p>
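<p>A minimal sketch of the CLE of Equation (13) and the precision value at a single threshold, assuming boxes are given in (x, y, w, h) format; the helper names are ours:</p>

```python
def center_error(box_a, box_b):
    """CLE: Euclidean distance between box centers; boxes are (x, y, w, h)."""
    cax, cay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    cbx, cby = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return ((cax - cbx) ** 2 + (cay - cby) ** 2) ** 0.5

def precision_score(pred_boxes, gt_boxes, threshold=20.0):
    """Fraction of frames whose CLE is at most the threshold (20 px by default)."""
    hits = sum(center_error(p, g) <= threshold
               for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

Sweeping the threshold over a range of pixel values produces the full precision curve shown in Figure 4.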
<p>The success plot (<xref ref-type="fig" rid="F5">Figure 5</xref>) shows the percentage of frames, out of the total frames in the video sequence, in which the overlap rate between the estimated bounding box and the ground truth exceeds a given threshold (usually set to 0.5). The overlap rate is calculated as follows:</p>
<disp-formula id="E18"><label>(14)</label><mml:math id="M39"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02229;</mml:mo><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0222A;</mml:mo><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>b</italic><sub><italic>t</italic></sub> denotes the estimated bounding box, <italic>b</italic><sub><italic>g</italic></sub> denotes the ground truth bounding box, &#x02229; refers to the intersection operator, &#x0222A; to the union operator, and || denotes the number of pixels in the resulting region.</p>
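<p>Equation (14) and the success-rate computation can be sketched for axis-aligned boxes in (x1, y1, x2, y2) format; this is an illustrative area-based IoU rather than the pixel-count formulation, and the function names are ours:</p>

```python
def overlap(box_a, box_b):
    """Equation (14): IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames whose overlap with the ground truth exceeds the threshold."""
    hits = sum(overlap(p, g) > threshold
               for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

Sweeping the threshold from 0 to 1 yields the success curve of Figure 5; the area under that curve is the usual success score.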
<p>The curves of the proposed SiamHFFT are depicted in green. Overall, our tracker ranks third in precision and second in success, with a precision score of 61% and a success score of 62%. Compared with trackers with deeper backbones, such as SiamCAR, SiamBAN, and SiamRPN&#x0002B;&#x0002B; (Li B. et al., <xref ref-type="bibr" rid="B31">2019</xref>), our tracker exhibits competitive performance with a lighter structure. DiMP achieves the best performance in both precision and success, while our SiamHFFT outperforms the other Siamese-based trackers, even those with deeper backbones and elaborately designed structures.</p>
<p><bold>Evaluation on UAV123:</bold> UAV123 is an aerial tracking benchmark consisting of 123 videos featuring small objects, target occlusion, out-of-view targets, and distractors. To validate the performance of our tracker, we evaluated it against other state-of-the-art trackers, including SiamFC, ECO (Danelljan et al., <xref ref-type="bibr" rid="B14">2017</xref>), ATOM (Danelljan et al., <xref ref-type="bibr" rid="B13">2019</xref>), SiamAttn (Yu et al., <xref ref-type="bibr" rid="B63">2020</xref>), SiamRPN&#x0002B;&#x0002B;, SiamCAR, DiMP (Bhat et al., <xref ref-type="bibr" rid="B3">2019</xref>), SiamBAN, and HiFT. <xref ref-type="table" rid="T1">Table 1</xref> lists the results in terms of success, precision, and GPU speed. The backbones of the trackers are also reported for an intuitive comparison. The best result for each criterion is indicated in red.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Quantitative evaluation on UAV123 in terms of precision (Prec.), success (Succ.), and GPU speed (FPS).</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="center"><bold>SiamFC</bold></th>
<th valign="top" align="center"><bold>ECO</bold></th>
<th valign="top" align="center"><bold>ATOM</bold></th>
<th valign="top" align="center"><bold>SiamAttn</bold></th>
<th valign="top" align="center"><bold>SiamRPN&#x0002B;&#x0002B;</bold></th>
<th valign="top" align="center"><bold>SiamCAR</bold></th>
<th valign="top" align="center"><bold>DiMP</bold></th>
<th valign="top" align="center"><bold>SiamBAN</bold></th>
<th valign="top" align="center"><bold>HiFT</bold></th>
<th valign="top" align="center"><bold>SiamHFFT</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Feat.</td>
<td valign="top" align="center">Alex</td>
<td valign="top" align="center">VGG</td>
<td valign="top" align="center">R18</td>
<td valign="top" align="center">R50</td>
<td valign="top" align="center">R50</td>
<td valign="top" align="center">R50</td>
<td valign="top" align="center">R50</td>
<td valign="top" align="center">R50</td>
<td valign="top" align="center">Alex</td>
<td valign="top" align="center">ShuffleNet</td>
</tr>
<tr>
<td valign="top" align="left">Prec.</td>
<td valign="top" align="center">72.5</td>
<td valign="top" align="center">75.2</td>
<td valign="top" align="center">83.7</td>
<td valign="top" align="center">84.5</td>
<td valign="top" align="center">76.9</td>
<td valign="top" align="center">76</td>
<td valign="top" align="center"><inline-formula><mml:math id="M1a"><mml:mrow><mml:mstyle mathcolor="#ee1f23"><mml:mn>84</mml:mn><mml:mo>.</mml:mo><mml:mn>9</mml:mn></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">83.3</td>
<td valign="top" align="center">78.7</td>
<td valign="top" align="center">82.9</td>
</tr>
<tr>
<td valign="top" align="left">Succ.</td>
<td valign="top" align="center">49.4</td>
<td valign="top" align="center">52.8</td>
<td valign="top" align="center">64.2</td>
<td valign="top" align="center">65</td>
<td valign="top" align="center">57.9</td>
<td valign="top" align="center">61.4</td>
<td valign="top" align="center"><inline-formula><mml:math id="M2b"><mml:mrow><mml:mstyle mathcolor="#ee1f23"><mml:mn>65</mml:mn><mml:mo>.</mml:mo><mml:mn>4</mml:mn></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">63.1</td>
<td valign="top" align="center">58.9</td>
<td valign="top" align="center">62.6</td>
</tr>
<tr>
<td valign="top" align="left">FPS</td>
<td valign="top" align="center"><inline-formula><mml:math id="M3c"><mml:mrow><mml:mstyle mathcolor="#ee1f23"><mml:mn>130</mml:mn></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">45</td>
<td valign="top" align="center">46</td>
<td valign="top" align="center">45</td>
<td valign="top" align="center">35</td>
<td valign="top" align="center">52</td>
<td valign="top" align="center">45</td>
<td valign="top" align="center">40</td>
<td valign="top" align="center">/</td>
<td valign="top" align="center">68</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The best performance is shown in red.</p>
</table-wrap-foot>
</table-wrap>
<p>Among the compared trackers, those with deeper backbones, such as DiMP, ATOM, and SiamBAN, achieve better precision and success rates. SiamFC, HiFT, and the proposed SiamHFFT utilize lightweight backbones. SiamFC achieves the best speed, but its naive network structure does not achieve satisfactory precision or success rates. HiFT adopts a feature transformer to enhance feature representations; compared with HiFT, our tracker exhibits a clear advantage in terms of precision (82.8 vs. 78.7%) and success rate (62.5 vs. 58.9%), which validates the effectiveness of the proposed tracker. As the last row of <xref ref-type="table" rid="T1">Table 1</xref> shows, SiamHFFT runs in real time on a GPU at an average speed of 68 FPS, indicating that it maintains a suitable balance between performance and efficiency.</p>
<p><xref ref-type="fig" rid="F6">Figure 6</xref> depicts qualitative results of multiple algorithms on a subset of sequences from the UAV123 benchmark. We chose three challenging video sequences: Car18_1, Person21_1, and Group3_4_1. All three sequences were captured by UAV cameras, and the frames present multiple challenges, such as scale variation and viewpoint changes; the target generally appears at a small size throughout the tracking process. The bounding boxes estimated by the trackers are marked in different colors for an intuitive contrast. The bounding box of our SiamHFFT is shown in red; our tracker handles these complex scenarios well, especially in the small-object tracking task.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Qualitative experimental results in several challenging sequences on UAV123 dataset. <bold>(A)</bold> Video sequences of the Car, <bold>(B)</bold> video sequences of the Person, and <bold>(C)</bold> video sequences of the Group.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082346-g0006.tif"/>
</fig>
<p><bold>UAV123&#x00040;10fps</bold>: UAV123&#x00040;10fps is a subset of UAV123 obtained by down-sampling the original videos to a frame rate of 10 FPS. We used SiamFC, AutoTrack (Li et al., <xref ref-type="bibr" rid="B35">2020</xref>), TADT (Li X. et al., <xref ref-type="bibr" rid="B34">2019</xref>), MCCT (Wang et al., <xref ref-type="bibr" rid="B56">2018</xref>), SiamRPN&#x0002B;&#x0002B;, DeepSTRCF (Li F. et al., <xref ref-type="bibr" rid="B33">2018</xref>), CCOT (Danelljan et al., <xref ref-type="bibr" rid="B15">2016</xref>), ECO, and HiFT for comparison. Among these, AutoTrack, TADT, MCCT, CCOT, ECO, and DeepSTRCF are correlation filter-based trackers, which have lightweight structures and fewer parameters than deep learning-based trackers and can therefore be deployed on resource-limited devices. Compared with the UAV123 benchmark, the challenges in the UAV123&#x00040;10fps dataset are more abrupt and severe. The experimental results are listed in <xref ref-type="table" rid="T2">Table 2</xref>. The deep trackers HiFT and SiamRPN&#x0002B;&#x0002B; achieve higher precision and success scores than the correlation filter-based trackers, whereas the performance of SiamFC is closer to that of the correlation filter-based methods: SiamFC utilizes AlexNet as its backbone but does not further enhance the feature representation. Our SiamHFFT model yields the best precision (76.5%) and success rate (59.5%), outperforming HiFT by 1.1 and 2.1%, respectively, which demonstrates the effectiveness of the HFFT module and its superior robustness compared with other prevalent trackers.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Overall evaluation on UAV123&#x00040;10fps.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="center"><bold>SiamFC</bold></th>
<th valign="top" align="center"><bold>AutoTrack</bold></th>
<th valign="top" align="center"><bold>TADT</bold></th>
<th valign="top" align="center"><bold>MCCT</bold></th>
<th valign="top" align="center"><bold>SiamRPN&#x0002B;&#x0002B;</bold></th>
<th valign="top" align="center"><bold>DeepSTRCF</bold></th>
<th valign="top" align="center"><bold>CCOT</bold></th>
<th valign="top" align="center"><bold>ECO</bold></th>
<th valign="top" align="center"><bold>HiFT</bold></th>
<th valign="top" align="center"><bold>SiamHFFT</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Prec.</td>
<td valign="top" align="center">67.8</td>
<td valign="top" align="center">67.6</td>
<td valign="top" align="center">68.4</td>
<td valign="top" align="center">68.1</td>
<td valign="top" align="center">74.0</td>
<td valign="top" align="center">68.0</td>
<td valign="top" align="center">70.4</td>
<td valign="top" align="center">70.9</td>
<td valign="top" align="center">75.4</td>
<td valign="top" align="center"><inline-formula><mml:math id="M1d"><mml:mrow><mml:mstyle mathcolor="#ee1f23"><mml:mn>76</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td valign="top" align="left">Succ.</td>
<td valign="top" align="center">47.2</td>
<td valign="top" align="center">48.1</td>
<td valign="top" align="center">50.7</td>
<td valign="top" align="center">49.2</td>
<td valign="top" align="center">55.5</td>
<td valign="top" align="center">49.9</td>
<td valign="top" align="center">50.2</td>
<td valign="top" align="center">51.9</td>
<td valign="top" align="center">57.4</td>
<td valign="top" align="center"><inline-formula><mml:math id="M1e"><mml:mrow><mml:mstyle mathcolor="#ee1f23"><mml:mn>59</mml:mn><mml:mo>.</mml:mo><mml:mn>6</mml:mn></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The best performance is shown in red.</p>
</table-wrap-foot>
</table-wrap>
<p><bold>Evaluation on VOT2020</bold>: We also tested SiamHFFT on the VOT2020 benchmark against HCAT, LightTrack (Yan et al., <xref ref-type="bibr" rid="B60">2021b</xref>), ATOM, and DiMP. VOT2020 consists of 60 videos with mask annotations and adopts the expected average overlap (EAO) as the metric for evaluating tracker performance, which is calculated as:</p>
<disp-formula id="E19"><label>(15)</label><mml:math id="M40"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mover accent="true"><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>N</italic><sub><italic>S</italic></sub> denotes the length of the video sequences and &#x003D5;<sub><italic>N</italic><sub><italic>S</italic></sub></sub> denotes the average accuracy over a video sequence of length <italic>N</italic><sub><italic>S</italic></sub>. Finally, the EAO value is obtained by averaging these values over the video sequences of length <italic>N</italic><sub><italic>S</italic></sub>.</p>
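<p>A simplified sketch of the averaging behind Equation (15); the official VOT toolkit additionally handles tracker resets and averages over an interval of sequence lengths, so this is only the core idea under an assumed input format (a mapping from sequence length to the per-sequence average overlaps), and the function name is ours:</p>

```python
def eao(per_length_overlaps):
    """Simplified EAO: average the per-sequence mean overlaps for each
    sequence length, then average the per-length means."""
    per_length_means = [sum(vals) / len(vals)
                        for vals in per_length_overlaps.values()]
    return sum(per_length_means) / len(per_length_means)
```

For example, two sequences of length 10 with mean overlaps 0.5 and 0.7 plus one sequence of length 20 with mean overlap 0.6 give an EAO of 0.6 under this simplification.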
<p>The experimental results are presented in <xref ref-type="table" rid="T3">Table 3</xref>. Our tracker achieves an EAO value of 0.231, robustness of 0.646, and accuracy of 0.459. The performance of SiamHFFT is comparable to that of the state-of-the-art models for each criterion.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Evaluation on VOT2020.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="center"><bold>HCAT</bold></th>
<th valign="top" align="center"><bold>LightTrack</bold></th>
<th valign="top" align="center"><bold>ATOM</bold></th>
<th valign="top" align="center"><bold>DiMP</bold></th>
<th valign="top" align="center"><bold>SiamHFFT</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">EAO</td>
<td valign="top" align="center"><inline-formula><mml:math id="M1f"><mml:mrow><mml:mstyle mathcolor="#ee1f23"><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>276</mml:mn></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">0.242</td>
<td valign="top" align="center">0.271</td>
<td valign="top" align="center">0.274</td>
<td valign="top" align="center">0.231</td>
</tr>
<tr>
<td valign="top" align="left">Accuracy</td>
<td valign="top" align="center">0.455</td>
<td valign="top" align="center">0.422</td>
<td valign="top" align="center"><inline-formula><mml:math id="M1g"><mml:mrow><mml:mstyle mathcolor="#ee1f23"><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>462</mml:mn></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">0.457</td>
<td valign="top" align="center">0.459</td>
</tr>
<tr>
<td valign="top" align="left">Robustness</td>
<td valign="top" align="center"><inline-formula><mml:math id="M1h"><mml:mrow><mml:mstyle mathcolor="#ee1f23"><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>747</mml:mn></mml:mstyle></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">0.689</td>
<td valign="top" align="center">0.734</td>
<td valign="top" align="center">0.740</td>
<td valign="top" align="center">0.646</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The best performance is shown in red.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>Speed, FLOPs, and Params</title>
<p>To verify the efficiency of our tracker, we conducted a set of experiments on the GOT-10k benchmark, a large-scale tracking dataset consisting of more than 10,000 videos covering a wide range of 560 classes of common moving objects. Following the GOT-10k test protocol, all evaluated trackers were trained on the same training data and tested on the same test data. We evaluated the performance of SiamHFFT against TransT, STARK, DiMP, SiamRPN&#x0002B;&#x0002B;, ECO, ATOM, and LightTrack. Our SiamHFFT was evaluated on our PC, while the GOT-10k results of the other trackers were obtained from Chen et al. (<xref ref-type="bibr" rid="B9">2022b</xref>). Both the average overlap (AO) and speed were considered when evaluating the trackers. We visualize the AO performance with respect to the tracking speed in frames per second (FPS). The comparison results are presented in <xref ref-type="fig" rid="F7">Figure 7</xref>. Each tracker is represented by a circle, and the radius of the circle <italic>r</italic> is calculated as follows:</p>
<disp-formula id="E20"><label>(16)</label><mml:math id="M41"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>r</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mi>k</mml:mi><mml:mfrac><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mo>/</mml:mo><mml:mi>A</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mi>p</mml:mi><mml:mi>e</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>O</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Speed and performance comparisons on GOT-10k. The horizontal axis represents model speed on a CPU and the vertical axis represents the AO score.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082346-g0007.tif"/>
</fig>
<p>where <italic>k</italic> denotes a scale factor; we set <italic>k</italic>&#x0003D;10. A higher value of <italic>r</italic> indicates better performance. All trackers were tested on a CPU platform, and the real-time threshold (26 FPS) is represented by a dotted line; trackers located to the right of the line are considered to achieve real-time performance. According to <xref ref-type="fig" rid="F7">Figure 7</xref>, only SiamHFFT and LightTrack meet the real-time requirement on the CPU. Among the compared trackers, TransT utilizes a modified ResNet-50 as its backbone and a transformer-based network to obtain discriminative features; it achieves the highest AO score but sacrifices speed, running slowly on the CPU. Similarly, STARK, DiMP, prDiMP, and SiamRPN&#x0002B;&#x0002B; obtain satisfactory AO scores only at the expense of speed. The correlation filter-based tracker ECO also adopts deep features and does not achieve a satisfactory speed on the CPU. Our tracker runs at an average speed of 28 FPS on the CPU, not only meeting the real-time requirement but also yielding the second-largest circle area among all the trackers.</p>
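<p>Transcribing Equation (16) as printed, the circle radius of Figure 7 can be computed as follows; the function name and argument layout are our illustration:</p>

```python
def circle_radius(speed, all_speeds, ao, k=10.0):
    """Equation (16) as printed: r = k * (speed / average(speed)) / AO."""
    avg_speed = sum(all_speeds) / len(all_speeds)
    return k * (speed / avg_speed) / ao
```

For a tracker running at exactly the average speed with an AO of 0.5 and k = 10, the radius evaluates to 20.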
<p>To validate the lightness of our model, we compared its floating-point operations (FLOPs) and parameters (Params) with those of STARK-S50 and SiamRPN&#x0002B;&#x0002B;. FLOPs represent the theoretical computation of the model, that is, the number of operations required for inference. Params refers to the number of parameters in the model, which directly determines the model size and also directly affects the memory consumption during inference. The comparison results are presented in <xref ref-type="table" rid="T4">Table 4</xref>. It is worth noting that our SiamHFFT tracker achieves a promising result over the other trackers: its FLOPs and Params are roughly 16&#x000D7; and 5&#x000D7; lower than those of STARK-S50, respectively. This shows that our method uses fewer parameters and less memory, making deployment in edge hardware environments possible.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Comparison of FLOPs and Params.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Trackers</bold></th>
<th valign="top" align="center"><bold>FLOPs (G)</bold></th>
<th valign="top" align="center"><bold>Params (M)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">STARK-S50</td>
<td valign="top" align="center">10.5</td>
<td valign="top" align="center">23.3</td>
</tr>
<tr>
<td valign="top" align="left">SiamRPN&#x0002B;&#x0002B;</td>
<td valign="top" align="center">48.9</td>
<td valign="top" align="center">54</td>
</tr>
<tr>
<td valign="top" align="left">SiamHFFT</td>
<td valign="top" align="center">0.6</td>
<td valign="top" align="center">4.4</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>Ablation studies</title>
<p>This section presents ablation studies conducted to verify the effectiveness of our framework. We selected several challenging frames from the UAV123 dataset and visualized the tracking results using heatmaps, as shown in <xref ref-type="fig" rid="F8">Figure 8</xref>. The first column presents the given target, highlighted with a red box, and the remaining columns present the visualized responses for the predicted target in the current frame.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Visualization of the confidence maps of three trackers on several sequences from the UAV123 dataset. The response visualization results are an intuitive reflection of tracker performance.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082346-g0008.tif"/>
</fig>
<p>The second column presents the visualization results of the baseline, which adopts only ShuffleNetV2 as the backbone together with the reshaping module and the prediction head. The response area of the baseline is much larger than the actual target size and has blurred edges, affected by distractors in the frames.</p>
<p>The third column presents the visualization results of the baseline with the HFFT module. Compared with the baseline alone, the response area is smaller and clearer because the HFFT module enhances the critical semantic and spatial features of the target, enabling the model to generate more discriminative response maps. With the HFFT module, our tracker achieves significant improvement in tracking accuracy, which validates the effectiveness of the HFFT module for handling small objects.</p>
<p>The last column presents the response map generated by the proposed SiamHFFT, which adopts the entire operation module, backbone, reshaping module, HFFT module and the SAM, where the classification and regression head are utilized to estimate the location of a target. According to the visualization results of the response maps, our SiamHFFT model has clear advantages over other modified versions. The response areas are more precise and discriminative relative to the distractors.</p>
<p>We also tested the performance on the UAV123 benchmark with different backbones, using the accuracy score to measure the performance variation. The experimental results are shown in <xref ref-type="table" rid="T5">Table 5</xref>; we chose two lightweight networks, AlexNet and ShuffleNetV2, for comparison. As in <xref ref-type="fig" rid="F8">Figure 8</xref>, the effectiveness of the HFFT module is measured quantitatively. The model that adopts ShuffleNetV2 as its backbone performs better in all three configurations. The results in <xref ref-type="table" rid="T5">Table 5</xref> also demonstrate the effectiveness of the HFFT module.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Experimental results on the UAV123 benchmark with different backbones.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="center"><bold>Baseline</bold></th>
<th valign="top" align="center"><bold>Baseline&#x0002B;HFFT</bold></th>
<th valign="top" align="center"><bold>SiamHFFT</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">AlexNet</td>
<td valign="top" align="center">73.6</td>
<td valign="top" align="center">77.2</td>
<td valign="top" align="center">78.9</td>
</tr>
<tr>
<td valign="top" align="left">ShuffleNetV2</td>
<td valign="top" align="center">74.1</td>
<td valign="top" align="center">81.6</td>
<td valign="top" align="center">82.8</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>Conclusion</title>
<p>In this paper, an HFFT tracking method based on a Siamese network was proposed. To integrate and optimize multi-level features, we designed a novel feature fusion transformer that reinforces semantic information and spatial details during the tracking process. Additionally, our lightweight backbone avoids excessive computation for feature extraction, which accelerates tracking. To validate the effectiveness of our tracker, extensive experiments were conducted on five benchmarks. Our method achieves excellent results on small-target datasets such as UAV123 and UAV123&#x00040;10fps, and also performs well on generic public visual tracking datasets such as LaSOT, VOT2020, and GOT-10k. Our method can potentially inspire further research on small object tracking, particularly for UAV tracking.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="sec" rid="s10">Supplementary material</xref>, further inquiries can be directed to the corresponding author/s.</p>
</sec>
<sec id="s7">
<title>Author contributions</title>
<p>Conceptualization, methodology, software, validation, formal analysis, data curation, and writing&#x02014;original draft preparation: JD. Investigation: SW. Resources, writing&#x02014;review and editing, supervision, and funding acquisition: YC. Visualization: YF. All authors have read and agreed to the published version of the manuscript.</p>
</sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This research was funded by the National Natural Science Foundation of China (No. 11975066), for which YC is the project leader. YC received his Ph.D. degree from Jilin University in 2002. He is a Professor with Jilin University and also with Dalian University of Technology. His research interests include CMOS image sensors and digital signal processing of images.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec sec-type="supplementary-material" id="s10">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fnbot.2022.1082346/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fnbot.2022.1082346/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.docx" id="SM1" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Beal</surname> <given-names>J.</given-names></name> <name><surname>Kim</surname> <given-names>E.</given-names></name> <name><surname>Tzeng</surname> <given-names>E.</given-names></name> <name><surname>Park</surname> <given-names>D. H.</given-names></name> <name><surname>Zhai</surname> <given-names>A.</given-names></name> <name><surname>Kislyuk</surname> <given-names>D. J.</given-names></name></person-group> (<year>2020</year>). <article-title>Toward transformer-based object detection</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2012.09958</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bertinetto</surname> <given-names>L.</given-names></name> <name><surname>Valmadre</surname> <given-names>J.</given-names></name> <name><surname>Henriques</surname> <given-names>J. F.</given-names></name> <name><surname>Vedaldi</surname> <given-names>A.</given-names></name> <name><surname>Torr</surname> <given-names>P. H.</given-names></name></person-group> (<year>2016</year>). <article-title>Fully-convolutional siamese networks for object tracking</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>850</fpage>&#x02013;<lpage>865</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-48881-3_56</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bhat</surname> <given-names>G.</given-names></name> <name><surname>Danelljan</surname> <given-names>M.</given-names></name> <name><surname>Gool</surname> <given-names>L. V.</given-names></name> <name><surname>Timofte</surname> <given-names>R.</given-names></name></person-group> (<year>2019</year>). <article-title>Learning discriminative model prediction for tracking</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6182</fpage>&#x02013;<lpage>6191</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00628</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>Z.</given-names></name> <name><surname>Fu</surname> <given-names>C.</given-names></name> <name><surname>Ye</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>HiFT: hierarchical feature transformer for aerial tracking</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>15457</fpage>&#x02013;<lpage>15466</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.01517</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Carion</surname> <given-names>N.</given-names></name> <name><surname>Massa</surname> <given-names>F.</given-names></name> <name><surname>Synnaeve</surname> <given-names>G.</given-names></name> <name><surname>Usunier</surname> <given-names>N.</given-names></name> <name><surname>Kirillov</surname> <given-names>A.</given-names></name> <name><surname>Zagoruyko</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>End-to-end object detection with transformers</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>213</fpage>&#x02013;<lpage>229</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-58452-8_13</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>B.</given-names></name> <name><surname>Li</surname> <given-names>P.</given-names></name> <name><surname>Bai</surname> <given-names>L.</given-names></name> <name><surname>Qiao</surname> <given-names>L.</given-names></name> <name><surname>Shen</surname> <given-names>Q.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Backbone is all your need: a simplified architecture for visual object tracking</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.1007/978-3-031-20047-2_22</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>C. -F. R.</given-names></name> <name><surname>Fan</surname> <given-names>Q.</given-names></name> <name><surname>Panda</surname> <given-names>R.</given-names></name></person-group> (<year>2021</year>). <article-title>Crossvit: cross-attention multi-scale vision transformer for image classification</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>357</fpage>&#x02013;<lpage>366</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00041</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>L.</given-names></name> <name><surname>Lu</surname> <given-names>K.</given-names></name> <name><surname>Rajeswaran</surname> <given-names>A.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Grover</surname> <given-names>A.</given-names></name> <name><surname>Laskin</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Decision transformer: reinforcement learning <italic>via</italic> sequence modeling</article-title>. <source>Adv. Neural Inform. Process. Syst</source>. <volume>34</volume>, <fpage>15084</fpage>&#x02013;<lpage>15097</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Li</surname> <given-names>D.</given-names></name> <name><surname>Lu</surname> <given-names>H.</given-names></name></person-group> (<year>2022b</year>). <article-title>Efficient visual tracking via hierarchical cross-attention transformer</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2203.13537</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Yan</surname> <given-names>B.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Lu</surname> <given-names>H.</given-names></name></person-group> (<year>2022a</year>). <article-title>High-performance transformer tracking</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2203.13533</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Yan</surname> <given-names>B.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Yang</surname> <given-names>X.</given-names></name> <name><surname>Lu</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>Transformer tracking</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-name>IEEE</publisher-name>), <fpage>8126</fpage>&#x02013;<lpage>8135</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR46437.2021.00803</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Zhong</surname> <given-names>B.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Ji</surname> <given-names>R.</given-names></name></person-group> (<year>2020</year>). <article-title>Siamese box adaptive network for visual tracking</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6668</fpage>&#x02013;<lpage>6677</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00670</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Danelljan</surname> <given-names>M.</given-names></name> <name><surname>Bhat</surname> <given-names>G.</given-names></name> <name><surname>Khan</surname> <given-names>F. S.</given-names></name> <name><surname>Felsberg</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>Atom: accurate tracking by overlap maximization</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4660</fpage>&#x02013;<lpage>4669</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00479</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Danelljan</surname> <given-names>M.</given-names></name> <name><surname>Bhat</surname> <given-names>G.</given-names></name> <name><surname>Shahbaz Khan</surname> <given-names>F.</given-names></name> <name><surname>Felsberg</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Eco: efficient convolution operators for tracking</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6638</fpage>&#x02013;<lpage>6646</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.733</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Danelljan</surname> <given-names>M.</given-names></name> <name><surname>Robinson</surname> <given-names>A.</given-names></name> <name><surname>Shahbaz Khan</surname> <given-names>F.</given-names></name> <name><surname>Felsberg</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Beyond correlation filters: learning continuous convolution operators for visual tracking</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>472</fpage>&#x02013;<lpage>488</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46454-1_29</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Devlin</surname> <given-names>J.</given-names></name> <name><surname>Chang</surname> <given-names>M.-W.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Toutanova</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <article-title>Bert: pre-training of deep bidirectional transformers for language understanding</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1810.04805</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dosovitskiy</surname> <given-names>A.</given-names></name> <name><surname>Beyer</surname> <given-names>L.</given-names></name> <name><surname>Kolesnikov</surname> <given-names>A.</given-names></name> <name><surname>Weissenborn</surname> <given-names>D.</given-names></name> <name><surname>Zhai</surname> <given-names>X.</given-names></name> <name><surname>Unterthiner</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>An image is worth 16 &#x000D7; 16 words: transformers for image recognition at scale</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2010.11929</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fan</surname> <given-names>H.</given-names></name> <name><surname>Lin</surname> <given-names>L.</given-names></name> <name><surname>Yang</surname> <given-names>F.</given-names></name> <name><surname>Chu</surname> <given-names>P.</given-names></name> <name><surname>Deng</surname> <given-names>G.</given-names></name> <name><surname>Yu</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Lasot: a high-quality benchmark for large-scale single object tracking</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5374</fpage>&#x02013;<lpage>5383</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00552</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fan</surname> <given-names>H.</given-names></name> <name><surname>Ling</surname> <given-names>H.</given-names></name></person-group> (<year>2019</year>). <article-title>Siamese cascaded region proposal networks for real-time visual tracking</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7952</fpage>&#x02013;<lpage>7961</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00814</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fan</surname> <given-names>H.</given-names></name> <name><surname>Ling</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>Cract: cascaded regression-align-classification for robust visual tracking</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.1109/IROS51168.2021.9636803</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>D.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Cui</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Chen</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>SiamCAR: siamese fully convolutional classification and regression for visual tracking</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6269</fpage>&#x02013;<lpage>6277</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00630</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>K.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>H.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Guo</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>A survey on vision transformer</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <fpage>1</fpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2022.3152247</pub-id><pub-id pub-id-type="pmid">35180075</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep residual learning for image recognition</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Lin</surname> <given-names>Z.</given-names></name></person-group> (<year>2021</year>). <article-title>Spatial-spectral transformer for hyperspectral image classification</article-title>. <source>Remote Sens.</source> <volume>13</volume>, <fpage>498</fpage>. <pub-id pub-id-type="doi">10.3390/rs13030498</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hou</surname> <given-names>R.</given-names></name> <name><surname>Ma</surname> <given-names>B.</given-names></name> <name><surname>Chang</surname> <given-names>H.</given-names></name> <name><surname>Gu</surname> <given-names>X.</given-names></name> <name><surname>Shan</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>IAUnet: global context-aware feature learning for person reidentification</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> <volume>32</volume>, <fpage>4460</fpage>&#x02013;<lpage>4474</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2020.3017939</pub-id><pub-id pub-id-type="pmid">32877342</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Howard</surname> <given-names>A. G.</given-names></name> <name><surname>Zhu</surname> <given-names>M.</given-names></name> <name><surname>Chen</surname> <given-names>B.</given-names></name> <name><surname>Kalenichenko</surname> <given-names>D.</given-names></name> <name><surname>Wang</surname> <given-names>W.</given-names></name> <name><surname>Weyand</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Mobilenets: efficient convolutional neural networks for mobile vision applications</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1704.04861</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>J.</given-names></name> <name><surname>Shen</surname> <given-names>L.</given-names></name> <name><surname>Sun</surname> <given-names>G.</given-names></name></person-group> (<year>2018</year>). <article-title>Squeeze-and-excitation networks</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7132</fpage>&#x02013;<lpage>7141</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00745</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>L.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>Got-10k: a large high-diversity benchmark for generic object tracking in the wild</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>43</volume>, <fpage>1562</fpage>&#x02013;<lpage>1577</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2019.2957464</pub-id><pub-id pub-id-type="pmid">31804928</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Javed</surname> <given-names>S.</given-names></name> <name><surname>Danelljan</surname> <given-names>M.</given-names></name> <name><surname>Khan</surname> <given-names>F. S.</given-names></name> <name><surname>Khan</surname> <given-names>M. H.</given-names></name> <name><surname>Felsberg</surname> <given-names>M.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Visual object tracking with discriminative filters and siamese networks: a survey and outlook</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <fpage>1</fpage>&#x02013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2022.3212594</pub-id><pub-id pub-id-type="pmid">36215368</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kristan</surname> <given-names>M.</given-names></name> <name><surname>Leonardis</surname> <given-names>A.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name> <name><surname>Felsberg</surname> <given-names>M.</given-names></name> <name><surname>Pflugfelder</surname> <given-names>R.</given-names></name> <name><surname>K&#x000E4;m&#x000E4;r&#x000E4;inen</surname> <given-names>J.-K.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>The eighth visual object tracking VOT2020 challenge results</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>547</fpage>&#x02013;<lpage>601</lpage>.</citation>
</ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Wu</surname> <given-names>W.</given-names></name> <name><surname>Wang</surname> <given-names>Q.</given-names></name> <name><surname>Zhang</surname> <given-names>F.</given-names></name> <name><surname>Xing</surname> <given-names>J.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Siamrpn&#x0002B;&#x0002B;: evolution of siamese visual tracking with very deep networks</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4282</fpage>&#x02013;<lpage>4291</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00441</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Wu</surname> <given-names>W.</given-names></name> <name><surname>Zhu</surname> <given-names>Z.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name></person-group> (<year>2018</year>). <article-title>High performance visual tracking with siamese region proposal network</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>8971</fpage>&#x02013;<lpage>8980</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00935</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>F.</given-names></name> <name><surname>Tian</surname> <given-names>C.</given-names></name> <name><surname>Zuo</surname> <given-names>W.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Yang</surname> <given-names>M.-H.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning spatial-temporal regularized correlation filters for visual tracking</article-title>, in <source>Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4904</fpage>&#x02013;<lpage>4913</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00515</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Ma</surname> <given-names>C.</given-names></name> <name><surname>Wu</surname> <given-names>B.</given-names></name> <name><surname>He</surname> <given-names>Z.</given-names></name> <name><surname>Yang</surname> <given-names>M.-H.</given-names></name></person-group> (<year>2019</year>). <article-title>Target-aware deep tracking</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1369</fpage>&#x02013;<lpage>1378</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00146</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Fu</surname> <given-names>C.</given-names></name> <name><surname>Ding</surname> <given-names>F.</given-names></name> <name><surname>Huang</surname> <given-names>Z.</given-names></name> <name><surname>Lu</surname> <given-names>G.</given-names></name></person-group> (<year>2020</year>). <article-title>AutoTrack: towards high-performance visual tracking for UAV with automatic spatio-temporal regularization</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>11923</fpage>&#x02013;<lpage>11932</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.01194</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>L.</given-names></name> <name><surname>Fan</surname> <given-names>H.</given-names></name> <name><surname>Xu</surname> <given-names>Y.</given-names></name> <name><surname>Ling</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>Swintrack: a simple and strong baseline for transformer tracking</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2112.00995</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Hariharan</surname> <given-names>B.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>Feature pyramid networks for object detection</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2117</fpage>&#x02013;<lpage>2125</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.106</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Maire</surname> <given-names>M.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name> <name><surname>Hays</surname> <given-names>J.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name> <name><surname>Ramanan</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Microsoft coco: common objects in context</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>740</fpage>&#x02013;<lpage>755</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>Z.</given-names></name> <name><surname>Feng</surname> <given-names>M.</given-names></name> <name><surname>Santos</surname> <given-names>C. N.</given-names></name> <name><surname>Yu</surname> <given-names>M.</given-names></name> <name><surname>Xiang</surname> <given-names>B.</given-names></name> <name><surname>Zhou</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>A structured self-attentive sentence embedding</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1703.03130</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Hamilton</surname> <given-names>W.</given-names></name> <name><surname>Long</surname> <given-names>G.</given-names></name> <name><surname>Jiang</surname> <given-names>J.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>A universal representation transformer layer for few-shot image classification</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2006.11702</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>Y.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Hu</surname> <given-names>H.</given-names></name> <name><surname>Wei</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Swin transformer: hierarchical vision transformer using shifted windows</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>10012</fpage>&#x02013;<lpage>10022</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00986</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>N.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Zheng</surname> <given-names>H.-T.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>ShuffleNet V2: practical guidelines for efficient CNN architecture design</article-title>, in <source>Proceedings of the European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Munich</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>116</fpage>&#x02013;<lpage>131</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-01264-9_8</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marvasti-Zadeh</surname> <given-names>S. M.</given-names></name> <name><surname>Cheng</surname> <given-names>L.</given-names></name> <name><surname>Ghanei-Yakhdan</surname> <given-names>H.</given-names></name> <name><surname>Kasaei</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Deep learning for visual tracking: a comprehensive survey</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>23</volume>, <fpage>3943</fpage>&#x02013;<lpage>3968</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2020.3046478</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mayer</surname> <given-names>C.</given-names></name> <name><surname>Danelljan</surname> <given-names>M.</given-names></name> <name><surname>Bhat</surname> <given-names>G.</given-names></name> <name><surname>Paul</surname> <given-names>M.</given-names></name> <name><surname>Paudel</surname> <given-names>D. P.</given-names></name> <name><surname>Yu</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Transforming model prediction for tracking</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>8731</fpage>&#x02013;<lpage>8740</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.00853</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mueller</surname> <given-names>M.</given-names></name> <name><surname>Smith</surname> <given-names>N.</given-names></name> <name><surname>Ghanem</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). <article-title>A benchmark and simulator for uav tracking</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>445</fpage>&#x02013;<lpage>461</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46448-0_27</pub-id></citation>
</ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Muller</surname> <given-names>M.</given-names></name> <name><surname>Bibi</surname> <given-names>A.</given-names></name> <name><surname>Giancola</surname> <given-names>S.</given-names></name> <name><surname>Alsubaihi</surname> <given-names>S.</given-names></name> <name><surname>Ghanem</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>TrackingNet: a large-scale dataset and benchmark for object tracking in the wild</article-title>, in <source>Proceedings of the European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Munich</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>300</fpage>&#x02013;<lpage>317</lpage>.</citation></ref>
<ref id="B47">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nguyen</surname> <given-names>V.-Q.</given-names></name> <name><surname>Suganuma</surname> <given-names>M.</given-names></name> <name><surname>Okatani</surname> <given-names>T.</given-names></name></person-group> (<year>2020</year>). <article-title>Efficient attention mechanism for visual dialog that can handle all the interactions between multiple inputs</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>223</fpage>&#x02013;<lpage>240</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-58586-0_14</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ning</surname> <given-names>X.</given-names></name> <name><surname>Duan</surname> <given-names>P.</given-names></name> <name><surname>Li</surname> <given-names>W.</given-names></name> <name><surname>Zhang</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>Real-time 3D face alignment using an encoder-decoder network with an efficient deconvolution layer</article-title>. <source>IEEE Signal Process. Lett</source>. <volume>27</volume>, <fpage>1944</fpage>&#x02013;<lpage>1948</lpage>. <pub-id pub-id-type="doi">10.1109/LSP.2020.3032277</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Parisotto</surname> <given-names>E.</given-names></name> <name><surname>Song</surname> <given-names>F.</given-names></name> <name><surname>Rae</surname> <given-names>J.</given-names></name> <name><surname>Pascanu</surname> <given-names>R.</given-names></name> <name><surname>Gulcehre</surname> <given-names>C.</given-names></name> <name><surname>Jayakumar</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Stabilizing transformers for reinforcement learning</article-title>, in <source>International Conference on Machine Learning</source> (<publisher-name>PMLR</publisher-name>), <fpage>7487</fpage>&#x02013;<lpage>7498</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paulus</surname> <given-names>R.</given-names></name> <name><surname>Xiong</surname> <given-names>C.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>A deep reinforced model for abstractive summarization</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1705.04304</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qingyun</surname> <given-names>F.</given-names></name> <name><surname>Dapeng</surname> <given-names>H.</given-names></name> <name><surname>Zhaokui</surname> <given-names>W.</given-names></name></person-group> (<year>2021</year>). <article-title>Cross-modality fusion transformer for multispectral object detection</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2111.00273</pub-id></citation>
</ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sandler</surname> <given-names>M.</given-names></name> <name><surname>Howard</surname> <given-names>A.</given-names></name> <name><surname>Zhu</surname> <given-names>M.</given-names></name> <name><surname>Zhmoginov</surname> <given-names>A.</given-names></name> <name><surname>Chen</surname> <given-names>L.-C.</given-names></name></person-group> (<year>2018</year>). <article-title>MobileNetV2: inverted residuals and linear bottlenecks</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4510</fpage>&#x02013;<lpage>4520</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00474</pub-id></citation>
</ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Attention is all you need</article-title>. <source>Adv. Neural Inform. Process. Syst</source>. <volume>30</volume>, <fpage>6000</fpage>&#x02013;<lpage>6010</lpage>.</citation>
</ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>F.</given-names></name> <name><surname>Jiang</surname> <given-names>M.</given-names></name> <name><surname>Qian</surname> <given-names>C.</given-names></name> <name><surname>Yang</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Residual attention network for image classification</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3156</fpage>&#x02013;<lpage>3164</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.683</pub-id></citation></ref>
<ref id="B55">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>N.</given-names></name> <name><surname>Zhou</surname> <given-names>W.</given-names></name> <name><surname>Tian</surname> <given-names>Q.</given-names></name> <name><surname>Hong</surname> <given-names>R.</given-names></name> <name><surname>Wang</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>Multi-cue correlation filters for robust visual tracking</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4844</fpage>&#x02013;<lpage>4853</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00509</pub-id></citation>
</ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Gupta</surname> <given-names>A.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <article-title>Non-local neural networks</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7794</fpage>&#x02013;<lpage>7803</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00813</pub-id></citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wolfe</surname> <given-names>J. M.</given-names></name> <name><surname>Horowitz</surname> <given-names>T. S.</given-names></name></person-group> (<year>2004</year>). <article-title>What attributes guide the deployment of visual attention and how do they do it?</article-title> <source>Nat. Rev. Neurosci</source>. <volume>5</volume>, <fpage>495</fpage>&#x02013;<lpage>501</lpage>. <pub-id pub-id-type="doi">10.1038/nrn1411</pub-id><pub-id pub-id-type="pmid">15152199</pub-id></citation></ref>
<ref id="B58">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Yuan</surname> <given-names>Y.</given-names></name> <name><surname>Yu</surname> <given-names>G.</given-names></name></person-group> (<year>2020</year>). <article-title>SiamFC&#x0002B;&#x0002B;: towards robust and accurate visual tracking with target estimation guidelines</article-title>, in <source>Proceedings of the AAAI Conference on Artificial Intelligence</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>AAAI</publisher-name>), <fpage>12549</fpage>&#x02013;<lpage>12556</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v34i07.6944</pub-id></citation>
</ref>
<ref id="B59">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yan</surname> <given-names>B.</given-names></name> <name><surname>Peng</surname> <given-names>H.</given-names></name> <name><surname>Fu</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Lu</surname> <given-names>H.</given-names></name></person-group> (<year>2021a</year>). <article-title>Learning spatio-temporal transformer for visual tracking</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>10448</fpage>&#x02013;<lpage>10457</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.01028</pub-id></citation></ref>
<ref id="B60">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yan</surname> <given-names>B.</given-names></name> <name><surname>Peng</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>K.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Fu</surname> <given-names>J.</given-names></name> <name><surname>Lu</surname> <given-names>H.</given-names></name></person-group> (<year>2021b</year>). <article-title>LightTrack: finding lightweight neural networks for object tracking <italic>via</italic> one-shot architecture search</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-name>IEEE</publisher-name>), <fpage>15180</fpage>&#x02013;<lpage>15189</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR46437.2021.01493</pub-id></citation>
</ref>
<ref id="B61">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yan</surname> <given-names>B.</given-names></name> <name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Lu</surname> <given-names>H.</given-names></name> <name><surname>Yang</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x00027;Skimming-perusal&#x00027; tracking: a framework for real-time and robust long-term tracking</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2385</fpage>&#x02013;<lpage>2393</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00247</pub-id></citation>
</ref>
<ref id="B62">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>B.</given-names></name> <name><surname>Tang</surname> <given-names>M.</given-names></name> <name><surname>Zheng</surname> <given-names>L.</given-names></name> <name><surname>Zhu</surname> <given-names>G.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Feng</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>High-performance discriminative tracking with transformers</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>9856</fpage>&#x02013;<lpage>9865</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00971</pub-id></citation>
</ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>Y.</given-names></name> <name><surname>Xiong</surname> <given-names>Y.</given-names></name> <name><surname>Huang</surname> <given-names>W.</given-names></name> <name><surname>Scott</surname> <given-names>M. R.</given-names></name></person-group> (<year>2020</year>). <article-title>Deformable siamese attention networks for visual object tracking</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6728</fpage>&#x02013;<lpage>6737</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00676</pub-id></citation></ref>
<ref id="B64">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>P.</given-names></name> <name><surname>Qi</surname> <given-names>J.</given-names></name> <name><surname>Lu</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>HAT: hierarchical aggregation transformers for person re-identification</article-title>, in <source>Proceedings of the 29th ACM International Conference on Multimedia</source>, <fpage>516</fpage>&#x02013;<lpage>525</lpage>. <pub-id pub-id-type="doi">10.1145/3474085.3475202</pub-id></citation>
</ref>
<ref id="B65">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Huang</surname> <given-names>B.</given-names></name> <name><surname>Ye</surname> <given-names>Z.</given-names></name> <name><surname>Kuang</surname> <given-names>L.-D.</given-names></name> <name><surname>Ning</surname> <given-names>X.</given-names></name></person-group> (<year>2021</year>). <article-title>Siamese anchor-free object tracking with multiscale spatial attentions</article-title>. <source>Sci. Rep</source>. <volume>11</volume>, <fpage>1</fpage>&#x02013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-021-02095-4</pub-id><pub-id pub-id-type="pmid">34824320</pub-id></citation></ref>
<ref id="B66">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>W.</given-names></name></person-group> (<year>2021</year>). <article-title>A robust lateral tracking control strategy for autonomous driving vehicles</article-title>. <source>Mech. Syst. Signal Process</source>. <volume>150</volume>, <fpage>107238</fpage>. <pub-id pub-id-type="doi">10.1016/j.ymssp.2020.107238</pub-id></citation>
</ref>
<ref id="B67">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Lan</surname> <given-names>C.</given-names></name> <name><surname>Zeng</surname> <given-names>W.</given-names></name> <name><surname>Jin</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name></person-group> (<year>2020a</year>). <article-title>Relation-aware global attention for person re-identification</article-title>, in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3186</fpage>&#x02013;<lpage>3195</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00325</pub-id></citation>
</ref>
<ref id="B68">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Peng</surname> <given-names>H.</given-names></name> <name><surname>Fu</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Hu</surname> <given-names>W.</given-names></name></person-group> (<year>2020b</year>). <article-title>Ocean: object-aware anchor-free tracking</article-title>, in <source>European Conference on Computer Vision</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>771</fpage>&#x02013;<lpage>787</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-58589-1_46</pub-id></citation>
</ref>
<ref id="B69">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>M.</given-names></name> <name><surname>Okada</surname> <given-names>K.</given-names></name> <name><surname>Inaba</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>TrTr: visual tracking with transformer</article-title>. <source>arXiv [Preprint]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2105.03817</pub-id></citation>
</ref>
<ref id="B70">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>Q.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Wu</surname> <given-names>W.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Hu</surname> <given-names>W.</given-names></name></person-group> (<year>2018</year>). <article-title>Distractor-aware siamese networks for visual object tracking</article-title>, in <source>Proceedings of the European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Munich</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>101</fpage>&#x02013;<lpage>117</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-01240-3_7</pub-id></citation>
</ref>
</ref-list>
</back>
</article>