<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2024.1349204</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>In defense of local descriptor-based few-shot object detection</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Zhou</surname> <given-names>Shichao</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2121630/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Haoyan</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Zhuowei</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/2593794/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Zekai</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>Key Laboratory of Information and Communication Systems, Ministry of Information Industry, Beijing Information Science and Technology University</institution>, <addr-line>Beijing</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Yuqi Han, Beijing Institute of Technology, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Fukun Bi, North China University of Technology, China</p>
<p>Jianzhi Hong, Wuhan University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Shichao Zhou <email>sczhou&#x00040;bistu.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>12</day>
<month>02</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>18</volume>
<elocation-id>1349204</elocation-id>
<history>
<date date-type="received">
<day>04</day>
<month>12</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>01</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2024 Zhou, Li, Wang and Zhang.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Zhou, Li, Wang and Zhang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>State-of-the-art image object detection computational models require an intensive parameter fine-tuning stage (using deep convolutional networks, etc.) with tens or hundreds of training examples. In contrast, human intelligence can robustly learn a new concept from just a few instances (i.e., few-shot detection). The distinct perception mechanisms of these two families of systems prompt us to revisit classical handcraft local descriptors (e.g., SIFT, HOG, etc.) as well as non-parametric visual models, which innately require no learning/training phase. Herein, we claim that the inferior performance of these local descriptors mainly results from a lack of global structure sense. To address this issue, we refine local descriptors with spatial contextual attention of neighbor affinities and then embed the local descriptors into a discriminative subspace guided by Kernel-InfoNCE loss. Differing from conventional quantization of local descriptors in high-dimensional feature space or isometric dimension reduction, we seek a brain-inspired few-shot feature representation for the object manifold, which combines data-independent primitive representation and semantic context learning and thus helps with generalization. The obtained embeddings, as pattern vectors/tensors, permit an accelerated but non-parametric visual similarity computation as the decision rule for final detection. Our approach to few-shot object detection is nearly learning-free, and experiments on remote sensing imagery (approximate 2-D affine space) confirm the efficacy of our model.</p></abstract>
<kwd-group>
<kwd>few-shot learning</kwd>
<kwd>local descriptors</kwd>
<kwd>contextual features</kwd>
<kwd>kernel method</kwd>
<kwd>visual similarity</kwd>
</kwd-group>
<counts>
<fig-count count="6"/>
<table-count count="1"/>
<equation-count count="8"/>
<ref-count count="34"/>
<page-count count="10"/>
<word-count count="6330"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Perception Science</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Human intelligence can robustly learn a new concept from just a few instances (Lake et al., <xref ref-type="bibr" rid="B20">2015</xref>). For example, a child can generalize the concept of &#x0201C;airplane&#x0201D; from a single picture in a book. Yet existing supervised machine learning models need large amounts of labeled data and an intensive parameter fine-tuning stage (Hinton and Salakhutdinov, <xref ref-type="bibr" rid="B15">2006</xref>; Lecun et al., <xref ref-type="bibr" rid="B21">2015</xref>). This motivates the setting we are interested in: &#x0201C;few-shot&#x0201D; object detection or localization, which involves searching for objects in a larger target image, given only a few query objects of these categories.</p>
<p>Generally, data augmentation and regularization techniques can alleviate over-fitting in low sample complexity settings for state-of-the-art image object detection computational models (e.g., deep convolutional networks), but do not solve it (Vinyals et al., <xref ref-type="bibr" rid="B32">2016</xref>). Furthermore, a naive but much more practical approach, such as fine-tuning the model on new data, would severely over-fit. Motivated by this degradation in the few-shot setting and inspired by the few-shot learning ability of humans, two recent strategies have made significant progress. One of the strategies is meta-learning, which decomposes training into an auxiliary meta-learning phase where transferable knowledge is learned, resulting in models that, once trained, can &#x0201C;learn&#x0201D; new tasks of this kind with relatively few examples (Huisman et al., <xref ref-type="bibr" rid="B16">2021</xref>). The other approach is metric learning, which employs many instances of known categories to learn an embedding into a metric space where new categories are classified via proximity to the few labeled training examples embedded in the same space (Kaya and Bilge, <xref ref-type="bibr" rid="B18">2019</xref>). In practice, both strategies still rely on large amounts of training samples. For the former, large amounts of training instances are elaborately organized into many meta-tasks, in which the training or support sets consist of several instances. For the latter, the flexible combination and permutation of instance pairs/tuples demanded by metric learning implicitly augment the training sets. Here, we claim that the crucial limitation of the aforementioned methods lies in the over-parameterized nature of the deep models used, which must absorb extensive training examples into their parameters.</p>
<p>In contrast, classical handcraft local descriptors and non-parametric models [e.g., SIFT (Lowe, <xref ref-type="bibr" rid="B24">2004</xref>) and the nearest neighbor classifier (Boiman et al., <xref ref-type="bibr" rid="B3">2008</xref>)] allow novel examples to be rapidly assimilated while not suffering from catastrophic forgetting. Such models have several intriguing advantages that are not shared by most learning-based approaches: (a) they require no training stage (i.e., lazy learning); (b) they avoid over-fitting of model parameters; (c) they can naturally handle a large number of categories, because classes/exemplars can be changed instantaneously.</p>
<p>Despite the aforementioned advantages, the large performance gap between traditional handcraft features and non-parametric models on the one hand, and state-of-the-art deep learning-based approaches on the other, has led to the perception that classical methods are not useful. Here, we claim that the capabilities of classical methods have been undervalued, especially in the few-shot setting. Specifically, the arrangement of local feature descriptors, rather than the descriptors themselves, accounts for the inferior discriminative power, which can be explained by the following two aspects:</p>
<list list-type="order">
<list-item><p>Power-law descriptor distribution gives rise to quantization errors in high-dimensional space. It is well known that densely sampled image local descriptors follow a power-law or heavy-tailed distribution (Boiman et al., <xref ref-type="bibr" rid="B3">2008</xref>), which implies that most descriptors are rather isolated and found in low-density regions of the high-dimensional vector space. Furthermore, such isolated descriptors tend to be informative because they are found in only a few categories and are rare in the others. In contrast, the frequent descriptors appear abundantly, are shared among most of the classes, and thus are the least discriminative for feature representation. In other words, there are almost no intuitive &#x0201C;clusters&#x0201D; in the high-dimensional space from which to group a &#x0201C;visual vocabulary&#x0201D; with k-means-based methods, which consequently degrades descriptor quantization as well as histogram scoring for global image impression.</p></list-item>
<list-item><p>Geometry preservation-based dimension reduction of descriptors does not enhance discriminativity in the few-shot setting. It is widely believed that the dimension reduction of local descriptors is essential for computational tractability and for avoiding over-fitting. However, the few-shot setting is entirely different: there are too few training instances (i.e., sparsity) to form a credible object manifold in high-dimensional feature space. In this case, the geometry preservation-based embedding of local descriptors cannot guarantee feature discriminativity, because groups of local descriptors only compose object instances rather than being the object instances themselves. That is, nothing maximizes the interclass difference or the separability of the embeddings in the resulting low-dimensional space.</p></list-item>
</list>
<p>To address these issues, we incorporate desirable characteristics from both parametric and non-parametric models, namely, rapid acquisition of query examples together with reliable generalization. Previous work on visual similarity in non-parametric setups has been influential on our model (Biswas and Milanfar, <xref ref-type="bibr" rid="B1">2016</xref>). Herein, we propose a remarkably simple local descriptor-based few-shot object detector, which requires a lower training cost. We focus on the context and structure information among local descriptors, which are inherently discriminative in identifying objects. Specifically, we refine local descriptors with a spatial contextual attention of neighbor affinities and then embed the local descriptors into a discriminative subspace guided by Kernel-InfoNCE loss, which permits an accelerated but non-trivial object-specific similarity computation as the decision rule for detection.</p>
<p>This paper is organized as follows. Section 2 briefly reports past works, which can be classified into two categories. Section 3 analyzes our motivations on the modeling of brain-inspired feature representation. Section 4 details the proposed approach. In Section 5, we compare our method with relevant few-shot object detection approaches on real-world datasets, and related analyses are also demonstrated. The conclusion is drawn in Section 6.</p></sec>
<sec id="s2">
<title>2 Related work</title>
<p>CNN-based representation learning methods have driven the improvement of object detectors (Liu et al., <xref ref-type="bibr" rid="B23">2016</xref>; Redmon et al., <xref ref-type="bibr" rid="B27">2016</xref>). Some of the proposed elementary tricks, such as ROI pooling (Girshick, <xref ref-type="bibr" rid="B11">2015</xref>) and multi-scale feature aggregation (Lin et al., <xref ref-type="bibr" rid="B22">2017</xref>), indeed adapt to few-shot settings. However, these methods generally require large amounts of training data because of their over-parameterized and large-scale networks. Here, we summarize two essential paradigms related to solving the aforementioned issue: meta-learning and handcraft feature representations.</p>
<sec>
<title>2.1 Meta-learning</title>
<p>Meta-learning is a quite general learning mechanism interpreted as a &#x0201C;multi-task adaptation process,&#x0201D; which mimics the human capacity of learning to learn. Given base training data (i.e., knowledge of prior tasks) and novel object categories with little supervision to be adapted to, meta-learning aims at a model that simultaneously detects objects from both base and novel domains.</p>
<p>Existing meta-learning methods are further categorized as data augmentation (Shorten and Khoshgoftaar, <xref ref-type="bibr" rid="B29">2019</xref>), metric learning (Wang et al., <xref ref-type="bibr" rid="B33">2019</xref>), and optimization learning (Bohdal et al., <xref ref-type="bibr" rid="B2">2021</xref>). The data augmentation methods learn to generate additional examples for the novel object categories to be accommodated. The metric learning methods train a model to predict whether two instances belong to the same category. The optimization learning approaches specify optimization or loss functions that force faster adaptation of parameters to new categories with few examples.</p>
<p>Following some of the aforementioned meta-learning methods, many researchers have contributed few-shot detection methods that fully exploit training data from base categories while quickly adapting the classical detection framework to predict novel classes (Finn et al., <xref ref-type="bibr" rid="B10">2017</xref>). That is, most methods treat few-shot detection as an extended few-shot classification problem, ignoring the role of features for object localization. Furthermore, the data-hungry property still exists in the meta-learning-based methods because large-scale training samples in both base and novel classes are required, which hinders their application in practical scenarios.</p>
</sec>
<sec>
<title>2.2 Hand-craft feature representations</title>
<p>For classical feature extraction methods, images are often represented by a collection of delicately designed local image descriptors that encode prior knowledge [e.g., SIFT and LARK (Seo and Milanfar, <xref ref-type="bibr" rid="B28">2010</xref>)]. Specifically, these descriptors typically model the local similarity/dis-connectivity of the gray-scale values, motivated by the statistical fact that images are often replete with self-similar patterns as well as abundant edges and corners.</p>
<p>Furthermore, the arrangement of the local descriptors also contributes to the feature discriminativity. For instance, classical &#x0201C;Bag of Words (BoWs)&#x0201D; employed normalized patches or SIFT descriptors over Difference of Gaussian, Harris-scale or Harris-affine keypoints (Mikolajczyk and Schmid, <xref ref-type="bibr" rid="B25">2002</xref>), vector quantized using k-means variants. Grauman and Darrell (<xref ref-type="bibr" rid="B12">2005</xref>) proposed a fast kernel function that maps local descriptors to multi-resolution histograms and computes a weighted histogram intersection in feature representation space. By considering the relative position of descriptors, Biswas and Milanfar (<xref ref-type="bibr" rid="B1">2016</xref>) estimated a low-dimensional subspace where the original high-dimensional descriptors are embedded with their geometry intact.</p></sec>
</sec>
<sec id="s3">
<title>3 Motivation</title>
<p>Compared with meta-learning paradigms, we endorse the handcraft feature representation methods because they encapsulate prior knowledge and are naturally learning-free. However, although contextual information is important, it is not addressed by the biologically implausible BoWs and is obscured by the geometry-preserving low-dimensional embedding, which considers the relative positions of local descriptors rather than entire object instances. As a result, neither the scored histograms nor the low-dimensional embeddings can guarantee the desired discriminativity.</p>
<p>Here, we desire a brain-inspired representation learning mechanism for the challenging few-shot object detection task. It is generally acknowledged that the influence of extrinsic information on the visual representations in the brain increases with its level in the hierarchy (Kruger et al., <xref ref-type="bibr" rid="B19">2013</xref>). This fact inspires us with the dual principles of <italic>reusability and composition</italic>. Two observations support these motivations.</p>
<list list-type="order">
<list-item><p>Reusable and less data-dependent local features. In the visual world, physical objects and scenes decompose naturally into a hierarchy of meaningful and generic parts, which could be described by local features. On the other hand, the notion of the feature itself has been already based upon the reusability assumption that similar attributes will be shared among different entities from scene to scene. These reusable local features would be sufficient to compose the large ensemble of shapes and objects that are in the repertoire of human vision (Jin and Geman, <xref ref-type="bibr" rid="B17">2006</xref>). In addition, we believe that the local features are inherently data-independent because there is no report on any learning or adaptation processes in the retina and also quite some evidence on a high influence of genetic prestructuring for orientation maps in V1 (Kruger et al., <xref ref-type="bibr" rid="B19">2013</xref>).</p></list-item>
<list-item><p>Semantic contexts and compositional representation learning. It is often observed that reliable object detection is notoriously difficult when it relies purely on low-level visual cues (i.e., local features) in a bottom-up inference framework, without the more global contextual constraints that contribute to semantic comprehension. In fact, semantic contexts participate in the compositional representations held by humans, who perceive and organize information as syntactically constrained arrangements of reusable parts. More importantly, we believe that the compositional representation should be learned rather than handcrafted because of its semantic flexibility. This inference is supported by neuroscience research: <italic>learning can alter the visual feature selectivity of neurons, but the measurable changes at the single-cell level induced by learning appear to be much smaller at earlier levels in the visual hierarchy such as V1 compared to later stages such as V4 or IT</italic> (Kruger et al., <xref ref-type="bibr" rid="B19">2013</xref>). Hence, it is perhaps no coincidence that there is an apparent compositional structure in the ventral visual pathways of the more highly evolved visual systems.</p></list-item>
</list></sec>
<sec id="s4">
<title>4 Proposed method</title>
<p>Our brain-inspired few-shot feature representation involves object parsing, understanding, and localization from images. Specific algorithms consist of three aspects:</p>
<list list-type="order">
<list-item><p>Feature representation: extract handcraft local descriptors from image patches;</p></list-item>
<list-item><p>Feature learning: learn contextual information among patches guided by Kernel-InfoNCE loss;</p></list-item>
<list-item><p>Object inference: predict object presence with cosine similarity measure.</p></list-item>
</list>
<p>The core of our algorithm is the first two steps as the inference step is a naive sliding window searching process. Practically, we unify feature representation and learning into a feed-forward hierarchical network that enjoys end-to-end training, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. We first introduce the proposed model (i.e., feature representation) and its training in the few-shot setting and then describe its application to the object inference.</p>
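<p>As a minimal sketch of the inference step only, the sliding-window search with a cosine similarity decision rule might look as follows. The grid layout of embeddings, the function names (<monospace>cosine_similarity</monospace>, <monospace>flatten_window</monospace>, <monospace>detect</monospace>) and the threshold value are illustrative assumptions, not the implementation used in this work.</p>
<preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flatten_window(grid, top, left, h, w):
    """Concatenate the per-patch vectors of an h-by-w window into one vector."""
    return [value
            for r in range(top, top + h)
            for c in range(left, left + w)
            for value in grid[r][c]]


def detect(grid, query_vec, h, w, threshold=0.8):
    """Slide an h-by-w window over a grid of patch embeddings.

    grid:      rows x cols nested list, each cell a patch embedding (assumed layout)
    query_vec: flattened embedding of the query object (length h * w * dim)
    threshold: hypothetical decision threshold on the cosine similarity
    Returns (top, left, score) for every window scoring at or above the threshold.
    """
    rows, cols = len(grid), len(grid[0])
    hits = []
    for top in range(rows - h + 1):
        for left in range(cols - w + 1):
            window = flatten_window(grid, top, left, h, w)
            score = cosine_similarity(window, query_vec)
            if score >= threshold:
                hits.append((top, left, score))
    return hits
```
</preformat>
<p>In this sketch, the decision rule has no trainable parameters: changing the query only changes <monospace>query_vec</monospace>, which mirrors the non-parametric character of the inference stage.</p>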
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Overview of proposed model. We construct a layered model on image patches. In the bottom layer, a group of image patches is categorized as &#x0201C;positive&#x0201D; or &#x0201C;negative.&#x0201D; In the middle layer, HoG features are extracted for each patch. In the top layer, contextual relationships among these features are built with the guidance of Kernel-InfoNCE Loss.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-18-1349204-g0001.tif"/>
</fig>
<sec>
<title>4.1 Patch based representation</title>
<p>Given only a few images (queries) containing objects of interest, we would like to know where the objects of interest lie. Note that the few-shot setting cannot support modern deep neural network training without a pretraining stage. In this case, we employ a patch-based image representation with handcraft local descriptors, as shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. Intuitively, dense sampling produces many more image patches than there are original queries (i.e., implicit data augmentation), which makes a fine-grained visual parsing of the object feasible. Moreover, the handcraft descriptors of local image patches inherently embed prior knowledge, which need not be learned from large amounts of training samples.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Patch-based representation. Patches are obtained from the original image. Each patch is manually labeled as &#x0201C;<italic>pos</italic><sub><italic>i</italic></sub>&#x0201D; or &#x0201C;<italic>neg</italic><sub><italic>i</italic></sub>.&#x0201D; HoG features are extracted for each patch.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-18-1349204-g0002.tif"/>
</fig>
<sec>
<title>4.1.1 Dense sampling and labeling</title>
<p>Assuming that the size of an image is <italic>M</italic>&#x000D7;<italic>N</italic>, we sample a dense grid of patches <bold>X</bold> &#x0003D; {<bold>x</bold><sub>1</sub>, <bold>x</bold><sub>2</sub>, &#x02026;, <bold>x</bold><sub><italic>m</italic></sub>} as the observations. For any image patch <bold>x</bold> &#x02208; &#x0211D;<sup><italic>p</italic></sup>, we allocate a binary label <bold>y</bold> to indicate presence (<bold>y</bold> &#x0003D; 1) or absence (<bold>y</bold> &#x0003D; 0) of the object. The corresponding label group <bold>Y</bold> &#x0003D; {<bold>y</bold><sub>1</sub>, <bold>y</bold><sub>2</sub>, &#x02026;, <bold>y</bold><sub><italic>m</italic></sub>} carries the information of global object presence. The relationship between <bold>X</bold> and <bold>Y</bold> is modeled by the conditional probability <italic>p</italic>(<bold>Y</bold>|<bold>X</bold>).</p>
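<p>The sampling step, together with the IoU-based labeling rule detailed in the next paragraph, can be sketched as follows. This is only an illustration under stated assumptions: square patches, a fixed stride, a single bounding-box annotation, and the helper names <monospace>dense_grid</monospace>, <monospace>iou</monospace> and <monospace>label_patches</monospace> are all hypothetical.</p>
<preformat>
```python
def dense_grid(height, width, patch, stride):
    """Top-left corners (row, col) of every square patch that fits in the image."""
    return [(r, c)
            for r in range(0, height - patch + 1, stride)
            for c in range(0, width - patch + 1, stride)]


def iou(box_a, box_b):
    """Intersection over union of boxes (top, left, bottom, right), ends exclusive."""
    inter_h = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_w = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_h * inter_w
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_patches(height, width, patch, stride, object_box, t):
    """Binary label per patch: 1 if its IoU with the annotated box exceeds t."""
    labels = []
    for r, c in dense_grid(height, width, patch, stride):
        patch_box = (r, c, r + patch, c + patch)
        labels.append(1 if iou(patch_box, object_box) > t else 0)
    return labels
```
</preformat>
<p>For an 8&#x000D7;8 image with 4&#x000D7;4 patches and stride 4, <monospace>dense_grid</monospace> yields four patches, so a single query image already provides several labeled observations.</p>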
<p>With the densely sampled image patches, we can then utilize the conventional intersection over union (IoU) to quantify whether or not a patch (partially) covers the object for the following supervised learning stage. Specifically, given pixel- or bounding box-based annotations, we label a patch as positive if its IoU is greater than a threshold <italic>t</italic>. In this way, we can obtain a group of binary patch-based label masks as well as latent contextual information (illustrated in the next subsection) from each query image, and thus, the limited queries are fully utilized.</p></sec>
<sec>
<title>4.1.2 Data-independent feature representation</title>
<p>In our current implementation, the handcraft feature descriptor for representing the raw image patch was chosen to be HoG (Dalal and Triggs, <xref ref-type="bibr" rid="B8">2005</xref>) without loss of generality. This type of descriptor captures local texture information by calculating gradient histograms (i.e., gradient direction and intensity of local regions). In fact, the essential gradient-like computations mimic the function of luminance-sensitive cells with a center-surround receptive field, which emphasizes spatial change in luminance. Notably, this type of transformation into a representation emphasizing spatial change is performed at a very early stage, immediately following the receptor level, before any other visual processing takes place (Kruger et al., <xref ref-type="bibr" rid="B19">2013</xref>). Hence, we advocate this data-independent and universal feature representation for the few-shot setting.</p>
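<p>To make the gradient-histogram idea concrete, the following is a deliberately simplified sketch: a single-cell, unsigned orientation histogram with central differences and L2 normalization. It omits the cell/block structure and interpolation of the full HoG descriptor, and the function name <monospace>orientation_histogram</monospace> and the choice of nine bins are assumptions made for illustration.</p>
<preformat>
```python
import math


def orientation_histogram(patch, bins=9):
    """Unsigned gradient-orientation histogram of a 2-D grayscale patch.

    patch: list of rows of pixel intensities (at least 3 x 3).
    Each interior pixel votes with its gradient magnitude into the bin of its
    gradient direction, folded into the range [0, 180) degrees.
    """
    rows, cols = len(patch), len(patch[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]  # horizontal central difference
            gy = patch[r + 1][c] - patch[r - 1][c]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle / bin_width), bins - 1)  # guard the upper edge
            hist[index] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist
```
</preformat>
<p>Nothing in this computation depends on training data, which is the sense in which such a descriptor is data-independent.</p>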
</sec>
</sec>
<sec>
<title>4.2 Context learning with Kernel-InfoNCE loss</title>
<p>Patch-based representations <bold>x</bold><sub><italic>i</italic></sub> or their naive cascades <bold>x</bold> usually contain only local information about the objects, resulting in semantic ambiguities. The semantics or visual grammars are inherently discriminative cues for object detection and recognition. Thus, a further consideration of context information (i.e., the compositional structure) among patch-based representations is necessary. More importantly, while prior knowledge about image statistics points to the usefulness of gradient-like computations at the patch representation stage, there is no similar prior knowledge that would allow us to design sensible transformations for the subsequent processing stages corresponding to the deeper levels of the hierarchical visual cortex. Hence, we argue that deep learning is one of the few tractable ways to enable the computational model to obtain the context information. <xref ref-type="fig" rid="F3">Figure 3</xref> gives an overview of the network model.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Context-aware learning. We constructed a network model, guided by the Kernel-InfoNCE loss, to learn the contextual relationships of patches.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-18-1349204-g0003.tif"/>
</fig>
<sec>
<title>4.2.1 Kernel-based context representation</title>
<p>Inspired by the well-established Reproducing Kernel Hilbert Space (RKHS) theory, we use a kernel-based matrix to represent the contextual information between any two local descriptors within the global image. Theoretically, there is a nice duality between inner products of (deep) feature representations and kernels. This duality can be utilized to refine neural network modules using kernels and vice versa (Rasmussen, <xref ref-type="bibr" rid="B26">2003</xref>).</p>
<p>Our specific implementation involves a group of <italic>n</italic> local descriptors <bold>X</bold> &#x0003D; {<bold>x</bold><sub>1</sub>, <bold>x</bold><sub>2</sub>, &#x02026;, <bold>x</bold><sub><italic>m</italic></sub>} within the handcraft feature space. For these descriptors, we can construct a kernel matrix <bold>K</bold><sub><bold>X</bold></sub>, in which <bold>K</bold><sub><italic>i, j</italic></sub> &#x0003D; <bold>K</bold>(<italic>i, j</italic>) denotes the probability of <bold>x</bold><sub><italic>i</italic></sub> and <bold>x</bold><sub><italic>j</italic></sub> being semantically relevant, as defined in the initial data annotation step. Here, we adopt the most conventional translation-invariant kernel function <inline-formula><mml:math id="M1"><mml:mi>k</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0211D;</mml:mi></mml:math></inline-formula>, where <italic>k</italic>(<bold>x</bold><sub><italic>i</italic></sub>, <bold>x</bold><sub><italic>j</italic></sub>) is equivalent to <inline-formula><mml:math id="M2"><mml:msup><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x022C6;</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> for <inline-formula><mml:math id="M3"><mml:msup><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x022C6;</mml:mo></mml:mrow></mml:msup><mml:mo>:</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0211D;</mml:mi></mml:math></inline-formula>. The classical Moore&#x02013;Aronszajn theorem states that if <italic>k</italic> is a symmetric, positive definite kernel on <inline-formula><mml:math id="M4"><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow></mml:math></inline-formula>, there is a unique Hilbert space <inline-formula><mml:math id="M5"><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>H</mml:mi></mml:mstyle></mml:mrow></mml:math></inline-formula> of functions on <inline-formula><mml:math id="M6"><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow></mml:math></inline-formula> for which <italic>k</italic> is a reproducing kernel. Note that it is almost impossible to calculate the semantic relevance of two descriptors using a predefined reproducing kernel in the low-level feature space because the implicitly defined feature mapping would not ideally align with the semantic meanings of specific tasks. Thus, we need to further refine the local descriptors so as to ensure that the adopted translation-invariant kernel function <italic>k</italic><sup>&#x022C6;</sup> is still sufficient for representation, learning, and inference.</p>
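<p>For illustration, a Gram matrix built from a translation-invariant kernel can be sketched as below. The Gaussian kernel is only one common example of such a kernel and is an assumption here, as are the bandwidth <monospace>sigma</monospace> and the function names; <monospace>label_affinity</monospace> is a hypothetical way of deriving an annotation-based target matrix in which two patches count as relevant when they share a label.</p>
<preformat>
```python
import math


def gaussian_kernel(x_i, x_j, sigma=1.0):
    """Translation-invariant kernel: depends only on the difference x_i - x_j."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(descriptors, sigma=1.0):
    """Kernel (Gram) matrix with entry (i, j) equal to k(x_i, x_j)."""
    return [[gaussian_kernel(x_i, x_j, sigma) for x_j in descriptors]
            for x_i in descriptors]


def label_affinity(labels):
    """Hypothetical target matrix: 1.0 where two patches share a label, else 0.0."""
    return [[1.0 if y_i == y_j else 0.0 for y_j in labels] for y_i in labels]
```
</preformat>
<p>The Gram matrix produced this way is symmetric with ones on the diagonal, which makes it easy to compare entry by entry with an annotation-derived target of the same shape.</p>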
<p>Based on the established kernel, we aim to learn a deep feature mapping <inline-formula><mml:math id="M7"><mml:mi>f</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>Z</mml:mi></mml:mstyle></mml:mrow></mml:math></inline-formula>. Denote <bold>z</bold> &#x0225C; <italic>f</italic>(<bold>x</bold>) as the deep embedding of <bold>x</bold>, such that the induced Gram matrix <bold>K</bold><sub><bold>Z</bold></sub>, which represents the semantic relevance of the embeddings <bold>z</bold>, approximates the original <bold>K</bold><sub><bold>X</bold></sub> as closely as possible. Here, one can see that the adopted kernel, which represents contextual information, guides the learning of the deep feature embeddings.</p></sec>
<sec>
<title>4.2.2 Kernel-InfoNCE loss</title>
<p>In this subsection, we follow the framework of kernel-based contrastive learning with Markov random fields (MRFs) (Van Assel et al., <xref ref-type="bibr" rid="B31">2022</xref>; Tan et al., <xref ref-type="bibr" rid="B30">2023</xref>). A full comparison between <bold>K</bold><sub><bold>X</bold></sub> and <bold>K</bold><sub><bold>Z</bold></sub> may be difficult because few object samples are available in our few-shot setting. Consequently, we instead compare the MRFs of the two kernels <bold>K</bold><sub><bold>X</bold></sub> and <bold>K</bold><sub><bold>Z</bold></sub>. Each MRF induces a probability distribution over unweighted directed subgraphs on the local descriptors, denoted as <bold>W</bold><sub><bold>X</bold></sub> and <bold>W</bold><sub><bold>Z</bold></sub>, respectively (Van Assel et al., <xref ref-type="bibr" rid="B31">2022</xref>). The cross-entropy loss between <bold>W</bold><sub><bold>X</bold></sub> and <bold>W</bold><sub><bold>Z</bold></sub> is then minimized to push <bold>K</bold><sub><bold>Z</bold></sub> toward <bold>K</bold><sub><bold>X</bold></sub>. Unlike these works, we manually specify the kernel matrix <bold>K</bold><sub><bold>X</bold></sub> (i.e., the ground truth) for our few-shot, supervised learning scenario.</p>
<p>The Kernel-InfoNCE loss, a variant of the InfoNCE loss (Chen et al., <xref ref-type="bibr" rid="B5">2020</xref>), is defined as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M8"><mml:mrow><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mtext>Kernel-InfoNCE&#x000A0;</mml:mtext></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>log</mml:mi><mml:mfrac><mml:mrow><mml:mi>k</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mrow><mml:mi>k</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>where <italic>k</italic>(<bold>x</bold>, <bold>x</bold><sub><italic>i</italic></sub>) denotes the kernel function evaluated between <bold>x</bold> and the descriptor <bold>x</bold><sub><italic>i</italic></sub>. Technically, the Kernel-InfoNCE loss steers low-level descriptors toward a feature space in which positive pairs are grouped together and kept away from negative ones. More importantly, the loss accounts for the relationships between sample pairs rather than for individual instances alone.</p>
<p>Furthermore, MRFs are employed to represent the (partial) kernel matrix in a statistical manner. Specifically, each MRF defines a probability distribution over unweighted directed subgraphs of the kernel matrix, where the set of admissible subgraphs is defined in <xref ref-type="disp-formula" rid="E2">Equation 2</xref>:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M9"><mml:mrow><mml:msub><mml:mi>S</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle></mml:msub><mml:mo>&#x0225C;</mml:mo><mml:mo>&#x0007B;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mo>&#x0007B;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x0007D;</mml:mo></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msup><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x02200;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:math></disp-formula>
<p>The probability of a randomly sampled subgraph (i.e., local descriptors), <italic>P</italic>(<bold>W</bold>; <bold>K</bold><sub><bold>X</bold></sub>), is proportional to</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M10"><mml:mrow><mml:msub><mml:mtext>&#x003A0;</mml:mtext><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mi>n</mml:mi><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:msub><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>K</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle></mml:msub><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup></mml:mrow></mml:math></disp-formula>
<p>where the product in <xref ref-type="disp-formula" rid="E3">Equation 3</xref> represents the likelihood of the subgraph being sampled. In this case, the Kernel-InfoNCE loss can be recast as the classical cross-entropy between <bold>W</bold><sub><bold>X</bold></sub> and <bold>W</bold><sub><bold>Z</bold></sub>, which is defined as follows:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M11"><mml:mrow><mml:msub><mml:mstyle mathvariant='script' mathsize='normal'><mml:mi>H</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mstyle mathsize='normal'><mml:mtext>K</mml:mtext></mml:mstyle><mml:mtext>X</mml:mtext></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>[</mml:mo><mml:mi>log</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>P</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>K</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:math></disp-formula>
<p>Owing to the independence of each row <bold>W</bold><sub><italic>i</italic></sub>, <xref ref-type="disp-formula" rid="E4">Equation 4</xref> can be further simplified as</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M12"><mml:mrow><mml:msub><mml:mstyle mathvariant='script' mathsize='normal'><mml:mi>H</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mstyle mathvariant='normal'><mml:mtext>K</mml:mtext></mml:mstyle><mml:mtext>X</mml:mtext></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mrow><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:mo stretchy='false'>[</mml:mo><mml:mi>log</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>P</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo 
stretchy='false'>)</mml:mo><mml:mo>;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>K</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:math></disp-formula>
<p>We define <bold>P</bold>(<bold>W</bold><sub><bold>Z</bold></sub>(<italic>i, j</italic>) &#x0003D; 1) as the probability of node <italic>i</italic> pointing to node <italic>j</italic>, i.e., the semantic relevance between samples <italic>i</italic> and <italic>j</italic> as represented by the kernel matrix. Since <bold>W</bold><sub><bold>X</bold></sub>(<italic>i</italic>, &#x000B7;) and <bold>W</bold><sub><bold>Z</bold></sub>(<italic>i</italic>, &#x000B7;) can have multiple non-zero elements, this probability is no longer binary but is given by the ratio of <bold>K</bold><sub><bold>Z</bold></sub>(<italic>i, j</italic>) to the sum of the weights of all outgoing edges of node <italic>i</italic>. In this context, the cross-entropy in <xref ref-type="disp-formula" rid="E5">Equation 5</xref> can be rewritten as follows:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M13"><mml:mrow><mml:msub><mml:mstyle mathvariant='script' mathsize='normal'><mml:mi>H</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mstyle mathvariant='normal'><mml:mtext>K</mml:mtext></mml:mstyle><mml:mtext>X</mml:mtext></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:munder><mml:mrow><mml:mover><mml:mo>&#x02211;</mml:mo><mml:mi>n</mml:mi></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>P</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mi>log</mml:mi><mml:mfrac><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>K</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:msub><mml:mo>&#x02211;</mml:mo><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>K</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle></mml:msub><mml:mo 
stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
<p>One can see that the right-hand side of <xref ref-type="disp-formula" rid="E6">Equation 6</xref> has the form of the Kernel-InfoNCE loss defined in <xref ref-type="disp-formula" rid="E1">Equation 1</xref>. Technically, this formula indicates that the procedure first samples the augmented pairs (<italic>i, j</italic>) for each row <italic>i</italic> with probability <bold>P</bold>(<bold>W</bold><sub><bold>X</bold></sub>(<italic>i, j</italic>) &#x0003D; 1) and then optimizes the classical InfoNCE loss so as to push <bold>K</bold><sub><bold>Z</bold></sub> toward <bold>K</bold><sub><bold>X</bold></sub> through the deep feature mapping <inline-formula><mml:math id="M14"><mml:mi>f</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>Z</mml:mi></mml:mstyle></mml:mrow></mml:math></inline-formula>.</p>
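<p>As an illustration only, the cross-entropy in <xref ref-type="disp-formula" rid="E6">Equation 6</xref> can be sketched in a few lines of NumPy. The sketch rests on several assumptions that are not stated in the text: the target probabilities <bold>P</bold>(<bold>W</bold><sub><bold>X</bold></sub>(<italic>i, j</italic>) &#x0003D; 1) are obtained by row-normalizing <bold>K</bold><sub><bold>X</bold></sub> with a zeroed diagonal, the denominator excludes the self-term <italic>k</italic> &#x0003D; <italic>i</italic>, and <bold>K</bold><sub><bold>Z</bold></sub> is a Gaussian Gram matrix on the embeddings. The function names and the toy label-based <bold>K</bold><sub><bold>X</bold></sub> are hypothetical.</p>
<preformat>
```python
import numpy as np

def row_normalize_offdiag(K):
    """Zero the diagonal (W(i, i) = 0) and normalize each row to sum to 1."""
    K = np.array(K, dtype=float)
    np.fill_diagonal(K, 0.0)
    return K / K.sum(axis=1, keepdims=True)

def kernel_infonce(K_X, K_Z, eps=1e-12):
    """Cross-entropy between the row-wise edge distributions of K_X and K_Z."""
    P_X = row_normalize_offdiag(K_X)  # assumed form of P(W_X(i, j) = 1)
    P_Z = row_normalize_offdiag(K_Z)  # K_Z(i, j) / sum over k != i of K_Z(i, k)
    return float(-(P_X * np.log(P_Z + eps)).sum())

def gaussian_gram(Z, tau=1.0):
    """Illustrative Gaussian Gram matrix on embeddings Z of shape (n, d)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * tau ** 2))

# Toy example: K_X marks descriptors with the same (hypothetical) label.
labels = np.array([0, 0, 1, 1])
K_X = (labels[:, None] == labels[None, :]).astype(float)
Z_good = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z_bad = Z_good[[0, 2, 1, 3]]  # embeddings that mix the two groups
loss_good = kernel_infonce(K_X, gaussian_gram(Z_good))
loss_bad = kernel_infonce(K_X, gaussian_gram(Z_bad))
print(loss_good, loss_bad)
```
</preformat>
<p>In this toy setting, embeddings that respect the grouping encoded in <bold>K</bold><sub><bold>X</bold></sub> yield a lower loss than embeddings that mix the groups, which is the behavior the loss is meant to encourage.</p>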
</sec>
</sec>
<sec>
<title>4.3 Global inference leveraging cosine similarity measure</title>
<p>Given the context-embedded deep features <bold>Z</bold> &#x0225C; <italic>f</italic>(<bold>X</bold>) of the queries, we can localize similar objects within a complete image, a process we refer to as global inference. To comprehensively evaluate the obtained deep features, an exhaustive sliding-window scan is employed to predict object presence without any ROI or saliency detection preprocessing. <xref ref-type="fig" rid="F4">Figure 4</xref> gives an example of global inference.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Context-aware learning. We constructed a network model, guided by the Kernel-InfoNCE loss, to learn the contextual relationships of patches.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-18-1349204-g0004.tif"/>
</fig>
<p>A plain cosine similarity measure (i.e., a normalized inner product) is adopted to quantify the visual similarity between two deep features within each sliding window. Our implementation involves two feature matrices <bold>Z</bold><sub><italic>Q</italic></sub> and <bold>Z</bold><sub><italic>T</italic><sub><italic>i</italic></sub></sub>, obtained via the deep mapping <italic>f</italic>, which represent a query sample and the <italic>i</italic>th window, respectively. Mathematically, the visual similarity between <bold>Z</bold><sub><italic>Q</italic></sub> and <bold>Z</bold><sub><italic>T</italic><sub><italic>i</italic></sub></sub> is defined as follows:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M15"><mml:mrow><mml:mi>&#x003C1;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>Q</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x0003C;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>Q</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mo>&#x0003E;</mml:mo><mml:mi>F</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>trace</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:msubsup><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>Q</mml:mi><mml:mi>T</mml:mi></mml:msubsup><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>Q</mml:mi></mml:msub><mml:msub><mml:mo>&#x02016;</mml:mo><mml:mi>F</mml:mi></mml:msub><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mo>&#x02016;</mml:mo><mml:mi>F</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x02003;</mml:mtext><mml:mi>&#x003F5;</mml:mi><mml:mo 
stretchy='false'>[</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>]</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle><mml:mrow><mml:mo>[</mml:mo></mml:mrow></mml:mstyle><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>Q</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mstyle><mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula>, <inline-formula><mml:math id="M17"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle><mml:mrow><mml:mo>[</mml:mo></mml:mrow></mml:mstyle><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle 
mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>Z</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mstyle><mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula>. When we focus on each column vector <italic>z</italic>, <xref ref-type="disp-formula" rid="E7">Equation 7</xref> can be rewritten as follows:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M18"><mml:mrow><mml:msub><mml:mi>&#x003C1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>Q</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mrow><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>z</mml:mi><mml:mi>q</mml:mi><mml:mi>n</mml:mi></mml:msubsup><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mi>n</mml:mi></mml:msubsup></mml:mrow><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>Q</mml:mi></mml:msub><mml:msub><mml:mo>&#x02016;</mml:mo><mml:mi>F</mml:mi></mml:msub><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mo>&#x02016;</mml:mo><mml:mi>F</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mi>&#x003C1;</mml:mi></mml:mstyle><mml:mo 
stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>q</mml:mi><mml:mi>n</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mi>q</mml:mi><mml:mi>n</mml:mi></mml:msubsup><mml:mo>&#x02016;</mml:mo><mml:mo>&#x02016;</mml:mo><mml:msubsup><mml:mi>z</mml:mi><mml:mrow><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mi>n</mml:mi></mml:msubsup><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mi>Q</mml:mi></mml:msub><mml:msub><mml:mo>&#x02016;</mml:mo><mml:mi>F</mml:mi></mml:msub><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>Z</mml:mi></mml:mstyle><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mo>&#x02016;</mml:mo><mml:mi>F</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:math></disp-formula>
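<p>The equivalence between <xref ref-type="disp-formula" rid="E7">Equation 7</xref> and <xref ref-type="disp-formula" rid="E8">Equation 8</xref> can be checked numerically. The following NumPy sketch is a minimal illustration under the assumption that each feature matrix stores one descriptor per column; the function names are hypothetical and the random matrices merely stand in for real deep features.</p>
<preformat>
```python
import numpy as np

def frobenius_cosine(Z_q, Z_t):
    """Equation 7: trace(Z_q^T Z_t) / (||Z_q||_F ||Z_t||_F)."""
    num = np.trace(Z_q.T @ Z_t)
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return float(num / (np.linalg.norm(Z_q) * np.linalg.norm(Z_t)))

def columnwise_cosine(Z_q, Z_t):
    """Equation 8: norm-weighted sum of per-column cosine similarities."""
    denom = np.linalg.norm(Z_q) * np.linalg.norm(Z_t)
    total = 0.0
    for z_q, z_t in zip(Z_q.T, Z_t.T):
        nq, nt = np.linalg.norm(z_q), np.linalg.norm(z_t)
        if nq == 0.0 or nt == 0.0:
            continue  # a zero column contributes nothing to the sum
        rho_col = float(z_q @ z_t) / (nq * nt)  # cosine of one column pair
        total += rho_col * nq * nt / denom      # weight by relative norms
    return total

rng = np.random.default_rng(0)
Z_query = rng.normal(size=(16, 9))   # stand-in for the query features
Z_window = rng.normal(size=(16, 9))  # stand-in for one window's features
rho_matrix = frobenius_cosine(Z_query, Z_window)
rho_columns = columnwise_cosine(Z_query, Z_window)
print(rho_matrix, rho_columns)
```
</preformat>
<p>Both functions return the same value up to floating-point error, which reflects that the matrix-level similarity is a weighted average of column-level cosine similarities, with weights given by the relative column norms.</p>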
<p>Based on the cosine similarity measure and the sliding-window scan, we obtain a confidence map in which each element indicates the likelihood of object presence, and we then place bounding boxes at the high-confidence regions. To reduce false alarms, two simple tricks are used in the current implementation. First, all likelihood values in the confidence map are rescaled with the Lawley-Hotelling trace statistic <inline-formula><mml:math id="M19"><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:mfrac></mml:math></inline-formula> (Calinski et al., <xref ref-type="bibr" rid="B4">2006</xref>), which suppresses the small correlation values of <xref ref-type="disp-formula" rid="E7">Equation 7</xref> or <xref ref-type="disp-formula" rid="E8">Equation 8</xref>. Second, conventional non-maximum suppression is applied to eliminate redundant bounding boxes.</p></sec></sec>
<sec id="s5">
<title>5 Experimental setup, results and discussions</title>
<p>To evaluate the effectiveness and robustness of the proposed method, we compare it with three other handcrafted (few-shot) feature representation methods or detectors: LARK-PCA (Seo and Milanfar, <xref ref-type="bibr" rid="B28">2010</xref>), LARK-LPP (Biswas and Milanfar, <xref ref-type="bibr" rid="B1">2016</xref>) and sparse codes with LLE variants (hereafter called SMT) (Chen et al., <xref ref-type="bibr" rid="B6">2018</xref>, <xref ref-type="bibr" rid="B7">2022</xref>). This choice yields a relatively fair comparison because all of these methods use only a few queries for feature learning rather than relying on large amounts of annotated samples (i.e., base classes) for model pre-training. Notably, limited access to queries and their efficient use match the practical demands of few-shot learning and object detection.</p>
<p>Our experiments were conducted on a high-performance server with the following configuration: Intel Gold 6330 CPU &#x00040; 2.00 GHz, NVIDIA RTX 3090 GPU, and 24 GB of RAM. To fully leverage the computational power and enable efficient programming, we used the PyTorch framework with GPU acceleration.</p>
<sec>
<title>5.1 Experimental setup</title>
<sec>
<title>5.1.1 Benchmark</title>
<p>We conduct the experiments on the Levir dataset (Zou and Shi, <xref ref-type="bibr" rid="B34">2017</xref>), from which 414 remote sensing images containing ocean-going ships are selected. This choice is a deliberate trade-off between complicated real-world scenes and synthetic ones. Technically, remote sensing imagery approximates a 2-D affine plane in which depth variation is largely suppressed, which makes it less complicated than a natural scene. At the same time, the scenes with ocean-going ships contain intractable texture clutter (e.g., ship trajectories and waves), which makes them more challenging than synthetic data. Practically, as noted by Deng et al. (<xref ref-type="bibr" rid="B9">2021</xref>) and Han et al. (<xref ref-type="bibr" rid="B13">2022</xref>), there is an urgent need to understand visual data collected from aerial platforms.</p>
<p>Only 21 randomly sampled images (i.e., objects) from the aforementioned collection were treated as queries, and the remaining images were designated as target images for model testing. Among the queries, we randomly selected 20 for context learning and used the remaining one to generate the object template. For training, 4,335 small patches (32 &#x000D7; 32 pixels) were sampled with a stride of 10 pixels. Patches with an IoU above 0.3 with the target object were labeled as positive samples, while those below this threshold were treated as negative ones. We chose such a small IoU threshold because we place more emphasis on patches with salient grayscale gradients. In addition, this setting enables the proposed model to learn to separate the object from the background.</p></sec>
<sec>
<title>5.1.2 Evaluation metrics</title>
<p>The precision-recall (P-R) curve, which plots precision against recall at various confidence thresholds, is used to evaluate the quantitative performance of each candidate few-shot detector. In addition, we report the equal error rate (EER), the point on the P-R curve at which recall equals precision, as a single-value indicator of the accuracy and reliability of each candidate detector.</p></sec>
<sec>
<title>5.1.3 Parameter setting</title>
<p>To establish the deep feature embedding <inline-formula><mml:math id="M20"><mml:mi>f</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>X</mml:mi></mml:mstyle></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mrow><mml:mstyle mathvariant="script"><mml:mi>Z</mml:mi></mml:mstyle></mml:mrow></mml:math></inline-formula>, we employ a conventional ResNet-18 network (He et al., <xref ref-type="bibr" rid="B14">2016</xref>) without any pretraining, whose weights are optimized with the LARS optimizer. We empirically set the learning rate to 1 &#x000D7; 10<sup>&#x02212;3</sup>, the momentum to 0.9, and the weight decay to 1 &#x000D7; 10<sup>&#x02212;6</sup>. Training followed an adaptive schedule over a total of 50 epochs. The inputs to the model are HoG features of the patches, and the size of the HoG features is 32 &#x000D7; 32.</p>
</sec>
</sec>
<sec>
<title>5.2 Results and discussions</title>
<sec>
<title>5.2.1 P-R curve and EER evaluation</title>
<p><xref ref-type="fig" rid="F5">Figure 5</xref> shows that the P-R curve generated by our method is better than those of the other methods. Additionally, as shown in <xref ref-type="table" rid="T1">Table 1</xref>, our method achieves the highest EER value of 0.692, outperforming the three other methods: SMT at 0.521, LARK-LPP at 0.369, and LARK-PCA at 0.282. This result mainly stems from the capacity to learn deeper contextual information about the target. More importantly, all of the candidate detectors share similar local descriptors (handcrafted LARK, HoG or adaptive sparse codes), which indicates that traditional local descriptors are actually not that &#x0201C;bad,&#x0201D; and that context information is crucial for feature discriminability and cannot be neglected.</p>
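<p>The EER values discussed above correspond to the point on the P-R curve where precision equals recall. A minimal sketch of how such a value could be read off a ranked list of detections is given below. The function names, the optional ground-truth count <italic>n_gt</italic>, and the toy scores are illustrative assumptions and do not reproduce the evaluation code used in the experiments.</p>
<preformat>
```python
import numpy as np

def pr_curve(scores, labels, n_gt=None):
    """Precision and recall after each detection, ranked by confidence."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(labels, dtype=float)[order]  # 1 = true positive
    if n_gt is None:
        n_gt = hits.sum()  # assumes every object appears among detections
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / n_gt
    return precision, recall

def equal_error_rate(precision, recall):
    """Value at the curve point where precision and recall are closest."""
    idx = int(np.argmin(np.abs(precision - recall)))
    return float(0.5 * (precision[idx] + recall[idx]))

# Toy detections: confidence scores and whether each one hit an object.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_precision, toy_recall = pr_curve(toy_scores, toy_labels, n_gt=3)
toy_eer = equal_error_rate(toy_precision, toy_recall)
print(toy_eer)
```
</preformat>
<p>With a finite number of detections the curve is sampled at discrete points, so the sketch returns the point where the gap between precision and recall is smallest rather than an exact intersection.</p>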
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>P-R curve. The values, at the intersection points of the dotted line with slope 1 and the P-R curve, represent the EER of the four evaluated methods.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-18-1349204-g0005.tif"/>
</fig>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Comparative results of the four methods.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th/>
<th valign="top" align="center"><bold>Number of queries</bold></th>
<th valign="top" align="center"><bold>Number of targets</bold></th>
<th valign="top" align="center"><bold>EER</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Proposed method</td>
<td valign="top" align="center">21</td>
<td valign="top" align="center">393</td>
<td valign="top" align="center">0.692</td>
</tr> <tr>
<td valign="top" align="left">SMT</td>
<td valign="top" align="center">21</td>
<td valign="top" align="center">393</td>
<td valign="top" align="center">0.521</td>
</tr> <tr>
<td valign="top" align="left">LARK-LPP</td>
<td valign="top" align="center">12</td>
<td valign="top" align="center">393</td>
<td valign="top" align="center">0.369</td>
</tr> <tr>
<td valign="top" align="left">LARK-PCA</td>
<td valign="top" align="center">12</td>
<td valign="top" align="center">393</td>
<td valign="top" align="center">0.282</td>
</tr></tbody>
</table>
</table-wrap>
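<p>The EER values in <xref ref-type="table" rid="T1">Table 1</xref> are read off the P-R curves in <xref ref-type="fig" rid="F5">Figure 5</xref> at the point where precision equals recall. A minimal sketch of this reading is given below, assuming the curve is available as sampled (recall, precision) pairs ordered along the curve and that linear interpolation between neighboring samples is acceptable. The function name <monospace>equal_error_rate</monospace> and the interpolation scheme are illustrative assumptions, not the evaluation code used in the experiments.</p>
<preformat><![CDATA[
```python
def equal_error_rate(recall, precision):
    """Return the value at which a sampled P-R curve meets precision == recall.

    Samples must be ordered along the curve (e.g. by decreasing threshold).
    Returns None if the sampled curve never reaches the diagonal.
    """
    if len(recall) != len(precision):
        raise ValueError("recall and precision must have the same length")
    points = list(zip(recall, precision))
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        d0 = p0 - r0  # signed gap to the diagonal at the first sample
        d1 = p1 - r1  # signed gap to the diagonal at the next sample
        if d0 == 0:
            return r0
        if d0 * d1 < 0:
            # The gap changes sign: interpolate linearly to where it is zero.
            t = d0 / (d0 - d1)
            return r0 + t * (r1 - r0)
    if points and points[-1][0] == points[-1][1]:
        return points[-1][0]
    return None
```
]]></preformat>
<p>Under this reading, a higher value means that precision and recall are balanced at a higher operating point, which is why a larger EER in <xref ref-type="table" rid="T1">Table 1</xref> indicates a better detector.</p>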
</sec>
<sec>
<title>5.2.2 Robustness to angles</title>
<p><xref ref-type="fig" rid="F6">Figure 6</xref> compares our method (<xref ref-type="fig" rid="F6">Figure 6A</xref>) with SMT (<xref ref-type="fig" rid="F6">Figure 6B</xref>). Both methods use the same number of query images. The comparison shows that our approach, even with a single image for generating the object template, detects targets across a broader range of angles. In contrast, SMT struggles with large angle changes, as shown in the third row of <xref ref-type="fig" rid="F6">Figure 6B</xref>. We attribute the effectiveness of our method to its modeling of the contextual relationships between handcrafted features, which enables the model to optimize spatial relationships and adapt more efficiently to complex angular variations.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Example detections of the two methods. <bold>(A)</bold> Results of the proposed method. <bold>(B)</bold> Results of SMT.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-18-1349204-g0006.tif"/>
</fig>
</sec></sec></sec>
<sec id="s6">
<title>6 Conclusion and future work</title>
<p>In this article, we have used classical handcrafted local descriptors, which can lay a foundation for learning-insensitive, effective, and efficient few-shot object detection. Given only a few training samples of the query targets, effectively localizing similar regions in an image is a tough task for inherently data-driven DNNs. To address this concern, we have resorted to a brain-inspired and biologically plausible computational model.</p>
<p>Handcrafted local descriptors, conventionally adopted for encoding image content, embed expert visual knowledge and thus need no fine-tuning. However, arranging them manually without sacrificing their discriminative power is not straightforward. To address this issue, we have studied kernel-guided spatial context feature learning (inherently discriminative), which combines handcrafted local descriptors with global semantic relevance. Our experimental results with the HoG descriptor show that, by being aware of global structure, Kernel-InfoNCE-guided context learning improves detection over PCA/LPP (with LARK descriptors) and SMT (with sparse codes). That is, the classical local descriptor is actually not that &#x0201C;bad,&#x0201D; and context information is crucial to feature discriminability and cannot be neglected.</p>
<p>Our future work involves &#x0201C;adaptive&#x0201D; context learning that combines the present kernel method with a &#x0201C;white box&#x0201D; deep embedding that unrolls visual grammar. First, it would be more reasonable to formulate the kernel matrix/graph, and in particular its edge weight assignments, explicitly rather than leaving them implicitly determined by the data augmentation process. Alternatively, we will explore an explainable (deep) network layer or general module that represents the spatial contexts of objects. With these improvements in place, we will also compare against other state-of-the-art few-shot detectors (e.g., meta-learning-based algorithms).</p></sec>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p></sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>SZ: Writing&#x02014;original draft. HL: Writing&#x02014;review &#x00026; editing. ZW: Writing&#x02014;review &#x00026; editing. ZZ: Writing&#x02014;review &#x00026; editing.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was partially supported by the National Natural Science Foundation of China (NSFC) under Grant 62201068 and the School Foundation of BISTU under Grant 9152124103.</p>
</sec>
<ack><p>The authors are grateful to Zhiquan Tan for clarifying a derivation step in his preprint, and to the reviewers and Prof. Baojun Zhao for their encouraging and insightful advice, which led to this improved version and a clearer presentation of the technical content.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Biswas</surname> <given-names>S. K.</given-names></name> <name><surname>Milanfar</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <article-title>One shot detection with Laplacian object and fast matrix cosine similarity</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>38</volume>, <fpage>546</fpage>&#x02013;<lpage>562</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2015.2453950</pub-id><pub-id pub-id-type="pmid">27046497</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bohdal</surname> <given-names>O.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Hospedales</surname> <given-names>T. M.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;EvoGrad: efficient gradient-based meta-learning and hyperparameter optimization,&#x0201D;</article-title> in <source>Neural Information Processing Systems</source> (<publisher-loc>New York, NY</publisher-loc>).</citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Boiman</surname> <given-names>O.</given-names></name> <name><surname>Shechtman</surname> <given-names>E.</given-names></name> <name><surname>Irani</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>&#x0201C;In defense of nearest-neighbor based image classification,&#x0201D;</article-title> in <source>2008 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Anchorage, AK</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2008.4587598</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Calinski</surname> <given-names>T.</given-names></name> <name><surname>Krzysko</surname> <given-names>M.</given-names></name> <name><surname>Wolynski</surname> <given-names>W.</given-names></name></person-group> (<year>2006</year>). <article-title>A comparison of some tests for determining the number of nonzero canonical correlations</article-title>. <source>Commun. Stat. B, Simul. Comput</source>. <volume>35</volume>, <fpage>727</fpage>&#x02013;<lpage>749</lpage>. <pub-id pub-id-type="doi">10.1080/03610910600716290</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Kornblith</surname> <given-names>S.</given-names></name> <name><surname>Norouzi</surname> <given-names>M.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2020</year>). <article-title>A simple framework for contrastive learning of visual representations</article-title>. <source>arXiv [Preprint]. arXiv: 2002.05709</source>.</citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Paiton</surname> <given-names>D. M.</given-names></name> <name><surname>Olshausen</surname> <given-names>B. A.</given-names></name></person-group> (<year>2018</year>). <article-title>The sparse manifold transform</article-title>. <source>arXiv [Preprint]. arXiv: 1806.08887</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1806.08887</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Yun</surname> <given-names>Z.</given-names></name> <name><surname>Ma</surname> <given-names>Y.</given-names></name> <name><surname>Olshausen</surname> <given-names>B.</given-names></name> <name><surname>LeCun</surname> <given-names>Y.</given-names></name></person-group> (<year>2022</year>). <article-title>Minimalistic unsupervised learning with the sparse manifold transform</article-title>. <source>arXiv</source>. [Preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.2209.15261</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dalal</surname> <given-names>N.</given-names></name> <name><surname>Triggs</surname> <given-names>B.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x0201C;Histograms of oriented gradients for human detection,&#x0201D;</article-title> in <source>2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR&#x00027;05)</source>, Volume 1 (<publisher-loc>San Diego, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>886</fpage>&#x02013;<lpage>893</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2005.177</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>C.</given-names></name> <name><surname>He</surname> <given-names>S.</given-names></name> <name><surname>Han</surname> <given-names>Y.</given-names></name> <name><surname>Zhao</surname> <given-names>B.</given-names></name></person-group> (<year>2021</year>). <article-title>Learning dynamic spatial-temporal regularization for UAV object tracking</article-title>. <source>IEEE Signal Process. Lett.</source> <volume>28</volume>, <fpage>1230</fpage>&#x02013;<lpage>1234</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Finn</surname> <given-names>C.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Model-agnostic meta-learning for fast adaptation of deep networks,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>1126</fpage>&#x02013;<lpage>1135</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Girshick</surname> <given-names>R.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Fast R-CNN,&#x0201D;</article-title> in <source>Proceedings of 2015 IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>), <fpage>1440</fpage>&#x02013;<lpage>1448</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Grauman</surname> <given-names>K.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x0201C;The pyramid match kernel: discriminative classification with sets of image features,&#x0201D;</article-title> in <source>Tenth IEEE International Conference on Computer Vision, Volume 2</source> (<publisher-loc>Beijing</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1458</fpage>&#x02013;<lpage>1465</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2005.239</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>A comprehensive review for typical applications based upon unmanned aerial vehicle platform</article-title>. <source>IEEE J. Select. Top. Appl. Earth Observ. Remote Sens.</source> <volume>15</volume>, <fpage>9654</fpage>&#x02013;<lpage>9666</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep residual learning for image recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hinton</surname> <given-names>G. E.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R. R.</given-names></name></person-group> (<year>2006</year>). <article-title>Reducing the dimensionality of data with neural networks</article-title>. <source>Science</source> <volume>313</volume>:<fpage>504</fpage>. <pub-id pub-id-type="doi">10.1126/science.1127647</pub-id><pub-id pub-id-type="pmid">16873662</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huisman</surname> <given-names>M.</given-names></name> <name><surname>Van Rijn</surname> <given-names>J. N.</given-names></name> <name><surname>Plaat</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>A survey of deep meta-learning</article-title>. <source>Artif. Intell. Rev</source>. <volume>54</volume>, <fpage>4483</fpage>&#x02013;<lpage>4541</lpage>. <pub-id pub-id-type="doi">10.1007/s10462-021-10004-4</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jin</surname> <given-names>Y.</given-names></name> <name><surname>Geman</surname> <given-names>S.</given-names></name></person-group> (<year>2006</year>). <article-title>&#x0201C;Context and hierarchy in a probabilistic image model,&#x0201D;</article-title> in <source>2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR&#x00027;06), Volume</source> 2 (<publisher-loc>New York, NY</publisher-loc>), <fpage>2145</fpage>&#x02013;<lpage>2152</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaya</surname> <given-names>M.</given-names></name> <name><surname>Bilge</surname> <given-names>H. &#x0015E;.</given-names></name></person-group> (<year>2019</year>). <article-title>Deep metric learning: a survey</article-title>. <source>Symmetry</source> <volume>11</volume>:<fpage>1066</fpage>. <pub-id pub-id-type="doi">10.3390/sym11091066</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kruger</surname> <given-names>N.</given-names></name> <name><surname>Janssen</surname> <given-names>P.</given-names></name> <name><surname>Kalkan</surname> <given-names>S.</given-names></name> <name><surname>Lappe</surname> <given-names>M.</given-names></name> <name><surname>Leonardis</surname> <given-names>A.</given-names></name> <name><surname>Piater</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Deep hierarchies in the primate visual cortex: what can we learn for computer vision?</article-title> <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>35</volume>, <fpage>1847</fpage>&#x02013;<lpage>1871</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2012.272</pub-id><pub-id pub-id-type="pmid">23787340</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lake</surname> <given-names>B. M.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R.</given-names></name> <name><surname>Tenenbaum</surname> <given-names>J. B.</given-names></name></person-group> (<year>2015</year>). <article-title>Human-level concept learning through probabilistic program induction</article-title>. <source>Science</source> <volume>350</volume>, <fpage>1332</fpage>&#x02013;<lpage>1338</lpage>. <pub-id pub-id-type="doi">10.1126/science.aab3050</pub-id><pub-id pub-id-type="pmid">26659050</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source> <volume>521</volume>:<fpage>436</fpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Hariharan</surname> <given-names>B.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Feature pyramid networks for object detection,&#x0201D;</article-title> in <source>2017 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>936</fpage>&#x02013;<lpage>944</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.106</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Anguelov</surname> <given-names>D.</given-names></name> <name><surname>Erhan</surname> <given-names>D.</given-names></name> <name><surname>Szegedy</surname> <given-names>C.</given-names></name> <name><surname>Reed</surname> <given-names>S.</given-names></name> <name><surname>Fu</surname> <given-names>C.-Y.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;SSD: single shot multibox detector,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2016, Volume</source> 9905, eds B. Leibe, J. Matas, N. Sebe, and M. Welling (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>21</fpage>&#x02013;<lpage>37</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46448-0_2</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lowe</surname> <given-names>D. G.</given-names></name></person-group> (<year>2004</year>). <article-title>Distinctive image features from scale-invariant keypoints</article-title>. <source>Int. J. Comput. Vis</source>. <volume>60</volume>, <fpage>91</fpage>&#x02013;<lpage>110</lpage>. <pub-id pub-id-type="doi">10.1023/B:VISI.0000029664.99615.94</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mikolajczyk</surname> <given-names>K.</given-names></name> <name><surname>Schmid</surname> <given-names>C.</given-names></name></person-group> (<year>2002</year>). <article-title>&#x0201C;An affine invariant interest point detector,&#x0201D;</article-title> in <source>ECCV</source> 2002, eds A. Heyden, G. Sparr, M. Nielsen, and P. Johansen (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>128</fpage>&#x02013;<lpage>142</lpage>. <pub-id pub-id-type="doi">10.1007/3-540-47969-4_9</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rasmussen</surname> <given-names>C. E.</given-names></name></person-group> (<year>2003</year>). <article-title>&#x0201C;Gaussian processes in machine learning,&#x0201D;</article-title> in <source>Summer School on Machine Learning</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>63</fpage>&#x02013;<lpage>71</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-540-28650-9_4</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Redmon</surname> <given-names>J.</given-names></name> <name><surname>Divvala</surname> <given-names>S.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;You only look once: unified, real-time object detection,&#x0201D;</article-title> in <source>2016 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>779</fpage>&#x02013;<lpage>788</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.91</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Seo</surname> <given-names>H. J.</given-names></name> <name><surname>Milanfar</surname> <given-names>P.</given-names></name></person-group> (<year>2010</year>). <article-title>Training-free, generic object detection using locally adaptive regression kernels</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>32</volume>, <fpage>1688</fpage>&#x02013;<lpage>1704</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2009.153</pub-id><pub-id pub-id-type="pmid">20634561</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shorten</surname> <given-names>C.</given-names></name> <name><surname>Khoshgoftaar</surname> <given-names>T. M.</given-names></name></person-group> (<year>2019</year>). <article-title>A survey on image data augmentation for deep learning</article-title>. <source>J. Big Data</source> <volume>6</volume>, <fpage>1</fpage>&#x02013;<lpage>48</lpage>. <pub-id pub-id-type="doi">10.1186/s40537-019-0197-0</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>Z.-H.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Yang</surname> <given-names>J.</given-names></name> <name><surname>Yuan</surname> <given-names>Y.</given-names></name></person-group> (<year>2023</year>). <article-title>Contrastive learning is spectral clustering on similarity graph</article-title>. <source>arXiv [Preprint]. arXiv: 2303.15103</source>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Van Assel</surname> <given-names>H.</given-names></name> <name><surname>Espinasse</surname> <given-names>T.</given-names></name> <name><surname>Chiquet</surname> <given-names>J.</given-names></name> <name><surname>Picard</surname> <given-names>F.</given-names></name></person-group> (<year>2022</year>). <article-title>A probabilistic graph coupling view of dimension reduction</article-title>. <source>Adv. Neural Inform. Process. Syst.</source> <volume>35</volume>, <fpage>10696</fpage>&#x02013;<lpage>10708</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.2201.13053</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Blundell</surname> <given-names>C.</given-names></name> <name><surname>Lillicrap</surname> <given-names>T.</given-names></name> <name><surname>Kavukcuoglu</surname> <given-names>K.</given-names></name> <name><surname>Wierstra</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Matching networks for one shot learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Volume 29</source> (<publisher-loc>Red Hook, NY</publisher-loc>), <fpage>3630</fpage>&#x02013;<lpage>3638</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <name><surname>Shen</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <article-title>A novel video face verification algorithm based on tplbp and the 3D siamese-CNN</article-title>. <source>Electronics</source> <volume>8</volume>:<fpage>1544</fpage>. <pub-id pub-id-type="doi">10.3390/electronics8121544</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zou</surname> <given-names>Z.</given-names></name> <name><surname>Shi</surname> <given-names>Z.</given-names></name></person-group> (<year>2017</year>). <article-title>Random access memories: a new paradigm for target detection in high resolution aerial remote sensing images</article-title>. <source>IEEE Trans. Image Process</source>. <volume>27</volume>, <fpage>1100</fpage>&#x02013;<lpage>1111</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2017.2773199</pub-id><pub-id pub-id-type="pmid">29220314</pub-id></citation></ref>
</ref-list>
</back>
</article>