<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1386464</article-id>
<article-id pub-id-type="doi">10.3389/frobt.2024.1386464</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Distributed training of CosPlace for large-scale visual place recognition</article-title>
<alt-title alt-title-type="left-running-head">Zaccone et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frobt.2024.1386464">10.3389/frobt.2024.1386464</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zaccone</surname>
<given-names>Riccardo</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2653636/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Berton</surname>
<given-names>Gabriele</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1573605/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Masone</surname>
<given-names>Carlo</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1573604/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff>
<institution>Visual And Multimodal Applied Learning Laboratory (VANDAL Lab)</institution>, <institution>Dipartimento di Automatica e Informatica (DAUIN)</institution>, <institution>Politecnico di Torino</institution>, <addr-line>Turin</addr-line>, <country>Italy</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2379392/overview">Abdul Hafez Abdulhafez</ext-link>, Hasan Kalyoncu University, T&#xfc;rkiye</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2666019/overview">Saed Alqaraleh</ext-link>, Isra University, Jordan</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2672534/overview">Utkarsh Rai</ext-link>, International Institute of Information Technology, Hyderabad, India</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Riccardo Zaccone, <email>riccardo.zaccone@polito.it</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>20</day>
<month>05</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>11</volume>
<elocation-id>1386464</elocation-id>
<history>
<date date-type="received">
<day>15</day>
<month>02</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>22</day>
<month>04</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2024 Zaccone, Berton and Masone.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Zaccone, Berton and Masone</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Visual place recognition (VPR) is a popular computer vision task aimed at recognizing the geographic location of a visual query, usually within a tolerance of a few meters. Modern approaches address VPR from an image-retrieval standpoint, running a kNN search on embeddings that a deep neural network extracts from both the query and the images in a database. Most of these approaches rely on contrastive learning, whose mining step limits their ability to be trained on large-scale datasets; the recently proposed CosPlace instead adopts a classification task as the training proxy, which has proven effective in enabling VPR models to learn from large-scale and fine-grained datasets. In this work, we experimentally analyze CosPlace from a continual learning perspective and show that its sequential training procedure leads to suboptimal results. As a solution, we propose a different formulation that not only effectively solves the pitfalls of the original training strategy but also enables faster and more efficient distributed training. Finally, we discuss the open challenges in further speeding up large-scale image retrieval for VPR.</p>
</abstract>
<kwd-group>
<kwd>visual place recognition</kwd>
<kwd>visual geolocalization</kwd>
<kwd>distributed learning</kwd>
<kwd>image retrieval</kwd>
<kwd>deep learning</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Field Robotics</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Visual place recognition (VPR) (<xref ref-type="bibr" rid="B15">Masone and Caputo, 2021</xref>) is a popular computer vision task that aims to recognize the geographic location of a visual query and usually has an accepted tolerance of a few meters. VPR tasks are commonly approached as image-retrieval problems, in which a never-before-seen query image is matched to a database of geotagged images; the most similar images in the database are then used to infer the coordinates of the query.</p>
<p>The typical pipeline for VPR involves a neural network to extract embeddings from both the query and each image in the database. These embeddings are then compared using a k-nearest neighbor (kNN) algorithm to retrieve the most similar results from the database and their corresponding geotags. For the kNN step to be effective, it is crucial that the embedding space learned by the neural network be sufficiently discriminative for places; this is commonly achieved by training the models with contrastive learning approaches using a triplet loss (<xref ref-type="bibr" rid="B3">Arandjelovi&#x107; et al., 2018</xref>) or other similar losses and leveraging the geotags of the database images as a form of weak supervision to mine negative and positive examples (<xref ref-type="bibr" rid="B3">Arandjelovi&#x107; et al., 2018</xref>). However, the execution time required for the mining operation scales linearly with the size of the database (<xref ref-type="bibr" rid="B5">Berton et al., 2022b</xref>), thus becoming a bottleneck that impedes training on massive datasets. A naive mitigation strategy here would be to mine the positive/negative examples within a subset of the data (<xref ref-type="bibr" rid="B28">Warburg et al., 2020</xref>), but this ultimately hampers the ability to learn more discriminative and generalizable representations.</p>
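The retrieval step described above can be sketched in a few lines of NumPy. This is an illustrative toy example only (the embedding dimension, database size, and function name are made up, and a production system would use an approximate index rather than brute-force search): geotags of the most similar database images are retrieved by cosine similarity between normalized embeddings.

```python
import numpy as np

def knn_retrieve(query_emb, db_embs, db_geotags, k=5):
    """Retrieve the geotags of the k database images most similar to the query.

    query_emb:  (D,) embedding of the query image.
    db_embs:    (N, D) embeddings of the geotagged database images.
    db_geotags: (N, 2) UTM (east, north) coordinates of the database images.
    """
    # On L2-normalized embeddings, cosine similarity is a plain dot product.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q                     # (N,) similarity of each database image
    top_k = np.argsort(-sims)[:k]     # indices of the k most similar images
    return db_geotags[top_k]

# Toy example: 100 database images with 64-dim random embeddings; the query
# is a slightly perturbed copy of database image 42.
rng = np.random.default_rng(0)
db_embs = rng.normal(size=(100, 64))
db_geotags = rng.uniform(0.0, 1000.0, size=(100, 2))
query = db_embs[42] + 0.01 * rng.normal(size=64)
nearest = knn_retrieve(query, db_embs, db_geotags, k=1)
```

The coordinates of the query are then inferred from the retrieved geotags, e.g., by taking the geotag of the top-1 match.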
<p>To solve this problem at its root, <xref ref-type="bibr" rid="B4">Berton et al. (2022a)</xref> recently proposed a paradigm shift in the training procedure for VPR. Their solution, called CosPlace, is specifically designed for large-scale and fine-grained VPR, and it adopts a classification task as the proxy for training the model without mining. To enable this classification proxy, CosPlace introduces a partitioning strategy that divides the continuous label space of the training images (GPS and compass annotations) into a finite set of disjoint groups (CosPlace groups), each containing a number of classes. This partition is intended to guarantee that images from different classes (i.e., representative of different places) within the same group have no visual overlap. Thereafter, CosPlace is trained sequentially on a single group at a time to avoid ambiguities caused by partition-induced visual aliasing (<xref ref-type="fig" rid="F2">Figure 2</xref>, left). Although CosPlace can be trained on a much larger number of images than reported in previous works and has achieved new state-of-the-art (SOTA) results, we hypothesize that its sequential training protocol is suboptimal because it optimizes an approximation of the intended minimization problem. This hypothesis stems from approaching the CosPlace training protocol from an incremental learning perspective. In fact, each CosPlace group may be regarded as a separate learning task that uses a shared feature extractor and a per-group classification head. During each epoch, the model is trained for a given number of optimization steps on a single group (task); however, there is no guarantee that switching to a new task during the next epoch will not degrade the model&#x2019;s performance on the older tasks. In this paper, we experimentally validate this hypothesis by showing that sequential training delays convergence and that returns diminish once the number of groups grows beyond a certain threshold.</p>
<p>In light of this observation, we redefine the CosPlace training procedure so that the algorithm trains different groups in parallel (<xref ref-type="fig" rid="F1">Figure 1</xref>). Note that this is different from applying a standard data-parallel approach, which would only split the same batch of data corresponding to the same task among the available accelerators (<xref ref-type="fig" rid="F2">Figure 2</xref>, right). The proposed solution not only solves the previous issue by implementing joint objective optimization over all the selected groups but also allows efficient training parallelization. Hence, we refer to this solution as distributed-CosPlace (D-CosPlace). The main contributions of this work are summarized as follows:<list list-type="simple">
<list-item>
<p>&#x2022; We analyze CosPlace to unveil the pitfalls of the original sequential formulation and investigate possible mitigation strategies.</p>
</list-item>
<list-item>
<p>&#x2022; We propose a new group-parallel training protocol called D-CosPlace, which not only addresses extant issues but also allows effective use of communication-efficient SOTA distributed algorithms. This improves the performance of the original CosPlace by a large margin on several VPR datasets within the same time budget.</p>
</list-item>
<list-item>
<p>&#x2022; By further analyzing the training of the proposed distributed version of CosPlace, we outline the open challenges in speeding up training for large-scale VPR.</p>
</list-item>
</list>
</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>In the proposed D-CosPlace, each accelerator optimizes the model in parallel with respect to a different CosGroup for <italic>J</italic> steps before merging the model and optimizers&#x2019; states (backbone only). This process is repeated until convergence.</p>
</caption>
<graphic xlink:href="frobt-11-1386464-g001.tif"/>
</fig>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Comparison of CosPlace (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>) with a naive data-parallel variant. Unlike both approaches, the model in the proposed solution is jointly optimized with respect to all the training CosGroups (<xref ref-type="fig" rid="F1">Figure 1</xref>). Best viewed in color.</p>
</caption>
<graphic xlink:href="frobt-11-1386464-g002.tif"/>
</fig>
</sec>
<sec id="s2">
<title>2 Related works</title>
<sec id="s2.1">
<title>2.1 Large-scale visual place recognition</title>
<p>Modern VPR approaches extract compact image embeddings using a feature extractor backbone followed by a head that implements aggregation or pooling (<xref ref-type="bibr" rid="B11">Kim et al., 2017</xref>; <xref ref-type="bibr" rid="B3">Arandjelovi&#x107; et al., 2018</xref>; <xref ref-type="bibr" rid="B8">Ge et al., 2020</xref>; <xref ref-type="bibr" rid="B2">Ali-bey et al., 2023</xref>; <xref ref-type="bibr" rid="B7">Berton et al., 2023</xref>; <xref ref-type="bibr" rid="B31">Zhu et al., 2023</xref>). These usually employ contrastive learning, using the geotags of the training set as a type of weak supervision to mine negative examples. However, this mining operation is expensive and impractical for scaling to large datasets (<xref ref-type="bibr" rid="B5">Berton et al., 2022b</xref>). To mitigate this problem, <xref ref-type="bibr" rid="B1">Ali-bey et al. (2022)</xref> proposed the use of a curated training-only dataset in which the images are already split into predefined classes that are far apart from each other, thereby enabling the composition of training batches with images from the same place (positive examples) and from other places (negative examples) very efficiently. The method proposed by <xref ref-type="bibr" rid="B12">Leyva-Vallina et al. (2023)</xref> involves annotating the images with a graded similarity, thus enabling training with contrastive losses and full supervision while achieving improvements in terms of both data efficiency and final model quality. Instead of mitigating the cost of mining, <xref ref-type="bibr" rid="B4">Berton et al. (2022a)</xref> proposed an approach to remove it entirely through their CosPlace method. The idea of CosPlace is to first partition the training images into disjoint groups with one-hot labels and to then train sequentially on these groups with the CosFace loss (<xref ref-type="bibr" rid="B25">Wang et al., 2018</xref>) that was originally designed for large-scale face recognition. 
Although CosPlace achieves SOTA results on large-scale datasets and even in generalized scenarios, we show here that its sequential training procedure is suboptimal and slows convergence. In view of these findings, we introduce a parallel-training version of CosPlace that converges faster and produces new SOTA results on several benchmarks.</p>
</sec>
<sec id="s2.2">
<title>2.2 Distributed training</title>
<p>The growth of deep-learning methods and training datasets is driving research on distributed training solutions. Among these, data parallelism constitutes a popular family of methods (<xref ref-type="bibr" rid="B14">Lin et al., 2020</xref>) wherein different chunks of data are processed in parallel before combining the model updates either synchronously or asynchronously. In particular, to reduce the communication overhead of data movement between the accelerators, local optimization methods are commonly used to allow multiple optimization steps on disjoint sets of data before merging the updates (<xref ref-type="bibr" rid="B21">Stich, 2019</xref>; <xref ref-type="bibr" rid="B29">Yu et al., 2019</xref>; <xref ref-type="bibr" rid="B26">Wang et al., 2020</xref>). In this work, we redefine CosPlace&#x2019;s training procedure by introducing the parallel training of groups and leveraging local methods to speed up convergence.</p>
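The local optimization methods mentioned above can be illustrated with a toy NumPy sketch of synchronous local SGD on a least-squares problem: each worker takes several gradient steps on its own data shard, after which the models are averaged in a single communication round. Everything here (shard sizes, step counts, learning rate, problem choice) is a made-up illustration, and the workers are simulated sequentially rather than run in parallel.

```python
import numpy as np

def local_sgd(data_shards, w_init, local_steps=10, rounds=20, lr=0.1):
    """Local SGD sketch: each worker runs `local_steps` gradient steps on its
    own shard, then the worker models are averaged (one communication round)."""
    w = w_init.copy()
    for _ in range(rounds):
        local_models = []
        for X, y in data_shards:          # each worker (simulated sequentially)
            w_i = w.copy()
            for _ in range(local_steps):
                grad = 2 * X.T @ (X @ w_i - y) / len(y)  # least-squares gradient
                w_i -= lr * grad
            local_models.append(w_i)
        w = np.mean(local_models, axis=0)  # synchronous model averaging
    return w

# Toy setup: 4 workers, each holding a disjoint noiseless shard of the same
# linear regression problem; local SGD recovers the shared solution.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
shards = []
for _ in range(4):
    X = rng.normal(size=(50, 5))
    shards.append((X, X @ w_true))
w = local_sgd(shards, np.zeros(5))        # residual shrinks toward 0
```

Communicating only once every `local_steps` updates is what reduces the data-movement overhead relative to averaging gradients at every step.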
</sec>
</sec>
<sec id="s3">
<title>3 Analysis of CosPlace</title>
<p>In this section, we analyze the CosPlace training algorithm and highlight the drawbacks of its sequential protocol.</p>
<sec id="s3-1">
<title>3.1 Notation</title>
<p>The first step in CosPlace&#x2019;s training protocol involves creating a set of discrete labels from the continuous space of the Universal Transverse Mercator (UTM) coordinates of the area of interest (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>). Formally, we define the training distribution <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>&#x2254;</mml:mo>
<mml:mi mathvariant="script">X</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi mathvariant="script">C</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf2">
<mml:math id="m2">
<mml:mi mathvariant="script">X</mml:mi>
</mml:math>
</inline-formula> is the space of possible images and <inline-formula id="inf3">
<mml:math id="m3">
<mml:mi mathvariant="script">C</mml:mi>
</mml:math>
</inline-formula> is the space of UTM coordinates (east, north, heading). We also define a new distribution <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2254;</mml:mo>
<mml:mi mathvariant="script">X</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi mathvariant="script">Y</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf5">
<mml:math id="m5">
<mml:mi mathvariant="script">Y</mml:mi>
</mml:math>
</inline-formula> is the label space induced by partitioning <inline-formula id="inf6">
<mml:math id="m6">
<mml:mi mathvariant="script">C</mml:mi>
</mml:math>
</inline-formula>. Formally, a UTM point <inline-formula id="inf7">
<mml:math id="m7">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">C</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is discretized to a label <inline-formula id="inf8">
<mml:math id="m8">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mrow>
<mml:mo>&#x230a;</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>&#x230b;</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>&#x230a;</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>&#x230b;</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>&#x230a;</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>&#x230b;</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>,</inline-formula> where <italic>M</italic> and <italic>&#x3b1;</italic> describe the extent of a region covered by any class in meters and degrees, respectively. The set of such classes is then split into groups called CosGroups by fixing the minimum spatial separation between two classes of the same group in terms of both translation and orientation. Formally, a CosPlace group is defined as the set of classes such that <disp-formula id="e1">
<mml:math id="m9">
<mml:mtable class="align" columnalign="left">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:msub>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x2254;</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="left">
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">Y</mml:mi>
<mml:mo>:</mml:mo>
<mml:mspace width="1em"/>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.2em"/>
<mml:mi>mod</mml:mi>
<mml:mspace width="0.2em"/>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mspace width="0.17em"/>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>h</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.2em"/>
<mml:mi>mod</mml:mi>
<mml:mspace width="0.2em"/>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mspace width="0.17em"/>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.2em"/>
<mml:mi>mod</mml:mi>
<mml:mspace width="0.2em"/>
<mml:mi>L</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(1)</label>
</disp-formula>where <italic>N</italic> and <italic>L</italic> are hyperparameters for the fixed minimum spatial and angular separations between classes belonging to the same CosGroup. We denote the set of such groups as <inline-formula id="inf9">
<mml:math id="m10">
<mml:mi mathvariant="script">G</mml:mi>
</mml:math>
</inline-formula>, i.e., <inline-formula id="inf10">
<mml:math id="m11">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mspace width="0.28em"/>
<mml:mo>&#x2200;</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>v</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="double-struck">N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. Given multiple CosGroups (defined by Eq. <xref ref-type="disp-formula" rid="e1">1</xref>), it is possible to derive multiple training distributions <inline-formula id="inf11">
<mml:math id="m12">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2254;</mml:mo>
<mml:mi mathvariant="script">X</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2282;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where each distribution maps the sample image to a one-hot label within the <italic>i</italic>th CosGroup. The CosGroups partition is reflected in the model and is composed of two components: a feature extractor <inline-formula id="inf12">
<mml:math id="m13">
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:mi mathvariant="script">X</mml:mi>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> parameterized by weights <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> and multiple classifiers <inline-formula id="inf13">
<mml:math id="m14">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mn>0,1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> that are each associated with a different CosGroup parameterized by the weights <inline-formula id="inf14">
<mml:math id="m15">
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>.</p>
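To make the notation above concrete, the discretization of a UTM point into a class label and the assignment of that class to a CosGroup per Eq. 1 can be sketched as follows. The parameter values below (M, &#x3b1;, N, L) and the function names are illustrative placeholders, not the settings used in the experiments.

```python
import math

def class_label(east, north, heading, M=10.0, alpha=30.0):
    """Discretize a UTM point (east, north, heading) into a class label
    y = (floor(east/M), floor(north/M), floor(heading/alpha)),
    where M (meters) and alpha (degrees) set the extent of each class."""
    return (math.floor(east / M),
            math.floor(north / M),
            math.floor(heading / alpha))

def cosgroup(label, N=5, L=2):
    """Map a class label to its CosGroup (u, v, w): two classes in the same
    group are separated by at least N cells in translation and L cells in
    orientation, so they depict visually non-overlapping places."""
    e_cell, n_cell, h_cell = label
    return (e_cell % N, n_cell % N, h_cell % L)

# Two images ~12 m apart fall into adjacent classes, which land in
# different CosGroups and are therefore never trained on together.
y1 = class_label(east=1000.0, north=2000.0, heading=15.0)
y2 = class_label(east=1012.0, north=2000.0, heading=15.0)
```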
</sec>
<sec id="s3-2">
<title>3.2 CosPlace objective function</title>
<p>The goal of CosPlace is to learn a feature extractor <italic>B</italic>(&#x22c5;) that maps the original distribution <inline-formula id="inf15">
<mml:math id="m16">
<mml:mi mathvariant="script">X</mml:mi>
</mml:math>
</inline-formula> into an embedding space in which distances reflect the geographic distances between the locations depicted in the images. To this end, CosPlace aims to solve the following problem:<disp-formula id="e2">
<mml:math id="m17">
<mml:msup>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>arg</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msub>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi mathvariant="script">G</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x223c;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">lmcl</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x25e6;</mml:mo>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(2)</label>
</disp-formula>
</p>
<p>In practice, the training procedure should minimize the large margin cosine loss (LMCL) (<xref ref-type="bibr" rid="B25">Wang et al., 2018</xref>) of the entire model <inline-formula id="inf16">
<mml:math id="m18">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>&#x2254;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo>&#x222a;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> with respect to the label distribution(s) induced by discretization of the GPS coordinates into classes and by the grouping of these classes. The parameters <inline-formula id="inf17">
<mml:math id="m19">
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> of the classifiers are used only to train the feature extractor <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> and are discarded after training. The final performance of <italic>B</italic>(&#x22c5;) is assessed using the kNN algorithm as a proxy with respect to the original distribution <inline-formula id="inf18">
<mml:math id="m20">
<mml:mi mathvariant="script">D</mml:mi>
</mml:math>
</inline-formula>.</p>
</sec>
<sec id="s3-3">
<title>3.3 CosPlace training: a continual learning perspective</title>
<p>Although CosPlace aims to optimize Eq. <xref ref-type="disp-formula" rid="e2">2</xref>, we observe that the sequential optimization of <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> with respect to each CosGroup is just an approximation of this objective function. Formally, it implements<disp-formula id="e3">
<mml:math id="m21">
<mml:mtable class="align" columnalign="left">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>arg</mml:mtext>
<mml:msub>
<mml:mrow>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x223c;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">lmcl</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x25e6;</mml:mo>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mspace width="0.17em"/>
<mml:mspace width="0.17em"/>
<mml:mspace width="0.17em"/>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="1em"/>
<mml:mo>&#x2200;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right">
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mspace width="1em"/>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mtext>initial model</mml:mtext>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(3)</label>
</disp-formula>where <inline-formula id="inf19">
<mml:math id="m22">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2286;</mml:mo>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is a subset of all possible CosGroups selected <italic>a priori</italic> for training. Eq. <xref ref-type="disp-formula" rid="e3">3</xref> practically means that at each <italic>iteration</italic> <italic>e</italic>, the training procedure selects the <italic>i</italic>th CosGroup <inline-formula id="inf20">
<mml:math id="m23">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>, with <inline-formula id="inf21">
<mml:math id="m24">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2254;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mspace width="0.2em"/>
<mml:mi>mod</mml:mi>
<mml:mspace width="0.2em"/>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, and jointly optimizes the parameters <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> and <inline-formula id="inf22">
<mml:math id="m25">
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> for <italic>s</italic> optimization steps starting from the optimal model obtained from the previous CosGroup <inline-formula id="inf23">
<mml:math id="m26">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>.</p>
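<p>The round-robin schedule of Eq. 3 can be sketched as follows. This is a minimal, illustrative Python sketch with hypothetical names (not the authors' implementation): at iteration <italic>e</italic> the group index is <italic>i</italic> = <italic>e</italic> mod the number of training CosGroups, and the backbone is optimized jointly with that group's classifier for <italic>s</italic> steps, starting from the model left by the previous group.</p>

```python
# Hypothetical sketch of the sequential CosPlace protocol (Eq. 3);
# names such as step_fn are illustrative, not from the paper.

def cosgroup_schedule(num_iterations, num_groups):
    """CosGroup index visited at each training iteration: i = e mod |G|."""
    return [e % num_groups for e in range(num_iterations)]

def sequential_training(theta_b, classifiers, groups, s, num_iterations, step_fn):
    """Round-robin over CosGroups: each iteration jointly optimizes the
    backbone theta_b and a single classifier theta_f_i for s steps,
    starting from the model produced by the previous group."""
    for e in range(num_iterations):
        i = e % len(groups)  # select the i-th CosGroup
        for _ in range(s):
            theta_b, classifiers[i] = step_fn(theta_b, classifiers[i], groups[i])
    return theta_b, classifiers
```

<p>Note that only one classifier is active at a time: the others receive no updates until their group is revisited, which is exactly the task-switching structure at the heart of the continual learning view.</p>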
<p>By expressing the CosPlace learning problem in this form, we can revisit it from a continual learning perspective. Accordingly, each distribution associated with a CosGroup can be considered as a task with a disjoint set of labels and dedicated parameters <inline-formula id="inf24">
<mml:math id="m27">
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>. Therefore, when CosPlace training iterates to a new CosGroup, it is akin to switching to a new task (<xref ref-type="fig" rid="F2">Figure 2</xref>, left). This is different from solving the original problem in Eq. <xref ref-type="disp-formula" rid="e2">2</xref> because there is no guarantee that switching to the new task will not harm the model&#x2019;s performance on the older tasks. In practice, the new model updates could be detrimental to the previous tasks, a phenomenon known as catastrophic forgetting (<xref ref-type="bibr" rid="B9">Goodfellow et al., 2014</xref>; <xref ref-type="bibr" rid="B17">Pf&#xfc;lb and Gepperth, 2019</xref>; <xref ref-type="bibr" rid="B18">Ramasesh et al., 2021</xref>). To verify whether this phenomenon actually manifests during CosPlace training, we performed an experiment using its original implementation on the SF-XL dataset provided by <xref ref-type="bibr" rid="B4">Berton et al. (2022a)</xref>. We plot the training loss for this experiment in <xref ref-type="fig" rid="F3">Figure 3</xref>, which clearly shows that at each iteration, when switching to a new CosGroup, the loss function exhibits a steep increase and requires many steps to recover a loss value similar to the one before the group change. This behavior is especially notable in the first few iterations, after which it gradually disappears, as expected once the model approaches convergence.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Training instabilities of CosPlace (left) and solution using D-CosPlace (right): changing classifiers (e.g., every <italic>s</italic> &#x3d; 10<italic>k</italic> steps) is followed by a spike in the training loss. Simple mitigation strategies, e.g., freezing <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> for a number of <italic>s</italic>
<sub>
<italic>freeze</italic>
</sub> steps to warmup the classifier and resetting the optimizers&#x2019; states, have limited efficacy and do not work in the long run. The proposed D-CosPlace is unaffected by this problem by design since all the classifiers are optimized jointly.</p>
</caption>
<graphic xlink:href="frobt-11-1386464-g003.tif"/>
</fig>
<p>The reason why optimizing Eq. <xref ref-type="disp-formula" rid="e3">3</xref> still works remarkably well is that the CosPlace training protocol relies on the fact that each task will be revisited after some iterations. Therefore, the algorithm eventually converges to a solution that is also good for the joint objective function of Eq. <xref ref-type="disp-formula" rid="e2">2</xref>. However, this is achieved at the cost of increased training time and is hardly scalable with respect to the number of trained groups <inline-formula id="inf25">
<mml:math id="m28">
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, as observed in the original work (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>). Together, these problems drastically limit the training-time scalability of CosPlace, which is precisely what the method was designed to provide.</p>
</sec>
<sec id="s3-4">
<title>3.4 Mitigation strategies</title>
<p>Given that the most severe jumps in the training loss in <xref ref-type="fig" rid="F3">Figure 3</xref> occur in the first few iterations, i.e., when the classifiers <inline-formula id="inf26">
<mml:math id="m29">
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> associated with each task have not yet been trained, one can consider some engineering solutions to solve this problem. A first modification would be to freeze the backbone model <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> for a number of steps <italic>s</italic>
<sub>
<italic>freeze</italic>
</sub> &#x226a; <italic>s</italic> whenever the task is changed. This prevents the weights <inline-formula id="inf27">
<mml:math id="m30">
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> from being uninitialized or too stale with respect to the backbone. Additionally, considering the amount of training that the model <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> has undergone since the last time task <italic>i</italic> was selected, it would also be beneficial to reset the optimizer state for model <inline-formula id="inf28">
<mml:math id="m31">
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> as it may be excessively biased. However, repeating the same experiment as before with these modifications shows that their effectiveness is limited (<xref ref-type="fig" rid="F3">Figure 3</xref>, orange line). In particular, we observe that resetting the optimizer state is only beneficial during the first few iterations, where it slightly speeds up convergence. However, this strategy worsens the final model quality in the long run because maintaining the optimizer states becomes beneficial as the model approaches convergence. A similar observation holds for freezing <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup>; it is initially useful, although a very large number of <italic>s</italic>
<sub>
<italic>freeze</italic>
</sub> steps are needed for a noticeable reduction in the training loss. In the long run, this becomes detrimental because these steps are wasted.</p>
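<p>The two heuristics above can be sketched as follows; this is our own illustrative Python, with hypothetical helper names, not the paper's code. On each switch to task <italic>i</italic>, the classifier's optimizer state is discarded and the backbone is frozen for the first <italic>s</italic><sub><italic>freeze</italic></sub> &#x226a; <italic>s</italic> steps to warm up the freshly selected classifier.</p>

```python
# Hedged sketch of the two mitigation heuristics (hypothetical helpers).

def backbone_update_mask(s, s_freeze):
    """True at steps where the backbone theta_b is updated; the first
    s_freeze steps train only the freshly selected classifier."""
    return [t >= s_freeze for t in range(s)]

def reset_classifier_optimizer(optimizer_states, i):
    """Discard the (possibly stale) optimizer state of classifier i,
    e.g. Adam's first and second moments, on a group switch."""
    optimizer_states[i] = {}
    return optimizer_states
```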
<p>In conclusion, despite their simplicity, such mitigation strategies require careful engineering to determine <italic>s</italic>
<sub>
<italic>freeze</italic>
</sub> as well as decide when to use them, making them practically ineffective. Moreover, since these issues arise after performing a significant amount of training between two samplings of the same task <italic>i</italic>, these simple strategies cannot be scaled when the number of training CosGroups <inline-formula id="inf29">
<mml:math id="m32">
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> increases.</p>
</sec>
</sec>
<sec id="s4">
<title>4 Distributed CosPlace</title>
<p>The analysis presented in <xref ref-type="sec" rid="s3">Section 3</xref> reveals that the CosPlace training procedure does not correctly implement the objective function of Eq. <xref ref-type="disp-formula" rid="e2">2</xref>. The problem here lies in the sequential protocol, which optimizes the model with respect to each CosGroup separately in a sequential manner. To recover the objective function of Eq. <xref ref-type="disp-formula" rid="e2">2</xref>, we should calculate the gradients for all CosGroups in parallel, i.e., using the same model <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup>, before averaging them to update the model according to the optimizer policy. These gradients can be computed sequentially or in parallel to benefit from multiple accelerators. This joint optimization procedure exactly recovers the original objective of the vanilla CosPlace, namely Eq. <xref ref-type="disp-formula" rid="e2">2</xref>: indeed, at each optimization step, the algorithm optimizes <inline-formula id="inf30">
<mml:math id="m33">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x223c;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">lmcl</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x25e6;</mml:mo>
<mml:mi>B</mml:mi>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> jointly with respect to all CosGroups <inline-formula id="inf31">
<mml:math id="m34">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. Accordingly, the proposed formulation effectively addresses the problem outlined in <xref ref-type="sec" rid="s3-3">Section 3.3</xref>, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref> (right): the severe loss jumps observed during sequential CosGroup training are completely eliminated.</p>
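<p>The joint update described above can be sketched with scalar parameters (an assumed, simplified Python sketch, not the released implementation): every CosGroup's gradient is computed against the same backbone, the backbone gradients are averaged into one update, and each classifier keeps its own gradient.</p>

```python
# Scalar-parameter sketch of one joint D-CosPlace step (illustrative only).

def joint_step(theta_b, classifiers, groups, grad_fn, lr):
    """grad_fn(theta_b, theta_f_i, group_i) -> (backbone_grad, classifier_grad);
    backbone gradients are averaged over all CosGroups before a single update."""
    grads = [grad_fn(theta_b, f, g) for f, g in zip(classifiers, groups)]
    avg_b = sum(g_b for g_b, _ in grads) / len(grads)
    new_theta_b = theta_b - lr * avg_b                  # one shared backbone update
    new_classifiers = [f - lr * g_f for f, (_, g_f) in zip(classifiers, grads)]
    return new_theta_b, new_classifiers
```

<p>Because all gradients are taken with respect to the same backbone before any update is applied, no CosGroup ever sees a model biased toward another group, which is the source of the loss spikes in the sequential protocol.</p>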
<p>This idea may seem similar to standard data parallelization, as implemented in most deep-learning frameworks. In fact, a common implementation would entail dividing the original batch of data into <italic>k</italic> smaller chunks, letting each accelerator compute gradients with respect to the same model on a chunk, averaging the resulting gradients, and updating the final model according to the optimizer policy (<xref ref-type="fig" rid="F2">Figure 2</xref>, right). However, this approach does not address the problem arising from sequential training as noted previously because it would still be applied separately to each CosGroup. Instead, we need a data parallelization strategy that is aware of the division into CosGroups, where each group corresponds to a separate classifier, and that can jointly optimize the model with respect to all CosGroups. Moreover, since each CosGroup is a disjoint set of data by construction, it is possible to assign one or more CosGroups to each accelerator or compute node and train without the need for a distributed sampling strategy or centralized storage. This effectively reduces data movement related to the training samples because a CosGroup can be stored locally in advance on its assigned compute node.</p>
<p>This group-parallel approach can be further improved using local optimization methods (<xref ref-type="bibr" rid="B21">Stich, 2019</xref>; <xref ref-type="bibr" rid="B14">Lin et al., 2020</xref>). The core idea here is to have a master send the current model <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> to all accelerators, which optimize it in parallel for <italic>J</italic> (local) steps before returning the updates to the master. The master then averages the updates and applies them to the current model. This process is repeated for a given number of iterations until convergence. Intuitively, performing multiple local steps before averaging speeds up training by reducing the communication rate between the accelerators. It is also important to note that pure local methods allow the use of any optimizer during local training, while the master always calculates the new model as an exact average of the local models after training. A more general approach is SlowMo (<xref ref-type="bibr" rid="B26">Wang et al., 2020</xref>), which further applies stochastic gradient descent (SGD) with momentum on the master by using the exact average of the trainers&#x2019; updates as the pseudogradient. Trivially, setting the momentum term <italic>&#x3b2;</italic> &#x3d; 0 in SlowMo recovers the pure local method employed here. By implementing multiple local steps, using local methods on CosGroups allows i) respecting the problem formulation in Eq. <xref ref-type="disp-formula" rid="e2">2</xref>, ii) lowering the data movement related to training samples, and iii) achieving high communication efficiency during training. A scheme representing the parallel training procedure across different CosGroups using local methods is depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>; we call this system D-CosPlace.</p>
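<p>A SlowMo-style outer iteration can be sketched with scalar parameters (our illustrative Python following Wang et al., 2020; variable names are assumptions). Each worker starts from the master model, runs <italic>J</italic> local steps, and the master applies SGD with momentum to the pseudogradient; setting <italic>&#x3b2;</italic> &#x3d; 0 and a slow learning rate of 1 recovers exact averaging, i.e., the pure local method.</p>

```python
# Scalar sketch of one SlowMo outer round (illustrative, not library code).

def slowmo_round(theta, worker_data, J, local_step, beta, slow_lr, momentum):
    """Broadcast theta, run J local steps per worker, then apply slow
    SGD-with-momentum on the master using the averaging pseudogradient."""
    local_models = []
    for data in worker_data:
        theta_w = theta                  # each worker starts from the master model
        for _ in range(J):               # J local optimization steps
            theta_w = local_step(theta_w, data)
        local_models.append(theta_w)
    avg = sum(local_models) / len(local_models)
    pseudograd = theta - avg             # displacement from the exact average
    momentum = beta * momentum + pseudograd
    theta = theta - slow_lr * momentum   # slow momentum step on the master
    return theta, momentum
```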
</sec>
<sec id="s5">
<title>5 Experiments</title>
<sec id="s5-1">
<title>5.1 Implementation details</title>
<sec id="s5-1-1">
<title>5.1.1 Model and training datasets</title>
<p>For all the experiments, we used a backbone based on ResNet-18, followed by GeM pooling and a fully connected layer with output dimension <italic>D</italic> &#x3d; 512, as in <xref ref-type="bibr" rid="B4">Berton et al. (2022a)</xref>. As the training dataset, we used SF-XL, a large-scale dataset created from Google StreetView imagery, and retained the best hyperparameters of the original CosPlace (<italic>M</italic> &#x3d; 10 m, <italic>&#x3b1;</italic> &#x3d; 30&#xb0;, <italic>N</italic> &#x3d; 5, and <italic>L</italic> &#x3d; 2). Under this configuration, the total number of CosGroups is <inline-formula id="inf32">
<mml:math id="m35">
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi mathvariant="script">G</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>50</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, and training is performed through experiments with <inline-formula id="inf33">
<mml:math id="m36">
<mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mn>4,8,16</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, thereby demonstrating that the proposed approach scales with the number of groups (and hence the dataset size).</p>
</sec>
<sec id="s5-1-2">
<title>5.1.2 Training hyperparameters</title>
<p>For the classic CosPlace sequential training, the model is trained for <italic>s</italic> &#x3d; 10<italic>k</italic> iterations on a given CosGroup before moving on to the next. As optimizers, Adam is used for CosPlace and Local-Adam for the distributed version, with learning rates of <italic>&#x3b7;</italic>
<sub>
<italic>b</italic>
</sub> &#x3d; 10<sup>&#x2013;5</sup> and <italic>&#x3b7;</italic>
<sub>
<italic>f</italic>
</sub> &#x3d; 10<sup>&#x2013;2</sup> for the backbone <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> and classifiers <inline-formula id="inf34">
<mml:math id="m37">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mspace width="0.17em"/>
<mml:mo>&#x2200;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, respectively. Unless otherwise specified, all the algorithms employ a batch size of 32 for each trained group, mainly because of hardware memory limitations. For the distributed version, we additionally adopted a warm-up scheme by doubling the learning rate over the first three iterations. We searched for the optimal number of local steps among <italic>J</italic> &#x2208; {1, 10, 100} and found <italic>J</italic> &#x3d; 10 to be the best; similarly, the slow momentum values <italic>&#x3b2;</italic> &#x2208; {0.1, 0.3, 0.5, 0.7} were evaluated before choosing <italic>&#x3b2;</italic> &#x3d; 0.3. To provide meaningful comparisons, we considered a fixed wall-clock time budget of 60 h per experiment, measured on NVIDIA GTX 1080 GPUs.</p>
</sec>
<sec id="s5-1-3">
<title>5.1.3 Testing procedure</title>
<p>To assess the performance of the algorithms, we selected the model that performed best on the SF-XL validation set and used it to measure the Recall@1 (R@1) and Recall@5 (R@5) values. Following standard procedures (<xref ref-type="bibr" rid="B30">Zaffar et al., 2021</xref>; <xref ref-type="bibr" rid="B19">Schubert et al., 2023</xref>), Recall@N is defined as the number of queries for which at least one of the first N predictions is correct, divided by the total number of queries. A prediction is deemed correct if its distance from the query is less than 25 m (<xref ref-type="bibr" rid="B3">Arandjelovi&#x107; et al., 2018</xref>). In reporting the final performance, we tested the chosen model on the Pitts250k (<xref ref-type="bibr" rid="B24">Torii et al., 2015</xref>), Pitts30k (<xref ref-type="bibr" rid="B10">Gron&#xe1;t et al., 2013</xref>), Tokyo 24/7 (<xref ref-type="bibr" rid="B23">Torii et al., 2018</xref>), Mapillary Street Level Sequences (MSLS) (<xref ref-type="bibr" rid="B28">Warburg et al., 2020</xref>), SF-XL (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>), St. Lucia (<xref ref-type="bibr" rid="B16">Milford and Wyeth, 2008</xref>), SVOX (<xref ref-type="bibr" rid="B6">Berton et al., 2021</xref>), and Nordland (<xref ref-type="bibr" rid="B22">S&#xfc;nderhauf et al., 2013</xref>) datasets.</p>
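<p>For illustration, the Recall@N metric described above can be computed as follows (a minimal helper of our own, not from any evaluation library), using the 25 m correctness threshold.</p>

```python
# Minimal Recall@N sketch: a query is a hit if at least one of its first N
# ranked predictions lies within threshold_m metres of the query position.

def recall_at_n(pred_dists_m, n, threshold_m=25.0):
    """pred_dists_m[q] = geo-distances (metres) of query q's ranked predictions."""
    hits = sum(any(d < threshold_m for d in dists[:n]) for dists in pred_dists_m)
    return hits / len(pred_dists_m)
```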
</sec>
</sec>
<sec id="s5-2">
<title>5.2 D-CosPlace vs CosPlace</title>
<p>In this section, we compare the results obtained by D-CosPlace with those from the original CosPlace algorithm in terms of both convergence speed (cf. <xref ref-type="table" rid="T2">Table 2</xref>) and final model quality given the time budget (cf. <xref ref-type="table" rid="T1">Table 1</xref>).</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Final model quality comparisons between CosPlace and D-CosPlace for equal training times on several VG datasets and varying numbers of CosGroups used during training. The results show that D-CosPlace can leverage multiple CosGroups, outperforming the vanilla CosPlace on average. The best overall results for each dataset are shown in boldface, while the best result for each number of CosGroups is underlined.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Method</th>
<th rowspan="2" align="left">&#x23;CosGroups</th>
<th colspan="2" align="center">Pitts-30k</th>
<th colspan="2" align="center">Pitts-250k</th>
<th colspan="2" align="center">Tokyo 24/7</th>
<th colspan="2" align="center">MSLS</th>
<th colspan="2" align="center">SF-XL v1</th>
<th colspan="2" align="center">SF-XL v2</th>
<th colspan="2" align="center">Average</th>
</tr>
<tr>
<th align="left">R@1</th>
<th align="left">R@5</th>
<th align="left">R@1</th>
<th align="left">R@5</th>
<th align="left">R@1</th>
<th align="left">R@5</th>
<th align="left">R@1</th>
<th align="left">R@5</th>
<th align="left">R@1</th>
<th align="left">R@5</th>
<th align="left">R@1</th>
<th align="left">R@5</th>
<th align="left">R@1</th>
<th align="left">R@5</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">CosPlace (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>)</td>
<td rowspan="3" align="center">4</td>
<td align="left">89.4</td>
<td align="left">95.0</td>
<td align="left">90.5</td>
<td align="left">97.1</td>
<td align="left">80.0</td>
<td align="left">89.5</td>
<td align="left">81.0</td>
<td align="left">87.7</td>
<td align="left">65.6</td>
<td align="left">73.0</td>
<td align="left">85.6</td>
<td align="left">91.8</td>
<td align="left">82.0</td>
<td align="left">89.0</td>
</tr>
<tr>
<td align="left">D-CosPlace</td>
<td align="left">89.6</td>
<td align="left">94.8</td>
<td align="left">90.4</td>
<td align="left">96.6</td>
<td align="left">77.8</td>
<td align="left">90.8</td>
<td align="left">83.0</td>
<td align="left">89.5</td>
<td align="left">67.6</td>
<td align="left">76.0</td>
<td align="left">85.4</td>
<td align="left">92.5</td>
<td align="left">82.3</td>
<td align="left">90.0</td>
</tr>
<tr>
<td align="left">D-CosPlace (w/SlowMo)</td>
<td align="left">90.0</td>
<td align="left">95.0</td>
<td align="left">90.6</td>
<td align="left">96.6</td>
<td align="left">
<bold>81.3</bold>
</td>
<td align="left">
<bold>91.7</bold>
</td>
<td align="left">82.2</td>
<td align="left">89.5</td>
<td align="left">68.5</td>
<td align="left">75.5</td>
<td align="left">85.4</td>
<td align="left">92.3</td>
<td align="left">83.0</td>
<td align="left">90.1</td>
</tr>
<tr>
<td align="left">CosPlace (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>)</td>
<td rowspan="3" align="center">8</td>
<td align="left">89.5</td>
<td align="left">94.8</td>
<td align="left">90.4</td>
<td align="left">96.9</td>
<td align="left">81.6</td>
<td align="left">90.2</td>
<td align="left">81.8</td>
<td align="left">88.7</td>
<td align="left">65.5</td>
<td align="left">74.1</td>
<td align="left">84.6</td>
<td align="left">91.6</td>
<td align="left">82.2</td>
<td align="left">89.4</td>
</tr>
<tr>
<td align="left">D-CosPlace</td>
<td align="left">90.1</td>
<td align="left">
<bold>95.2</bold>
</td>
<td align="left">91.4</td>
<td align="left">
<bold>97.3</bold>
</td>
<td align="left">80.3</td>
<td align="left">89.8</td>
<td align="left">83.2</td>
<td align="left">89.9</td>
<td align="left">70.4</td>
<td align="left">78.8</td>
<td align="left">86.4</td>
<td align="left">93.6</td>
<td align="left">83.6</td>
<td align="left">90.8</td>
</tr>
<tr>
<td align="left">D-CosPlace (w/SlowMo)</td>
<td align="left">90.0</td>
<td align="left">
<bold>95.2</bold>
</td>
<td align="left">
<bold>91.5</bold>
</td>
<td align="left">96.9</td>
<td align="left">80.9</td>
<td align="left">
<bold>91.7</bold>
</td>
<td align="left">83.3</td>
<td align="left">89.8</td>
<td align="left">70.4</td>
<td align="left">78.9</td>
<td align="left">86.6</td>
<td align="left">94.0</td>
<td align="left">83.8</td>
<td align="left">91.1</td>
</tr>
<tr>
<td align="left">CosPlace (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>)</td>
<td rowspan="3" align="center">16</td>
<td align="left">89.4</td>
<td align="left">94.9</td>
<td align="left">90.4</td>
<td align="left">96.7</td>
<td align="left">78.4</td>
<td align="left">89.2</td>
<td align="left">81.5</td>
<td align="left">88.2</td>
<td align="left">64.5</td>
<td align="left">73.4</td>
<td align="left">84.8</td>
<td align="left">91.5</td>
<td align="left">81.5</td>
<td align="left">89.0</td>
</tr>
<tr>
<td align="left">D-CosPlace</td>
<td align="left">
<bold>90.3</bold>
</td>
<td align="left">
<bold>95.2</bold>
</td>
<td align="left">91.1</td>
<td align="left">96.9</td>
<td align="left">80.6</td>
<td align="left">89.5</td>
<td align="left">83.0</td>
<td align="left">89.9</td>
<td align="left">69.2</td>
<td align="left">78.9</td>
<td align="left">86.6</td>
<td align="left">93.3</td>
<td align="left">83.5</td>
<td align="left">90.6</td>
</tr>
<tr>
<td align="left">D-CosPlace (w/SlowMo)</td>
<td align="left">90.0</td>
<td align="left">95.0</td>
<td align="left">91.3</td>
<td align="left">97.2</td>
<td align="left">78.4</td>
<td align="left">90.8</td>
<td align="left">
<bold>84.1</bold>
</td>
<td align="left">
<bold>90.4</bold>
</td>
<td align="left">
<bold>71.2</bold>
</td>
<td align="left">
<bold>79.7</bold>
</td>
<td align="left">
<bold>88.1</bold>
</td>
<td align="left">
<bold>94.2</bold>
</td>
<td align="left">
<bold>86.4</bold>
</td>
<td align="left">
<bold>91.2</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s5-2-1">
<title>5.2.1 Convergence speed</title>
<p>We compared the convergence speed of D-CosPlace to that of the vanilla CosPlace. For both algorithms, we report the wall-clock training times under the same conditions using a single GPU and 4 GPUs separately. The results in <xref ref-type="table" rid="T2">Table 2</xref> show that D-CosPlace achieves the same final accuracy as CosPlace while requiring less than half of the time budget. This is because the proposed parallel training procedure avoids the training instabilities caused by changing the CosGroup, thus exploiting the classification proxy task more efficiently.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Convergence speed comparisons between CosPlace and D-CosPlace using <inline-formula id="inf35">
<mml:math id="m38">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>8</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> (best results in boldface): D-CosPlace can achieve the same accuracy as CosPlace for a fraction of the total wall-clock time. Alternatively, it surpasses the performance of the vanilla CosPlace within the time budget.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Method</th>
<th colspan="2" align="center">Wall-clock time (hh:mm)</th>
<th colspan="2" align="center">Best accuracy (SF-XL val)</th>
</tr>
<tr>
<th align="center">Target R@1</th>
<th align="center">Best R@1</th>
<th align="center">R@1</th>
<th align="center">R@5</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">CosPlace (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>)</td>
<td align="left">57:30</td>
<td align="left">57:30</td>
<td align="left">90.9</td>
<td align="left">95.5</td>
</tr>
<tr>
<td align="left">D-CosPlace J &#x3d; 1</td>
<td align="left">42:00</td>
<td align="left">
<bold>49:50</bold>
</td>
<td align="left">91.4</td>
<td align="left">96.2</td>
</tr>
<tr>
<td align="left">D-CosPlace J &#x3d; 10</td>
<td align="left">
<bold>25:50</bold>
</td>
<td align="left">59:25</td>
<td align="left">
<bold>92.2</bold>
</td>
<td align="left">
<bold>96.6</bold>
</td>
</tr>
<tr>
<td align="left">D-CosPlace J &#x3d; 100</td>
<td align="left">26:02</td>
<td align="left">54:26</td>
<td align="left">91.6</td>
<td align="left">96.5</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s5-2-2">
<title>5.2.2 Final model quality</title>
<p>In addition to being significantly faster, D-CosPlace also achieves better final model quality within the time budget. <xref ref-type="table" rid="T1">Table 1</xref> shows that the distributed version consistently outperforms the vanilla baseline on all the tested datasets. The reason behind this rather prominent gap is that our formulation effectively implements the objective function in Eq. <xref ref-type="disp-formula" rid="e2">2</xref>, which CosPlace only approximates by optimizing one CosGroup at a time.</p>
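<p>The gap between the two training protocols can be illustrated with a toy example. In the sketch below, hypothetical quadratic per-group losses stand in for the actual CosGroup objectives; the point is only that sequentially minimizing one group loss at a time is not the same as descending their sum.</p>

```python
import numpy as np

def sequential_training(theta, group_grads, lr=0.1, epochs=50):
    """CosPlace-style protocol: optimize one group's loss at a time,
    so the shared parameters keep drifting toward the current group."""
    for _ in range(epochs):
        for g in group_grads:
            theta = theta - lr * g(theta)
    return theta

def joint_training(theta, group_grads, lr=0.1, epochs=50):
    """D-CosPlace-style protocol: every step descends the average of
    ALL group losses, i.e., the intended overall objective."""
    for _ in range(epochs):
        for _ in range(len(group_grads)):
            theta = theta - lr * np.mean([g(theta) for g in group_grads], axis=0)
    return theta

# Hypothetical quadratic group losses L_i(theta) = 0.5 * (theta - c_i)^2,
# whose gradients are (theta - c_i); the joint optimum is the mean of c_i.
centers = [np.array([0.0]), np.array([2.0])]
grads = [lambda t, c=c: t - c for c in centers]

theta_seq = sequential_training(np.array([3.0]), grads)
theta_joint = joint_training(np.array([3.0]), grads)
```

<p>On this toy problem, joint training converges to the minimizer of the summed objective (1.0), whereas the sequential protocol settles on a limit cycle biased toward the last-trained group.</p>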
</sec>
<sec id="s5-2-3">
<title>5.2.3 Scalability on the number of CosGroups</title>
<p>To further corroborate the claim that our formulation of CosPlace training is effective for exploiting larger datasets, we present the results for various numbers of training groups. It is noted that the original CosPlace treats <inline-formula id="inf36">
<mml:math id="m39">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> as a hyperparameter and determined that <inline-formula id="inf37">
<mml:math id="m40">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">G</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>8</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> worked best, whereas adding more groups would be detrimental. The results in <xref ref-type="table" rid="T1">Table 1</xref> confirm this limitation of CosPlace and show that D-CosPlace can effectively utilize more CosGroups, owing to the formulation of the objective function of Eq. <xref ref-type="disp-formula" rid="e2">2</xref>.</p>
</sec>
<sec id="s5-2-4">
<title>5.2.4 Fair comparison with larger batch size</title>
<p>Since the distributed version trains <italic>N</italic>
<sub>
<italic>t</italic>
</sub> groups in parallel using the same original batch size for each group (i.e., for each respective classifier), the actual batch size with respect to <italic>&#x3b8;</italic>
<sup>
<italic>b</italic>
</sup> is <italic>N</italic>
<sub>
<italic>t</italic>
</sub> times larger than that used for the vanilla CosPlace. For a fair comparison, we also trained CosPlace with this same larger batch size to investigate whether a larger batch size alone would suffice for faster convergence. The results presented in <xref ref-type="fig" rid="F4">Figure 4</xref> show that increasing the batch size benefits neither the convergence speed nor the final model quality, further corroborating that CosPlace&#x2019;s problem lies in its sequential training procedure.</p>
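<p>As a sketch of why the effective batch size scales with <italic>N</italic><sub><italic>t</italic></sub>, the toy example below uses a hypothetical linear model in place of the backbone: averaging the gradients computed by the parallel groups on batches of size <italic>B</italic> is equivalent to one gradient computed on the concatenated batch of size <italic>N</italic><sub><italic>t</italic></sub>&#xb7;<italic>B</italic>.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "backbone": per-sample loss 0.5 * (x @ w - y)^2,
# so the batch gradient is X^T (X w - y) / |batch|.
def grad(X, y, w):
    return X.T @ (X @ w - y) / len(X)

n_groups, B, d = 4, 8, 3          # N_t parallel groups, per-group batch, dims
w = rng.normal(size=d)
batches = [(rng.normal(size=(B, d)), rng.normal(size=B)) for _ in range(n_groups)]

# Data-parallel step: each group computes its own gradient; they are averaged.
g_avg = np.mean([grad(X, y, w) for X, y in batches], axis=0)

# Equivalent single step on the concatenated batch of size N_t * B.
X_all = np.concatenate([X for X, _ in batches])
y_all = np.concatenate([y for _, y in batches])
g_big = grad(X_all, y_all, w)
```

<p>The two gradients coincide, so with respect to the shared backbone the distributed setup behaves like a single model trained with an <italic>N</italic><sub><italic>t</italic></sub>-times larger batch.</p>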
</sec>
</sec>
<sec id="s5-3">
<title>5.3 Ablation study: effect of local steps</title>
<p>Local steps make the distributed training more efficient from a communication perspective by lowering the synchronization frequency. However, although a large number of local steps is desirable for this reason, too many steps can slow convergence when the training distributions differ, as in our case. For this reason, <italic>J</italic> is treated as a hyperparameter. <xref ref-type="table" rid="T2">Table 2</xref> shows the impact of the local steps on the convergence speed and final model quality, where the former is expressed as the wall-clock time to reach the accuracy of the vanilla CosPlace and the latter as R@1/R@5. It can be seen that <italic>J</italic> &#x3d; 10 offers the best balance between training time, convergence speed, and final model quality.</p>
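<p>The trade-off governed by <italic>J</italic> can be sketched with a minimal local-SGD loop; the hypothetical quadratic worker objectives below stand in for the per-node losses. Each worker takes <italic>J</italic> local steps before the models are averaged, so a larger <italic>J</italic> spends fewer communication rounds for the same total number of gradient steps.</p>

```python
import numpy as np

def local_sgd(theta0, worker_grads, J, rounds, lr=0.1):
    """Local-SGD sketch: each worker performs J local steps on its own
    objective, then the models are averaged (one communication round)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(rounds):
        local_models = []
        for g in worker_grads:
            t = theta.copy()
            for _ in range(J):
                t = t - lr * g(t)
            local_models.append(t)
        theta = np.mean(local_models, axis=0)   # synchronization barrier
    return theta

# Hypothetical per-worker quadratics with different minima (0.0 and 2.0),
# mimicking workers that hold different CosGroups.
centers = [0.0, 2.0]
grads = [lambda t, c=c: t - c for c in centers]

# Both runs take 100 gradient steps per worker, but J = 10 synchronizes
# 10x less often; on this toy problem both reach the consensus optimum.
theta_j1 = local_sgd([3.0], grads, J=1, rounds=100)
theta_j10 = local_sgd([3.0], grads, J=10, rounds=10)
```

<p>On harder, non-quadratic objectives the two trajectories diverge as <italic>J</italic> grows, which is why <italic>J</italic> must be tuned rather than made as large as possible.</p>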
</sec>
<sec id="s5-4">
<title>5.4 Comparisons with other methods</title>
<sec id="s5-4-1">
<title>5.4.1 Baselines</title>
<p>Herein, we compare D-CosPlace with a number of SOTA VPR methods, namely, the evergreen NetVLAD <xref ref-type="bibr" rid="B3">Arandjelovi&#x107; et al. (2018)</xref>, SFRS <xref ref-type="bibr" rid="B8">Ge et al. (2020)</xref>, which improves on NetVLAD with an ingenious augmentation technique, Conv-AP <xref ref-type="bibr" rid="B1">Ali-bey et al. (2022)</xref>, which uses a multisimilarity loss <xref ref-type="bibr" rid="B27">Wang et al. (2019)</xref>, CosPlace <xref ref-type="bibr" rid="B4">Berton et al. (2022a)</xref>, and MixVPR <xref ref-type="bibr" rid="B2">Ali-bey et al. (2023)</xref>, which uses a powerful and efficient MLP-mixer as the aggregator. For NetVLAD and SFRS, we use the authors&#x2019; best-performing backbone, the VGG16 (<xref ref-type="bibr" rid="B20">Simonyan and Zisserman, 2015</xref>), whereas for all the other methods, we use their respective implementations with a ResNet-50 backbone and an output dimensionality of 512.</p>
</sec>
<sec id="s5-4-2">
<title>5.4.2 Results</title>
<p>As seen from the results in <xref ref-type="table" rid="T3">Table 3</xref>, D-CosPlace not only improves upon the vanilla CosPlace by a large margin of &#x2b;11.5% on average R@1 but also sets a new state of the art for VPR, surpassing Conv-AP by &#x2b;1.6% on average R@1. These results show that the improved formulation of the classification proxy task originally introduced in CosPlace effectively learns better features for image retrieval.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Final model quality comparisons with state-of-the-art VPR approaches on several datasets using ResNet-50 as the backbone. The best overall results for each dataset are in boldface, and the second-best results are underlined. D-CosPlace outperforms the competitors (including CosPlace) in all cases except the &#x201c;Tokyo 24/7&#x201d; and &#x201c;MSLS&#x201d; datasets. We believe that this may be attributed to the superior fitting capabilities of D-CosPlace combined with these datasets being particularly different from the one used to train the models. However, D-CosPlace outperforms CosPlace by a large margin (&#x2b;11.5% on R@1) on average.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Method</th>
<th colspan="2" align="center">Pitts30k</th>
<th colspan="2" align="center">Pitts250k</th>
<th colspan="2" align="center">Tokyo 24/7</th>
<th colspan="2" align="center">MSLS</th>
<th colspan="2" align="center">SF-XL v1</th>
<th colspan="2" align="center">SF-XL v2</th>
<th colspan="2" align="center">St. Lucia</th>
</tr>
<tr>
<th align="center">R@1</th>
<th align="center">R@5</th>
<th align="center">R@1</th>
<th align="center">R@5</th>
<th align="center">R@1</th>
<th align="center">R@5</th>
<th align="center">R@1</th>
<th align="center">R@5</th>
<th align="center">R@1</th>
<th align="center">R@5</th>
<th align="center">R@1</th>
<th align="center">R@5</th>
<th align="center">R@1</th>
<th align="center">R@5</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">NetVLAD (<xref ref-type="bibr" rid="B3">Arandjelovi&#x107; et al., 2018</xref>)</td>
<td align="left">85.0</td>
<td align="left">92.1</td>
<td align="left">85.9</td>
<td align="left">93.1</td>
<td align="left">69.8</td>
<td align="left">81.3</td>
<td align="left">58.9</td>
<td align="left">70.8</td>
<td align="left">40.0</td>
<td align="left">52.9</td>
<td align="left">76.9</td>
<td align="left">88.8</td>
<td align="left">64.6</td>
<td align="left">80.3</td>
</tr>
<tr>
<td align="left">SFRS (<xref ref-type="bibr" rid="B8">Ge et al., 2020</xref>)</td>
<td align="left">89.1</td>
<td align="left">94.6</td>
<td align="left">90.4</td>
<td align="left">96.3</td>
<td align="left">80.3</td>
<td align="left">88.6</td>
<td align="left">70.0</td>
<td align="left">80.0</td>
<td align="left">50.3</td>
<td align="left">60.0</td>
<td align="left">83.8</td>
<td align="left">90.5</td>
<td align="left">75.9</td>
<td align="left">86.6</td>
</tr>
<tr>
<td align="left">Conv-AP (<xref ref-type="bibr" rid="B1">Ali-bey et al., 2022</xref>)</td>
<td align="left">89.1</td>
<td align="left">94.6</td>
<td align="left">90.4</td>
<td align="left">96.7</td>
<td align="left">61.3</td>
<td align="left">77.8</td>
<td align="left">82.3</td>
<td align="left">90.3</td>
<td align="left">41.8</td>
<td align="left">53.1</td>
<td align="left">64.0</td>
<td align="left">81.2</td>
<td align="left">99.1</td>
<td align="left">99.99</td>
</tr>
<tr>
<td align="left">CosPlace (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>)</td>
<td align="left">90.2</td>
<td align="left">95.2</td>
<td align="left">91.7</td>
<td align="left">97.0</td>
<td align="left">
<bold>89.5</bold>
</td>
<td align="left">
<bold>94.9</bold>
</td>
<td align="left">
<bold>86.9</bold>
</td>
<td align="left">
<bold>93.2</bold>
</td>
<td align="left">76.7</td>
<td align="left">82.5</td>
<td align="left">89.0</td>
<td align="left">95.3</td>
<td align="left">99.2</td>
<td align="left">99.99</td>
</tr>
<tr>
<td align="left">MixVPR (<xref ref-type="bibr" rid="B2">Ali-bey et al., 2023</xref>)</td>
<td align="left">90.4</td>
<td align="left">95.4</td>
<td align="left">
<bold>93.0</bold>
</td>
<td align="left">
<bold>97.8</bold>
</td>
<td align="left">78.4</td>
<td align="left">86.7</td>
<td align="left">83.6</td>
<td align="left">91.5</td>
<td align="left">57.7</td>
<td align="left">70.3</td>
<td align="left">84.3</td>
<td align="left">91.6</td>
<td align="left">99.2</td>
<td align="left">99.99</td>
</tr>
<tr>
<td align="left">
<bold>D-CosPlace (proposed)</bold>
</td>
<td align="left">
<bold>91.2</bold>
</td>
<td align="left">
<bold>95.7</bold>
</td>
<td align="left">92.3</td>
<td align="left">97.3</td>
<td align="left">85.7</td>
<td align="left">94.0</td>
<td align="left">86.1</td>
<td align="left">91.9</td>
<td align="left">
<bold>80.9</bold>
</td>
<td align="left">
<bold>86.2</bold>
</td>
<td align="left">
<bold>91.0</bold>
</td>
<td align="left">
<bold>95.7</bold>
</td>
<td align="left">
<bold>99.5</bold>
</td>
<td align="left">
<bold>100.0</bold>
</td>
</tr>
<tr>
<td align="left" style="background-color:#BFBFBF"/>
<td colspan="2" align="left" style="background-color:#BFBFBF">Nordland</td>
<td colspan="2" align="left" style="background-color:#BFBFBF">SVOX night</td>
<td colspan="2" align="left" style="background-color:#BFBFBF">SVOX overcast</td>
<td colspan="2" align="left" style="background-color:#BFBFBF">SVOX rain</td>
<td colspan="2" align="left" style="background-color:#BFBFBF">SVOX snow</td>
<td colspan="2" align="left" style="background-color:#BFBFBF">SVOX sun</td>
<td colspan="2" align="left" style="background-color:#BFBFBF">Average</td>
</tr>
<tr>
<td align="left">NetVLAD (<xref ref-type="bibr" rid="B3">Arandjelovi&#x107; et al., 2018</xref>)</td>
<td align="left">13.1</td>
<td align="left">21.1</td>
<td align="left">8.0</td>
<td align="left">17.4</td>
<td align="left">66.4</td>
<td align="left">81.5</td>
<td align="left">51.5</td>
<td align="left">69.3</td>
<td align="left">54.4</td>
<td align="left">71.8</td>
<td align="left">35.4</td>
<td align="left">52.7</td>
<td align="left">58.5</td>
<td align="left">67.2</td>
</tr>
<tr>
<td align="left">SFRS (<xref ref-type="bibr" rid="B8">Ge et al., 2020</xref>)</td>
<td align="left">16.0</td>
<td align="left">24.1</td>
<td align="left">28.6</td>
<td align="left">40.6</td>
<td align="left">81.1</td>
<td align="left">88.4</td>
<td align="left">69.7</td>
<td align="left">81.5</td>
<td align="left">76.0</td>
<td align="left">86.1</td>
<td align="left">54.8</td>
<td align="left">68.3</td>
<td align="left">66.6</td>
<td align="left">75.8</td>
</tr>
<tr>
<td align="left">Conv-AP (<xref ref-type="bibr" rid="B1">Ali-bey et al., 2022</xref>)</td>
<td align="left">66.5</td>
<td align="left">79.7</td>
<td align="left">51.6</td>
<td align="left">68.8</td>
<td align="left">90.0</td>
<td align="left">96.6</td>
<td align="left">87.3</td>
<td align="left">94.7</td>
<td align="left">89.5</td>
<td align="left">97.0</td>
<td align="left">75.9</td>
<td align="left">88.3</td>
<td align="left">83.4</td>
<td align="left">91.0</td>
</tr>
<tr>
<td align="left">CosPlace (<xref ref-type="bibr" rid="B4">Berton et al., 2022a</xref>)</td>
<td align="left">59.2</td>
<td align="left">74.6</td>
<td align="left">36.0</td>
<td align="left">52.5</td>
<td align="left">90.5</td>
<td align="left">95.9</td>
<td align="left">80.3</td>
<td align="left">90.0</td>
<td align="left">86.4</td>
<td align="left">95.3</td>
<td align="left">75.3</td>
<td align="left">88.1</td>
<td align="left">73.5</td>
<td align="left">83.3</td>
</tr>
<tr>
<td align="left">MixVPR (<xref ref-type="bibr" rid="B2">Ali-bey et al., 2023</xref>)</td>
<td align="left">
<bold>67.2</bold>
</td>
<td align="left">
<bold>81.0</bold>
</td>
<td align="left">44.8</td>
<td align="left">63.2</td>
<td align="left">93.9</td>
<td align="left">97.7</td>
<td align="left">86.4</td>
<td align="left">93.9</td>
<td align="left">
<bold>93.9</bold>
</td>
<td align="left">97.6</td>
<td align="left">78.7</td>
<td align="left">91.2</td>
<td align="left">80.9</td>
<td align="left">89.7</td>
</tr>
<tr>
<td align="left">
<bold>D-CosPlace (proposed)</bold>
</td>
<td align="left">65.6</td>
<td align="left">79.7</td>
<td align="left">
<bold>56.5</bold>
</td>
<td align="left">
<bold>73.0</bold>
</td>
<td align="left">
<bold>94.6</bold>
</td>
<td align="left">
<bold>97.8</bold>
</td>
<td align="left">
<bold>88.8</bold>
</td>
<td align="left">
<bold>96.1</bold>
</td>
<td align="left">91.1</td>
<td align="left">
<bold>97.8</bold>
</td>
<td align="left">
<bold>81.9</bold>
</td>
<td align="left">
<bold>91.9</bold>
</td>
<td align="left">
<bold>85.0</bold>
</td>
<td align="left">
<bold>92.1</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s5-5">
<title>5.5 Open challenges</title>
<p>Our analysis in <xref ref-type="sec" rid="s3-3">Section 3.3</xref> reveals that CosPlace&#x2019;s training procedure experiences severe jumps in the loss function because the optimization procedure does not correctly implement the objective function in Eq. <xref ref-type="disp-formula" rid="e2">2</xref>. Indeed, the sharp jumps in loss occur only in the vanilla CosPlace because its training process optimizes different CosGroups (and their related classification heads) one at a time. This does not occur in D-CosPlace since all classifiers associated with the CosGroups are jointly optimized (<xref ref-type="fig" rid="F3">Figure 3</xref>). A second challenge that we experienced with CosPlace is the noisy optimization of a single CosGroup, as shown by the loss in <xref ref-type="fig" rid="F4">Figure 4</xref>. The training loss is particularly unstable and remains high for many steps before dropping abruptly, with a seemingly periodic cycle every <inline-formula id="inf38">
<mml:math id="m41">
<mml:mrow>
<mml:mo>&#x2248;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> steps. We initially attributed this behavior to the batch size, which is small compared to the output dimensionality of the final layer. Each CosGroup is in fact associated with <inline-formula id="inf39">
<mml:math id="m42">
<mml:mrow>
<mml:mo>&#x2248;</mml:mo>
<mml:mn>35</mml:mn>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> classes on average, which makes the problem hard to learn. Additionally, the LMCL loss seeks a hard margin boundary, which can be difficult to achieve in high-dimensional problems. To validate this hypothesis, we increased the batch size until it filled the memory of an NVIDIA V100 32-GB GPU. The results in <xref ref-type="fig" rid="F4">Figure 4</xref> show that the problem persists even after increasing the batch size to 1,024 samples. Moreover, in terms of validation results, the initial batch size of 32 still gives the best performance, substantiating the conclusion that increasing the batch size is not a practical solution. This difficulty in learning a single CosGroup is still present in D-CosPlace, since the optimization with respect to a single CosGroup is the same as in CosPlace. We believe this to be an intrinsic limitation of the classification approach of CosPlace and an interesting direction for future work.</p>
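<p>For reference, the margin mechanism of the LMCL objective mentioned above can be sketched as follows. This NumPy toy uses 35 classes instead of the &#x2248;35k of a real CosGroup; all names and sizes are illustrative.</p>

```python
import numpy as np

def lmcl_loss(features, prototypes, labels, s=30.0, m=0.40):
    """Large Margin Cosine Loss (CosFace-style) sketch: logits are the
    cosine similarities between L2-normalized embeddings and class
    prototypes; the margin m is subtracted from the target-class cosine
    before the scaled softmax cross-entropy."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cos = f @ w.T                                         # (N, C) cosines in [-1, 1]
    one_hot = np.eye(w.shape[0])[labels]
    logits = s * (cos - m * one_hot)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))       # 4 embeddings of dimension 8
protos = rng.normal(size=(35, 8))     # 35 classes (vs ~35k in a real CosGroup)
labels = np.array([0, 1, 2, 3])

# The hard margin makes the objective strictly harder than plain softmax:
loss_margin = lmcl_loss(feats, protos, labels, m=0.40)
loss_plain = lmcl_loss(feats, protos, labels, m=0.0)
```

<p>Because the margin is subtracted only from the target-class cosine, the loss with <italic>m</italic> &gt; 0 always exceeds the plain softmax cross-entropy, which is consistent with the instability we observe when the number of classes dwarfs the batch size.</p>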
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Noisy loss during training of a CosGroup in CosPlace: the training loss is plotted for the first 2.5<italic>k</italic> steps, which correspond to an iteration with batch size 128 and two iterations with batch size 256. It can be observed that the stepwise behavior remains even after enlarging the batch size, suggesting that other factors may be involved. The abrupt jumps observed for the orange, green, and red lines are attributed to the changes in the trained CosGroups (and hence the final classification head), which occur in fewer steps with respect to the blue line, owing to the increase in batch size.</p>
</caption>
<graphic xlink:href="frobt-11-1386464-g004.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="conclusion" id="s6">
<title>6 Conclusion</title>
<p>In this work, we analyzed the training procedure of CosPlace, a recent SOTA large-scale VPR method, showing that its sequential protocol does not correctly implement the intended objective. By taking an incremental perspective on the problem, we modified the training procedure so that it correctly optimizes the learning objective function. This new formulation enables efficient distributed training, since it allows disjoint subsets of the dataset to be preallocated to the assigned compute nodes, and it benefits from multiple local training steps. In particular, we show that i) D-CosPlace converges faster than CosPlace and ii) within a fixed time budget, D-CosPlace outperforms CosPlace by a large margin. We also outline some open challenges in further speeding up the training of CosPlace, highlighting the instabilities during the training of the CosGroups. We believe that these insights are valuable for the research community, not only in the field of VPR but also in other large-scale image retrieval tasks.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s8">
<title>Author contributions</title>
<p>RZ: formal analysis, investigation, methodology, and writing&#x2013;original draft. GB: validation and writing&#x2013;review and editing. CM: conceptualization, supervision, and writing&#x2013;review and editing.</p>
</sec>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This study was carried out within the project FAIR - Future Artificial Intelligence Research - and received funding from the European Union Next-GenerationEU [PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) &#x2013; MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 &#x2013; D.D. 1555 11/10/2022, PE00000013 - CUP: E13C22001800001]. This manuscript reflects only the authors&#x2019; views and opinions, neither the European Union nor the European Commission can be considered responsible for them. A part of the computational resources for this work was provided by hpc@polito, which is a Project of Academic Computing within the Department of Control and Computer Engineering at the Politecnico di Torino (<ext-link ext-link-type="uri" xlink:href="http://www.hpc.polito.it">http://www.hpc.polito.it</ext-link>). We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources. This work was supported by CINI.</p>
</sec>
<sec sec-type="COI-statement" id="s10">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ali-bey</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Chaib-draa</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Gigu&#xe8;re</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>GSV-cities: toward appropriate supervised visual place recognition</article-title>. <source>Neurocomputing</source> <volume>513</volume>, <fpage>194</fpage>&#x2013;<lpage>203</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2022.09.127</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ali-bey</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Chaib-draa</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Gigu&#xe8;re</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2023</year>). &#x201c;<article-title>MixVPR: feature mixing for visual place recognition</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</conf-name>, <conf-loc>Waikoloa, Hawaii, USA</conf-loc>, <conf-date>3-7 January 2023</conf-date>, <fpage>2998</fpage>&#x2013;<lpage>3007</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arandjelovi&#x107;</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Gronat</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Torii</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pajdla</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Sivic</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>NetVLAD: CNN architecture for weakly supervised place recognition</article-title>. <source>IEEE Trans. Pattern Analysis Mach. Intell.</source> <volume>40</volume>, <fpage>1437</fpage>&#x2013;<lpage>1451</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2017.2711011</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Berton</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Masone</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Caputo</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2022a</year>). &#x201c;<article-title>Rethinking visual geo-localization for large-scale applications</article-title>,&#x201d; in <source>CVPR</source>.</citation>
</ref>
<ref id="B5">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Berton</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Mereu</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Trivigno</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Masone</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Csurka</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Sattler</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2022b</year>). &#x201c;<article-title>Deep visual geo-localization benchmark</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>New Orleans, LA, USA</conf-loc>, <conf-date>June 18 2022 to June 24 2022</conf-date>.</citation>
</ref>
<ref id="B6">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Berton</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Paolicelli</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Masone</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Caputo</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Adaptive-attentive geolocalization from few queries: a hybrid approach</article-title>,&#x201d; in <conf-name>IEEE Winter Conference on Applications of Computer Vision</conf-name>, <conf-loc>Waikoloa, HI, USA</conf-loc>, <conf-date>January 3-8, 2021</conf-date>, <fpage>2918</fpage>&#x2013;<lpage>2927</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Berton</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Trivigno</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Caputo</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Masone</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2023</year>). &#x201c;<article-title>Eigenplaces: training viewpoint robust models for visual place recognition</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</conf-name>, <conf-loc>Paris - France</conf-loc>, <conf-date>October 2-6, 2023</conf-date>, <fpage>11080</fpage>&#x2013;<lpage>11090</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ge</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Self-supervising fine-grained region similarities for large-scale image localization</article-title>,&#x201d; in <source>Computer vision &#x2013; eccv 2020</source>. Editors <person-group person-group-type="editor">
<name>
<surname>Vedaldi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bischof</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Brox</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Frahm</surname>
<given-names>J.-M.</given-names>
</name>
</person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>369</fpage>&#x2013;<lpage>386</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Goodfellow</surname>
<given-names>I. J.</given-names>
</name>
<name>
<surname>Mirza</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Da</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Courville</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>An empirical investigation of catastrophic forgetting in gradient-based neural networks</article-title>,&#x201d; in <conf-name>2nd International Conference on Learning Representations, ICLR 2014, Conference Track Proceedings</conf-name>, <conf-loc>Banff, AB, Canada</conf-loc>, <conf-date>April 14-16, 2014</conf-date>.</citation>
</ref>
<ref id="B10">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Gron&#xe1;t</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Obozinski</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Sivic</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Pajdla</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2013</year>). &#x201c;<article-title>Learning and calibrating per-location classifiers for visual place recognition</article-title>,&#x201d; in <conf-name>2013 IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Portland, OR, USA</conf-loc>, <conf-date>June 23 2013 to June 28 2013</conf-date>, <fpage>907</fpage>&#x2013;<lpage>914</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>H. J.</given-names>
</name>
<name>
<surname>Dunn</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Frahm</surname>
<given-names>J.-M.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Learned contextual feature reweighting for image geo-localization</article-title>,&#x201d; in <conf-name>IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, HI, USA</conf-loc>, <conf-date>July 21 2017 to July 26 2017</conf-date>, <fpage>3251</fpage>&#x2013;<lpage>3260</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Leyva-Vallina</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Strisciuglio</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Petkov</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2023</year>). &#x201c;<article-title>Data-efficient large scale place recognition with graded similarity supervision</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Vancouver, BC, Canada</conf-loc>, <conf-date>June 17 2023 to June 24 2023</conf-date>, <fpage>23487</fpage>&#x2013;<lpage>23496</lpage>.</citation>
</ref>
<ref id="B13">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Varma</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Salpekar</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Noordhuis</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>) <source>PyTorch distributed: experiences on accelerating data parallel training</source>.</citation>
</ref>
<ref id="B14">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Stich</surname>
<given-names>S. U.</given-names>
</name>
<name>
<surname>Patel</surname>
<given-names>K. K.</given-names>
</name>
<name>
<surname>Jaggi</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Don&#x2019;t use large mini-batches, use local sgd</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>, <conf-loc>Addis Ababa, Ethiopia</conf-loc>, <conf-date>April 26-30, 2020</conf-date>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Masone</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Caputo</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A survey on deep visual place recognition</article-title>. <source>IEEE Access</source> <volume>9</volume>, <fpage>19516</fpage>&#x2013;<lpage>19547</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3054937</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Milford</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wyeth</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Mapping a suburb with a single camera using a biologically inspired SLAM system</article-title>. <source>IEEE Trans. Robotics</source> <volume>24</volume>, <fpage>1038</fpage>&#x2013;<lpage>1053</lpage>. <pub-id pub-id-type="doi">10.1109/tro.2008.2004520</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Pf&#xfc;lb</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Gepperth</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>A comprehensive, application-oriented study of catastrophic forgetting in DNNs</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>, <conf-loc>New Orleans, Louisiana, United States</conf-loc>, <conf-date>May 6 - May 9, 2019</conf-date>.</citation>
</ref>
<ref id="B18">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ramasesh</surname>
<given-names>V. V.</given-names>
</name>
<name>
<surname>Dyer</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Raghu</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Anatomy of catastrophic forgetting: hidden representations and task semantics</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>, <conf-loc>Austria</conf-loc>, <conf-date>May 3-7, 2021</conf-date>.</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schubert</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Neubert</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Garg</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Milford</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Visual place recognition: a tutorial</article-title>. <source>IEEE Robotics &amp; Automation Mag.</source>, <fpage>2</fpage>&#x2013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1109/mra.2023.3310859</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Simonyan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Very deep convolutional networks for large-scale image recognition</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>, <conf-loc>San Diego, CA, USA</conf-loc>, <conf-date>May 7-9, 2015</conf-date>.</citation>
</ref>
<ref id="B21">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Stich</surname>
<given-names>S. U.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Local SGD converges fast and communicates little</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>, <conf-loc>New Orleans, LA, USA</conf-loc>, <conf-date>May 6-9, 2019</conf-date>.</citation>
</ref>
<ref id="B22">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>S&#xfc;nderhauf</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Neubert</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Protzel</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2013</year>). &#x201c;<article-title>Are we there yet? Challenging SeqSLAM on a 3000 km journey across all four seasons</article-title>,&#x201d; in <conf-name>Proc. of Workshop on Long-Term Autonomy, IEEE International Conference on Robotics and Automation</conf-name>, <conf-loc>Karlsruhe, Germany</conf-loc>, <conf-date>6-10 May 2013</conf-date>.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Torii</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Arandjelovi&#x107;</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Sivic</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Okutomi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Pajdla</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>24/7 place recognition by view synthesis</article-title>. <source>IEEE Trans. Pattern Analysis Mach. Intell.</source> <volume>40</volume>, <fpage>257</fpage>&#x2013;<lpage>271</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2017.2667665</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Torii</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sivic</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Okutomi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Pajdla</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Visual place recognition with repetitive structures</article-title>. <source>IEEE Trans. Pattern Analysis Mach. Intell.</source> <volume>37</volume>, <fpage>2346</fpage>&#x2013;<lpage>2359</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2015.2409868</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Ji</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Gong</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). &#x201c;<article-title>CosFace: large margin cosine loss for deep face recognition</article-title>,&#x201d; in <conf-name>IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Salt Lake City, Utah, USA</conf-loc>, <conf-date>18-22 June 2018</conf-date>, <fpage>5265</fpage>&#x2013;<lpage>5274</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tantia</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Ballas</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Rabbat</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>SlowMo: improving communication-efficient distributed SGD with slow momentum</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>, <conf-loc>Addis Ababa, Ethiopia</conf-loc>, <conf-date>April 26-30, 2020</conf-date>.</citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Scott</surname>
<given-names>M. R.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Multi-similarity loss with general pair weighting for deep metric learning</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Long Beach, CA, USA</conf-loc>, <conf-date>June 16 2019 to June 17 2019</conf-date>, <fpage>5022</fpage>&#x2013;<lpage>5030</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Warburg</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Hauberg</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lopez-Antequera</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gargallo</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Kuang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Civera</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Mapillary street-level sequences: a dataset for lifelong place recognition</article-title>,&#x201d; in <conf-name>IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Seattle, WA, USA</conf-loc>, <conf-date>June 13 2020 to June 19 2020</conf-date>.</citation>
</ref>
<ref id="B29">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Jin</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization</article-title>,&#x201d; in <conf-name>Proceedings of the 36th International Conference on Machine Learning (PMLR), vol. 97 of Proceedings of Machine Learning Research</conf-name>, <conf-loc>Long Beach, California, USA</conf-loc>, <conf-date>09-15 June 2019</conf-date>, <fpage>7184</fpage>&#x2013;<lpage>7193</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zaffar</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Garg</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Milford</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kooij</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Flynn</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>McDonald-Maier</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>VPR-Bench: an open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change</article-title>. <source>Int. J. Comput. Vis.</source> <volume>129</volume>, <fpage>2136</fpage>&#x2013;<lpage>2174</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-021-01469-5</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2023</year>). &#x201c;<article-title>R2Former: unified retrieval and reranking transformer for place recognition</article-title>,&#x201d; in <conf-name>IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Vancouver, BC, Canada</conf-loc>, <conf-date>June 17 2023 to June 24 2023</conf-date>.</citation>
</ref>
</ref-list>
</back>
</article>