<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Plant Sci.</journal-id>
<journal-title>Frontiers in Plant Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Plant Sci.</abbrev-journal-title>
<issn pub-type="epub">1664-462X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpls.2022.787527</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Plant Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Plant recognition by AI: Deep neural nets, transformers, and kNN in deep embeddings</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Picek</surname> <given-names>Luk&#x000E1;&#x00161;</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1497982/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>&#x00160;ulc</surname> <given-names>Milan</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1315174/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Patel</surname> <given-names>Yash</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1500686/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Matas</surname> <given-names>Ji&#x00159;&#x000ED;</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia</institution>, <addr-line>Pilsen</addr-line>, <country>Czechia</country></aff>
<aff id="aff2"><sup>2</sup><institution>Visual Recognition Group, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague</institution>, <addr-line>Prague</addr-line>, <country>Czechia</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Pierre Bonnet, CIRAD, UMR AMAP, France</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Herv&#x000E9; Go&#x000EB;au, UMR5120 Botanique et mod&#x000E9;lisation de l&#x00027;architecture des plantes et des v&#x000E9;g&#x000E9;tations (AMAP), France; Chuan Lu, Aberystwyth University, United Kingdom</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Luk&#x000E1;&#x00161; Picek <email>picekl&#x00040;kky.zcu.cz</email>; <email>lukaspicek&#x00040;gmail.com</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Technical Advances in Plant Science, a section of the journal Frontiers in Plant Science</p></fn></author-notes>
<pub-date pub-type="epub">
<day>27</day>
<month>09</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>13</volume>
<elocation-id>787527</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>09</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>15</day>
<month>07</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Picek, &#x00160;ulc, Patel and Matas.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Picek, &#x00160;ulc, Patel and Matas</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>The article reviews and benchmarks machine learning methods for automatic image-based plant species recognition and proposes a novel retrieval-based method for recognition by nearest neighbor classification in a deep embedding space. The image retrieval method relies on a model trained <italic>via</italic> the Recall&#x00040;k surrogate loss. State-of-the-art approaches to image classification, based on Convolutional Neural Networks (CNN) and Vision Transformers (ViT), are benchmarked and compared with the proposed image retrieval-based method. The impact of performance-enhancing techniques, e.g., class prior adaptation, image augmentations, learning rate scheduling, and loss functions, is studied. The evaluation is carried out on the PlantCLEF 2017, the ExpertLifeCLEF 2018, and the iNaturalist 2018 datasets&#x02014;the largest publicly available datasets for plant recognition. The evaluation of CNN and ViT classifiers shows a gradual improvement in classification accuracy. The current state-of-the-art Vision Transformer model, ViT-Large/16, achieves 91.15% and 83.54% accuracy on the PlantCLEF 2017 and ExpertLifeCLEF 2018 test sets, respectively; relative to the best CNN model (ResNeSt-269e), the error rate dropped by 22.91% and 28.34%. In addition, further training tricks increased the performance of the ViT-Base/32 by 3.72% on ExpertLifeCLEF 2018 and by 4.67% on PlantCLEF 2017. The retrieval approach achieved superior performance in all measured scenarios with accuracy margins of 0.28%, 4.13%, and 10.25% on ExpertLifeCLEF 2018, PlantCLEF 2017, and iNat2018&#x02013;Plantae, respectively.</p></abstract>
<kwd-group>
<kwd>plant</kwd>
<kwd>species</kwd>
<kwd>classification</kwd>
<kwd>recognition</kwd>
<kwd>machine learning</kwd>
<kwd>computer vision</kwd>
<kwd>species recognition</kwd>
<kwd>fine-grained</kwd>
</kwd-group>
<contract-num rid="cn001">SS05010008</contract-num>
<contract-num rid="cn002">SGS-2022-017</contract-num>
<contract-num rid="cn003">CZ.02.1.01/0.0/0.0/16_019/0000765</contract-num>
<contract-num rid="cn004">SGS20/171/OHK3/3T/13</contract-num>
<contract-sponsor id="cn001">Ministerstvo &#x0017D;ivotn&#x000ED;ho Prostred&#x000ED;<named-content content-type="fundref-id">10.13039/501100013855</named-content></contract-sponsor>
<contract-sponsor id="cn002">Z&#x000E1;padocesk&#x000E1; Univerzita v Plzni<named-content content-type="fundref-id">10.13039/100009056</named-content></contract-sponsor>
<contract-sponsor id="cn003">Research Center for Informatics, Czech Technical University in Prague<named-content content-type="fundref-id">10.13039/100018240</named-content></contract-sponsor>
<contract-sponsor id="cn004">Cesk&#x000E9; Vysok&#x000E9; Ucen&#x000ED; Technick&#x000E9; v Praze<named-content content-type="fundref-id">10.13039/100007655</named-content></contract-sponsor>
<counts>
<fig-count count="5"/>
<table-count count="7"/>
<equation-count count="12"/>
<ref-count count="57"/>
<page-count count="16"/>
<word-count count="10057"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Accurate species identification is essential for most ecologically motivated studies, in the pharmaceutical industry, agriculture, and conservation. In the case of flora&#x02014;with more than 400,000 species and high inter-species similarities&#x02014;correct species determination requires a high level of expertise. An identification process using dichotomous keys may take days, even for specialists, especially in locations with high biodiversity, and it is exceedingly difficult for non-scientists (Belhumeur et al., <xref ref-type="bibr" rid="B1">2008</xref>). To overcome that issue, Gaston and O&#x00027;Neill (<xref ref-type="bibr" rid="B8">2004</xref>) proposed using a computer-vision-based search engine to partially assist with plant identification and consequently speed up the identification process. Since then, we have witnessed an increased research interest in plant species identification using computer vision and machine learning (Wu et al., <xref ref-type="bibr" rid="B53">2006</xref>, <xref ref-type="bibr" rid="B54">2007</xref>; Prasad et al., <xref ref-type="bibr" rid="B36">2011</xref>; Priya et al., <xref ref-type="bibr" rid="B37">2012</xref>; Caglayan et al., <xref ref-type="bibr" rid="B4">2013</xref>; Munisami et al., <xref ref-type="bibr" rid="B31">2015</xref>), especially following the advances in deep learning (Ghazi et al., <xref ref-type="bibr" rid="B9">2017</xref>; Bonnet et al., <xref ref-type="bibr" rid="B2">2018</xref>; Lee et al., <xref ref-type="bibr" rid="B27">2018</xref>; &#x00160;ulc et al., <xref ref-type="bibr" rid="B43">2018</xref>; W&#x000E4;ldchen and M&#x000E4;der, <xref ref-type="bibr" rid="B50">2018</xref>; Picek et al., <xref ref-type="bibr" rid="B33">2019</xref>).</p>
<p>The overall performance of automatic fine-grained image classifiers has improved considerably over the last decade with the development of deep neural networks, mostly Convolutional Neural Networks (CNNs). We refer readers unfamiliar with the principles of deep learning and CNNs to the book by Goodfellow et al. (<xref ref-type="bibr" rid="B16">2016</xref>). The success of deep learning models trained with full supervision is typically conditioned by the existence of large databases of annotated images. For plant recognition, such large-scale data are available, thanks to citizen-science and open-data initiatives such as Encyclopedia of Life (<ext-link ext-link-type="uri" xlink:href="http://www.eol.org/">EoL</ext-link>), <ext-link ext-link-type="uri" xlink:href="http://www.plantnet.org/">Pl&#x00040;ntNet</ext-link>, and the Global Biodiversity Information Facility (<ext-link ext-link-type="uri" xlink:href="http://www.gbif.org/">GBIF</ext-link>). This allowed building challenging datasets for fine-grained classification training and evaluation, e.g., in PlantCLEF (Go&#x000EB;au et al., <xref ref-type="bibr" rid="B10">2016</xref>, <xref ref-type="bibr" rid="B11">2017</xref>, <xref ref-type="bibr" rid="B12">2018</xref>, <xref ref-type="bibr" rid="B14">2020</xref>, <xref ref-type="bibr" rid="B15">2021</xref>), LifeCLEF (Joly et al., <xref ref-type="bibr" rid="B19">2018</xref>, <xref ref-type="bibr" rid="B20">2019</xref>, <xref ref-type="bibr" rid="B21">2020</xref>, <xref ref-type="bibr" rid="B22">2021</xref>), iNaturalist (Van Horn et al., <xref ref-type="bibr" rid="B47">2018</xref>), and Pl&#x00040;ntNet (Garcin et al., <xref ref-type="bibr" rid="B7">2021</xref>).</p>
<p>This article deals with automatic image-based plant species identification &#x0201C;<italic>in the wild&#x0201D;</italic> and therefore has to cope with, among other challenges: (i) Different scales: Plant species can be observed from various angles and distances. (ii) Intra-class differences: Plant organs&#x02014;leaf, fruit, bark, etc.&#x02014;look very distinct. (iii) Inter-class similarities: The same organ of different species might look very similar. (iv) Background and clutter: Other species are present behind or around the observed sample. Identification of plants from images is a fine-grained classification problem, due to the high number of classes<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>, high intra-class variance, and small inter-class differences. &#x00160;ulc and Matas (<xref ref-type="bibr" rid="B41">2017</xref>) showed that constrained plant identification tasks, such as recognition of scanned leaves, can be solved with a high level of classification accuracy (approximately 99%). Yet the &#x0201C;<italic>in the wild&#x0201D;</italic> scenario, with an unspecified view or organ type, natural background, possible clutter in the scene, etc., remains challenging even for state-of-the-art deep learning methods. For sample &#x0201C;in the wild&#x0201D; photographs, refer to <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>&#x0201C;In the wild&#x0201D; photograph samples&#x02014;PlantCLEF datasets. Images by soyoban, Liliane Roubaudi, Hugo Santacreu, Sarah Dechamps, Richard Gautier, Heinz Gass, Alain Bigou, Jean-Michel Launay, and Jose Luis Romero.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-787527-g0001.tif"/>
</fig>
<p>Two approaches to fine-grained recognition are considered. The first is the standard approach, where fine-grained recognition is posed as closed-set classification and learning involves minimizing the cross-entropy loss. The second is a retrieval-based approach, which is very competitive and achieves superior performance under comparable conditions. Here, training involves learning an embedding whose metric space leads to high recall in the retrieval task. Formulating fine-grained recognition as retrieval has clear advantages&#x02014;besides providing ranked class predictions, it recovers relevant nearest-neighbor labeled samples. The retrieved nearest neighbors provide explainability to the deep network and can be visually checked by an expert. Moreover, the user may inspect specific information, e.g., about location and date of collection, to further reduce decision uncertainty. Besides, the retrieval approach naturally supports open-set recognition problems, i.e., the ability to extend or modify the set of recognized classes after the training stage. The set of classes may change, e.g., as a result of modifications to biological taxonomy. New classes are introduced simply by adding training images with the new label, whereas in the standard approach, the classification head needs re-training. On the negative side, the retrieval approach requires, on top of running the deep net to extract the embedding, an efficient nearest neighbor search, which increases the overall complexity of the fine-grained recognition system.</p>
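<p>The retrieval formulation can be illustrated with a minimal sketch. It assumes that embeddings have already been extracted by some deep network, and it uses cosine similarity with a similarity-weighted vote over the k nearest neighbors. The function names, the toy two-dimensional embeddings, the species labels, and the voting rule are illustrative assumptions, not the exact procedure evaluated in this article.</p>
<preformat>
```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def knn_classify(query, gallery, k=3):
    """Classify a query embedding by its k nearest labeled gallery embeddings.

    gallery is a list of (embedding, label) pairs.
    Returns (ranked_labels, neighbors), where neighbors is a list of
    (cosine_similarity, label) pairs that an expert could inspect.
    """
    q = l2_normalize(query)
    scored = []
    for embedding, label in gallery:
        g = l2_normalize(embedding)
        similarity = sum(a * b for a, b in zip(q, g))
        scored.append((similarity, label))
    # Keep the k most similar gallery samples.
    neighbors = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    # Similarity-weighted vote (an assumed rule, chosen for simplicity).
    votes = defaultdict(float)
    for similarity, label in neighbors:
        votes[label] += similarity
    ranked_labels = sorted(votes, key=votes.get, reverse=True)
    return ranked_labels, neighbors


# Toy gallery with hypothetical embeddings and labels.
gallery = [
    ([1.0, 0.0], "Quercus robur"),
    ([0.9, 0.1], "Quercus robur"),
    ([0.0, 1.0], "Acer campestre"),
]
ranked, neighbors = knn_classify([0.8, 0.2], gallery, k=2)

# A new class is added by appending labeled embeddings; nothing is re-trained.
gallery.append(([-1.0, 0.0], "Betula pendula"))
ranked_new, _ = knn_classify([-0.9, 0.1], gallery, k=1)
```
</preformat>
<p>In this sketch the gallery is scanned exhaustively, which is only practical for small collections; a large-scale system would presumably replace the loop with an approximate nearest neighbor index, which is the added complexity noted in the paragraph above.</p>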
<p>Section 4 discusses techniques that can noticeably improve the performance of any vision-based species recognition system. The techniques are diverse and attend to different problems. The prior shift in the datasets, i.e., the difference between the training and test data class distribution, is a significant and omnipresent phenomenon. We test existing prior shift adaptation methods and their impact on classification accuracy. Class prior adaptation equips the system with the ability to reflect the change of prior probability of observing a specimen of a given species over time and location. Image augmentations make the system robust to acquisition conditions that, in some applications, e.g., plant recognition, are far from the lab setting. Finally, technical aspects related to the training of deep nets, such as the learning rate schedule, loss functions, and the impact of noisy data on classification performance, are discussed.</p>
<p>The performance evaluation part of the article builds on our winning submissions to PlantCLEF (Picek et al., <xref ref-type="bibr" rid="B33">2019</xref>; Sulc and Matas, <xref ref-type="bibr" rid="B42">2019</xref>) and extends a workshop article (&#x00160;ulc et al., <xref ref-type="bibr" rid="B43">2018</xref>) and a PhD thesis (&#x00160;ulc, <xref ref-type="bibr" rid="B40">2020</xref>). It substantially extends the experiments by including recent state-of-the-art methods for image classification: Convolutional Neural Networks (CNNs) (Xie et al., <xref ref-type="bibr" rid="B55">2017</xref>; Hu et al., <xref ref-type="bibr" rid="B18">2018</xref>; Zhang et al., <xref ref-type="bibr" rid="B56">2020</xref>; Tan and Le, <xref ref-type="bibr" rid="B45">2021</xref>), Vision Transformers (ViTs) (Dosovitskiy et al., <xref ref-type="bibr" rid="B6">2021</xref>), and an interpretable image retrieval approach (Patel et al., <xref ref-type="bibr" rid="B32">2021</xref>).</p>
</sec>
<sec id="s2">
<title>2. Related work</title>
<p>This section reviews existing methods, systems, and applications for plant species recognition: leaf or bark recognition and &#x0201C;<italic>in the wild</italic>&#x0201D; plant species recognition.</p>
<sec>
<title>2.1. Leaf and bark recognition</title>
<p>Before deep learning, leaf and bark recognition was the only application in which automatic plant species identification could reliably tackle complex species recognition tasks. Most techniques were based on two steps: (i) descriptor extraction, often based on combining different hand-crafted features such as shape, color, or local descriptors (SIFT, SURF, ORB, etc.), and (ii) classical classifiers such as k-Nearest Neighbor (Munisami et al., <xref ref-type="bibr" rid="B31">2015</xref>), Random Forest (Caglayan et al., <xref ref-type="bibr" rid="B4">2013</xref>), SVM (Prasad et al., <xref ref-type="bibr" rid="B36">2011</xref>; Priya et al., <xref ref-type="bibr" rid="B37">2012</xref>), and early adoptions of neural networks (Wu et al., <xref ref-type="bibr" rid="B53">2006</xref>, <xref ref-type="bibr" rid="B54">2007</xref>). The generalization capability of these methods was limited, and so was the applicability&#x02014;e.g., most leaf recognition methods relied on the shape of scanned leaves; thus, the usability in the &#x0201C;in the wild&#x0201D; scenario was limited since a uniform background was required.</p>
</sec>
<sec>
<title>2.2. Flora recognition in the wild</title>
<p>The continuous progress in automatic plant species recognition &#x0201C;<italic>in the wild</italic>&#x0201D; has been strongly driven by the efforts of the LifeCLEF research platform. Established in 2014, LifeCLEF helps track progress and allows reliable evaluation of novel methods. In particular, the annual PlantCLEF challenges are an immense source of plant species datasets tailored to develop and evaluate automatic plant species recognition methods.</p>
<p>According to the findings of the LifeCLEF challenges (Joly et al., <xref ref-type="bibr" rid="B19">2018</xref>, <xref ref-type="bibr" rid="B20">2019</xref>, <xref ref-type="bibr" rid="B21">2020</xref>, <xref ref-type="bibr" rid="B22">2021</xref>), AI-based identification of the world flora has improved significantly over the last 5 years and has reached performance similar to that of human experts for common (&#x00160;ulc et al., <xref ref-type="bibr" rid="B43">2018</xref>) as well as for rare species (Picek et al., <xref ref-type="bibr" rid="B33">2019</xref>). Ensembles of CNN models were able to recognize 10,000 plant species from Europe and North America and 10,000 from the Guiana shield and the Amazonia with approximately 90 and 40% accuracy, respectively.</p>
<p>Overall, there are few methods for plant recognition &#x0201C;in the wild&#x0201D;; thus, we overview relevant methods for general fine-grained recognition. Wu et al. (<xref ref-type="bibr" rid="B52">2019</xref>) developed a Taxonomic Loss that sums up loss functions calculated from different taxonomy ranks, e.g., species, genus, and family. Cui et al. (<xref ref-type="bibr" rid="B5">2018</xref>) studied domain-specific transfer learning from large-scale datasets to domain-specific fine-grained datasets. Zheng et al. (<xref ref-type="bibr" rid="B57">2019</xref>) proposed the Trilinear Attention Sampling Network that generates attention maps by modeling the inter-channel relationships, highlights attended parts with high resolution, and distills part features into an object-level feature. Keaton et al. (<xref ref-type="bibr" rid="B23">2021</xref>) utilized object detection as a form of attention with a bottom-up approach to detect plant organs and combine the predictions from organ-specific classifiers. Malik et al. (<xref ref-type="bibr" rid="B30">2021</xref>) used a standard ensemble-based approach utilizing Inception, MobileNet, and ResNet CNN architectures.</p>
<p>Several interesting approaches emerged in connection with the annual PlantCLEF workshops. In PlantCLEF 2017, the best performing submission, with an accuracy of 88.5%, was developed by Lasseck (<xref ref-type="bibr" rid="B26">2017</xref>). The underlying method is based on 12 models derived from 3 architectures&#x02014;GoogLeNet, ResNet-152, and ResNeXt-101-64x4d. All models were fine-tuned from the ImageNet-1k checkpoints utilizing various augmentation techniques, e.g., random cropping, horizontal flipping, variations of saturation and lightness, and rotation. At test time, predictions for 5 crops of every observation image are computed with all models and averaged. In PlantCLEF 2018, the best performing submission (Sulc and Matas, <xref ref-type="bibr" rid="B42">2019</xref>) was based on two architectures&#x02014;Inception-ResNet-v2 and Inception-v4 (Szegedy et al., <xref ref-type="bibr" rid="B44">2017</xref>)&#x02014;and their ensembles and achieved an accuracy of 88.4%. The TensorFlow-Slim API was used to adjust and fine-tune the networks from the publicly available ImageNet-1k pre-trained checkpoints. All networks shared the following optimizer settings: RMSprop with momentum and decay set to 0.9, initial learning rate 0.01, and exponential learning rate decay factor 0.4. Batch size, input resolution, and random crop area range were set differently for each network. For the values used, please refer to the original article (Sulc and Matas, <xref ref-type="bibr" rid="B42">2019</xref>). The following image pre-processing was used for training: random crop with aspect ratio range (0.75, 1.33) and various area ranges, random left-right flip, and brightness and saturation distortion. At test time, 14 predictions per image are generated by using 7 crops and their mirrored versions: the full image, a central crop covering 80% of the original image dimensions, a central crop covering 60% of the original image dimensions, and 4 corner crops covering 60% of the original image dimensions. A significant improvement in accuracy was achieved by using running averages of the trained variables instead of the values from the last training step. This is especially important if noisy labels are present in the training set, where mini-batches with noisy samples may produce large gradients pointing away from the local optima. The use of Polyak averaging (Polyak and Juditsky, <xref ref-type="bibr" rid="B35">1992</xref>) resulted in a more stable version of the training variables.</p>
</sec>
</sec>
<sec id="s3">
<title>3. Datasets</title>
<p>This section overviews datasets suitable for plant recognition &#x0201C;<italic>in the wild</italic>&#x0201D; which, unlike other plant species datasets, contain images of various plant body parts observed in an open world. Such datasets are unique with high inter-class similarities&#x02014;bark of one species is similar to the bark of another species&#x02014;and high intra-class differences&#x02014;the bark, flower, and fruit of one species are visually distinct. Currently, datasets with large species diversity and a sufficient number of samples to train a reliable machine learning model are available. The most significant providers of those datasets&#x02014;<ext-link ext-link-type="uri" xlink:href="https://www.inaturalist.org/">iNaturalist</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://www.plantnet.org/">Pl&#x00040;ntNet</ext-link>, <ext-link ext-link-type="uri" xlink:href="http://www.eol.org/">EoL</ext-link>, <ext-link ext-link-type="uri" xlink:href="https://www.imageclef.org/LifeCLEF2022">LifeCLEF</ext-link>&#x02014;are closely connected to citizen-science platforms, thus their data originate from thousands of users, and are captured on various devices, observed under different conditions, and submitted from many countries. The most influential datasets are described below and their main characteristics are summarized in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Datasets for plant recognition; &#x0201C;<italic>in the wild</italic>&#x0201D; scenario.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left" colspan="2"></th>
<th valign="top" align="center" colspan="3"><bold>Number of images in</bold></th>
</tr>
<tr>
<th valign="top" align="center"><bold>Dataset</bold></th>
<th valign="top" align="center"><bold>Species</bold></th>
<th valign="top" align="center"><bold>Training</bold></th>
<th valign="top" align="center"><bold>Validation</bold></th>
<th valign="top" align="center"><bold>Test</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Pl&#x00040;ntNet-300K</td>
<td valign="top" align="center">1,081</td>
<td valign="top" align="center">243,916</td>
<td valign="top" align="center">31,118</td>
<td valign="top" align="center">31,112</td>
</tr>
<tr>
<td valign="top" align="left">iNaturalist 2017<sup>&#x02020;</sup></td>
<td valign="top" align="center">2,101</td>
<td valign="top" align="center">158,407</td>
<td valign="top" align="center">38,206</td>
<td valign="top" align="center">&#x000D7;</td>
</tr>
<tr>
<td valign="top" align="left">iNaturalist 2018<sup>&#x02020;</sup></td>
<td valign="top" align="center">2,917</td>
<td valign="top" align="center">118,800</td>
<td valign="top" align="center">8,751</td>
<td valign="top" align="center">&#x000D7;</td>
</tr>
<tr>
<td valign="top" align="left">iNaturalist 2021<sup>&#x02020;</sup></td>
<td valign="top" align="center">4,271</td>
<td valign="top" align="center">1,148,702</td>
<td valign="top" align="center">42,710</td>
<td valign="top" align="center">&#x000D7;</td>
</tr>
<tr>
<td valign="top" align="left">PlantCLEF 2016</td>
<td valign="top" align="center">1,000</td>
<td valign="top" align="center">113,205</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">2,583</td>
</tr>
<tr>
<td valign="top" align="left">PlantCLEF 2017<sup>&#x02021;</sup></td>
<td valign="top" align="center">10,000</td>
<td valign="top" align="center">320,544</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">25,170</td>
</tr>
<tr>
<td valign="top" align="left">ExpertLifeCLEF 2018<sup>&#x02021;</sup></td>
<td valign="top" align="center">10,000</td>
<td valign="top" align="center">320,544</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">6,892</td>
</tr>
<tr>
<td valign="top" align="left">PlantCLEF 2019</td>
<td valign="top" align="center">10,000</td>
<td valign="top" align="center">434,251</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">2,974</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Species from the <italic>Plantae</italic> kingdom marked<sup>&#x02020;</sup>, data with &#x0201C;<italic>trusted</italic>&#x0201D;, i.e., human verified, labels marked<sup>&#x02021;</sup>.</p>
</table-wrap-foot>
</table-wrap>
<p>For the experimental evaluation in this article, we used iNaturalist 2018<sup>&#x02020;</sup>, PlantCLEF 2017<sup>&#x02021;</sup>, and ExpertLifeCLEF 2018<sup>&#x02021;</sup>, as they offer a sufficient number of species and test samples while keeping the training set size and, thus, computational demands reasonably low.</p>
<sec>
<title>3.1. LifeCLEF&#x02014;PlantCLEF</title>
<p>The annual LifeCLEF&#x02014;PlantCLEF identification challenge is an important source of data for plant recognition. Since 2017 the PlantCLEF challenges present the following classification problem: For each plant observation consisting of one or more images of the same specimen, predict the species. Example images from one observation are visualized in <xref ref-type="fig" rid="F2">Figure 2</xref>. The PlantCLEF datasets are mainly intended for benchmarking machine-learning-based algorithms for plant recognition and are therefore briefly described below.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>A PlantCLEF observation&#x02014;images of different plant parts. Images by Hugo Santacreu.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-787527-g0002.tif"/>
</fig>
<p><bold>The PlantCLEF 2016</bold> dataset (Go&#x000EB;au et al., <xref ref-type="bibr" rid="B10">2016</xref>) comprises 113,205 training images belonging to 41,794 observations of 1,000 plant species from France and neighboring countries. Every image is annotated with a plant organ label, i.e., flower, leaf, fruit, stem, branch, and whole plant. A small fraction has GPS coordinates. The test set contains 2,583 images. As in all PlantCLEF challenges, no predefined validation set was provided.</p>
<p><bold>The PlantCLEF 2017</bold> challenge dataset (Go&#x000EB;au et al., <xref ref-type="bibr" rid="B11">2017</xref>) includes 320,544 images from the Encyclopedia of Life with trusted labels, and noisy web data crawled with Bing and Google search engines (&#x0007E;1.15M images). The dataset covers 10,000 plant species&#x02014;mainly from North America and Europe&#x02014;representing the biggest plant species identification dataset in the number of classes. The test set contains 25,170 images (17,868 observations).</p>
<p><bold>The ExpertLifeCLEF 2018</bold> dataset (Go&#x000EB;au et al., <xref ref-type="bibr" rid="B12">2018</xref>) shares its training set with PlantCLEF 2017 and differs only in the test set. The test set contains 6,892 images (2,072 observations) covering species mainly from Western Europe and North America. In addition, selected endangered species, and cultivated and ornamental plant species were added.</p>
<p><bold>The PlantCLEF 2019</bold> dataset (Go&#x000EB;au et al., <xref ref-type="bibr" rid="B13">2019</xref>) contains 434,251 images that belong to 10,000 rare species from the Guiana shield and the Amazon rain forest. The images originate from EoL and Google/Bing search engines; the majority have &#x0201C;<italic>noisy</italic>&#x0201D; labels. The test set is composed of 742 plant observations (2,974 images) collected and identified by five experts on tropical flora.</p>
</sec>
<sec>
<title>3.2. iNaturalist</title>
<p>iNaturalist is a crowd-based citizen-science platform allowing citizens and experts to upload, annotate and categorize species of the world. iNaturalist has a wide geographic and taxonomic coverage&#x02014;more than 343 thousand species with approximately 97 million observations. The annual iNaturalist competition datasets that include a significant number of plant species are described below.</p>
<p><bold>iNaturalist 2017</bold>: The iNaturalist 2017 dataset (Van Horn et al., <xref ref-type="bibr" rid="B47">2018</xref>) contains 2,101 plant species, with 158,407 training and 38,206 validation images that have been collected and verified by multiple independent users. The dataset features many visually similar species that have been captured worldwide and under various conditions. As labels for the test set were not provided, it is impossible to specify how many plant species it contains.</p>
<p><bold>iNaturalist 2018</bold>: The <ext-link ext-link-type="uri" xlink:href="https://github.com/visipedia/inat_comp/tree/master/2018">iNaturalist Challenge 2018 dataset</ext-link> includes 2,917 plant species, with 118,800 training and 8,751 validation images acquired the same way as in the previous year. Additionally, complete taxonomy information was given for all images. Test labels were not provided.</p>
<p><bold>iNaturalist 2021</bold>: The <ext-link ext-link-type="uri" xlink:href="https://github.com/visipedia/inat_comp/tree/master/2021">iNaturalist Challenge 2021 dataset</ext-link> with 1,148,702 training and 42,710 validation images is the most extensive dataset considering the number of images&#x02014;the number of plant species was increased to 4,271. Test labels were not provided as in all iNaturalist Challenge datasets.</p>
</sec>
<sec>
<title>3.3. Pl&#x00040;ntNet-300K</title>
<p>The Pl&#x00040;ntNet-300K dataset (Garcin et al., <xref ref-type="bibr" rid="B7">2021</xref>) is built from the database of the Pl&#x00040;ntNet citizen observatory and includes 1,081 species and 306,146 images. The dataset exhibits a long-tailed class imbalance, where 20% of the most common species provide 89% of the images. The provided validation and test sets include 31,118 and 31,112 images, respectively.</p>
</sec>
</sec>
<sec sec-type="methods" id="s4">
<title>4. Methods</title>
<p>This section is divided into three parts. First, the pipeline for automatic Plant Recognition by the standard Image Classification pipeline is described. Second, an alternative and novel approach to Plant Recognition <italic>via</italic> kNN classification in deep embedding space is proposed and described. Finally, a range of methods and techniques that increase classification performance are introduced.</p>
<sec>
<title>4.1. Deep neural network classifiers</title>
<p>Plant species recognition can be automated through the standard image classification approach, where a Deep Neural Network (DNN) serves as a deep feature extractor and a fully connected layer serves as a classifier. Image representations learned by deep neural networks provide significantly better results than handcrafted features. Furthermore, DNNs are data-driven and require no manual effort or expertise for feature selection, as they automatically learn discriminative features for each task. In addition, the automatically learned features are represented hierarchically on multiple levels, which is a strong advantage over traditional approaches.</p>
<p>Currently, many DNN architectures are widely used; thus, a broad range of Convolutional Neural Networks and Transformer-based architectures are evaluated to test the classification capabilities of different feature extractor architectures. The ResNet-50 (He et al., <xref ref-type="bibr" rid="B17">2016</xref>), Inception-v4, and Inception-ResNet-v2 (Szegedy et al., <xref ref-type="bibr" rid="B44">2017</xref>) are chosen as baselines as they are commonly used in related studies. We add the following novel and state-of-the-art architectures:</p>
<p><bold>SE-ResNeXt-101:</bold> Extends the ResNet deep residual blocks by adding the <italic>NeXt</italic> dimension, called Cardinality (Xie et al., <xref ref-type="bibr" rid="B55">2017</xref>), and Squeeze-and-Excitation blocks that adaptively re-calibrate channel-wise feature responses by explicitly modeling inter-dependencies between channels (Hu et al., <xref ref-type="bibr" rid="B18">2018</xref>).</p>
<p><bold>ResNeSt-269e:</bold> Applies channel-wise attention to different parts of the architecture to enable cross-feature interactions and the learning of more diverse representations (Zhang et al., <xref ref-type="bibr" rid="B56">2020</xref>).</p>
<p><bold>EfficientNetV2-S:</bold> Similarly to the first EfficientNet generation, the EfficientNetV2 architectures are developed by a combination of training-aware architecture search and scaling, to jointly optimize training speed and parameter efficiency (Tan and Le, <xref ref-type="bibr" rid="B45">2021</xref>). Unlike the first generation, the EfficientNetV2 models (i) were searched in a space enriched with Fused-MBConv blocks and (ii) omit the last stride-1 stage of the original EfficientNet.</p>
<p><bold>Vision Transformers:</bold> Unlike CNNs, the Vision Transformer (ViT) (Dosovitskiy et al., <xref ref-type="bibr" rid="B6">2021</xref>) does not use convolutions but interprets an image as a sequence of patches and processes it with a standard Transformer encoder, originally used primarily for natural language processing (Vaswani et al., <xref ref-type="bibr" rid="B48">2017</xref>). Compared to state-of-the-art convolutional networks, selected ViT architectures demonstrated excellent performance in fine-grained image classification (Picek et al., <xref ref-type="bibr" rid="B34">2022</xref>).</p>
<sec>
<title>4.1.1. Training strategy</title>
<p>All NN architectures were initialized from publicly available ImageNet-1k or ImageNet-21k pre-trained checkpoints (Wightman, <xref ref-type="bibr" rid="B51">2019</xref>) and further fine-tuned for 100 epochs. Mini-batch gradients were accumulated to reach an effective batch size of 128 for all the architectures; in most cases, 4 batches of size 32 were accumulated. SGD with momentum (0.9) was used as the optimizer with a custom learning rate (LR) schedule: the LR is multiplied by 0.9 if the validation loss does not decrease for 2 epochs. The loss was calculated as Softmax Cross Entropy. During training, we employ a few data augmentation techniques from the Albumentations library (Buslaev et al., <xref ref-type="bibr" rid="B3">2020</xref>). A sample image and its augmented variations are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. The augmentation methods, their descriptions, and the specified non-default parameters are:</p>
<list list-type="bullet">
<list-item><p><italic>RandomResizedCrop</italic>: creates a random resized crop with a scale of 0.8&#x02013;1.0.</p></list-item>
<list-item><p><italic>HorizontalFlip</italic>: randomly (50% probability) flips the image horizontally.</p></list-item>
<list-item><p><italic>VerticalFlip</italic>: randomly (50% probability) flips the image vertically.</p></list-item>
<list-item><p><italic>RandomBrightnessContrast</italic>: with 20% probability, changes contrast and brightness by a random factor in the range from &#x02212;0.2 to 0.2.</p></list-item>
</list>
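<p>The learning-rate rule stated before the augmentation list (multiply the LR by 0.9 when the validation loss has not decreased for 2 epochs) can be illustrated with a short sketch. It is not the authors' training code: the class name <italic>PlateauLR</italic>, the initial LR of 0.01, and the reading that the reduction fires after two consecutive epochs without improvement are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
class PlateauLR:
    """Hypothetical helper: multiply the LR by `factor` after `patience`
    consecutive epochs without a decrease in validation loss."""

    def __init__(self, lr, factor=0.9, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            # Validation loss improved: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauLR(lr=0.01)  # the initial LR is an assumed value
for val_loss in [1.0, 0.9, 0.95, 0.93, 0.8]:
    current_lr = scheduler.step(val_loss)
```
]]></preformat>
<p>Library schedulers may count the patience window slightly differently, so the exact epoch at which the reduction happens should be checked against the implementation actually used.</p>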
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Image augmentations&#x02014;Horizontal and vertical flip, small brightness/contrast adjustments, and 80&#x02013;100% crops&#x02014;used while training the deep neural network classifier. Image by Zoya Akulova.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-787527-g0003.tif"/>
</fig>
<p>All images were resized to match the pre-trained model input size of 224 &#x000D7; 224 or 384 &#x000D7; 384, re-scaled from 0&#x02013;255 to 0&#x02013;1, and normalized by mean (0.5) and std (0.5) values in each channel.</p>
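<p>As a small worked example of the re-scaling and normalization step, the sketch below maps 8-bit pixel values to the range from &#x02212;1 to 1. The function names are hypothetical, and the nested-list image representation is chosen only to keep the example free of dependencies; resizing is omitted.</p>
<preformat><![CDATA[
```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Re-scale an 8-bit value from 0-255 to 0-1, then normalize."""
    scaled = value / 255.0
    return (scaled - mean) / std


def normalize_image(image):
    """Apply the same normalization to every channel of every pixel.

    `image` is a nested list of shape height x width x channels.
    """
    return [[[normalize_pixel(c) for c in pixel] for pixel in row]
            for row in image]


tiny_image = [[[0, 255, 127.5]]]  # one pixel with three channel values
normalized = normalize_image(tiny_image)
```
]]></preformat>
<p>With mean and std both equal to 0.5, the darkest value maps to &#x02212;1, the brightest to 1, and mid-gray to 0.</p>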
</sec>
<sec>
<title>4.1.2. Test-time</title>
<p>At test time, all images are resized to the appropriate size, i.e., 224 &#x000D7; 224 or 384 &#x000D7; 384, and normalized as in training. Next, all images of an observation are fed forward through the network, and the class predictions are combined. A study of different methods for combining predictions is included in Section 5.3. The classification performance of all selected models is evaluated at both resolutions (224 &#x000D7; 224 and 384 &#x000D7; 384) and on two different test sets (PlantCLEF 2017 and ExpertLifeCLEF 2018).</p>
</sec>
</sec>
<sec>
<title>4.2. Plant recognition <italic>via</italic> kNN classification in deep embedding space</title>
<p>Fine-grained recognition of plant species can be alternatively solved <italic>via</italic> the k-Nearest Neighbors algorithm (kNN) in an embedding space where the samples from the same semantic class are grouped together, and the samples from different classes are far apart. Recent studies by Touvron et al. (<xref ref-type="bibr" rid="B46">2021</xref>) and Khosla et al. (<xref ref-type="bibr" rid="B24">2020</xref>) have shown such a recognition technique to outperform standard cross-entropy-based training. To train such an embedding, we use the current state-of-the-art image retrieval method of Patel et al. (<xref ref-type="bibr" rid="B32">2021</xref>), where a deep neural network is trained with a surrogate loss for Recall&#x00040;k. The notation and methodology of the retrieval approach are described below.</p>
<sec>
<title>4.2.1. Notations</title>
<p>For a query example <italic>q</italic> &#x02208; <italic>X</italic>, the objective of a retrieval model is to obtain semantically similar samples from a collection &#x003A9; &#x02282; <italic>X</italic>, also known as the database, where <italic>X</italic> is the space of all images. The database is divided into two subsets containing the samples that are positive and negative with respect to the query <italic>q</italic>. These subsets are denoted by <italic>P</italic><sub><italic>q</italic></sub> and <italic>N</italic><sub><italic>q</italic></sub>, respectively, such that &#x003A9; &#x0003D; <italic>P</italic><sub><italic>q</italic></sub> &#x0222A; <italic>N</italic><sub><italic>q</italic></sub>. For the query <italic>q</italic>, all database samples are ranked based on a similarity score, with the goal of ranking positives before negatives.</p>
</sec>
<sec>
<title>4.2.2. Deep embedding</title>
<p>Image embedding, a learned vector representation of an image, is generated by function <inline-formula><mml:math id="M1"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mo>:</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x02192;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. Function <italic>f</italic><sub>&#x003B8;</sub> is a deep neural network, either a ResNet-50 or a Vision Transformer in this article, mapping input images to an <italic>L</italic><sub>2</sub>-normalized <italic>d</italic>-dimensional embedding. Embedding for image <italic>x</italic> is denoted by <italic><bold>x</bold></italic> &#x0003D; <italic>f</italic><sub>&#x003B8;</sub>(<italic>x</italic>). Parameters &#x003B8; of the network are learned during the training using Recall&#x00040;k surrogate loss. The similarity score between a query <italic>q</italic> and a database image <italic>x</italic> is computed by the dot product of the corresponding embeddings and is denoted by <italic>s</italic>(<italic>q, x</italic>) &#x0003D; <italic><bold>q</bold></italic><sup><italic>T</italic></sup><italic><bold>x</bold></italic>, also denoted as <italic>s</italic><sub><italic>qx</italic></sub>.</p>
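<p>To make the similarity score concrete, the following sketch <italic>L</italic><sub>2</sub>-normalizes two vectors and computes their dot product. The vectors stand in for network outputs and the function names are hypothetical; the deep network <italic>f</italic><sub>&#x003B8;</sub> itself is not modeled here.</p>
<preformat><![CDATA[
```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def similarity(q, x):
    """Dot product of two embeddings, i.e. s(q, x) = q^T x."""
    return sum(a * b for a, b in zip(q, x))


# Placeholder vectors standing in for raw network outputs.
q = l2_normalize([3.0, 4.0])
x = l2_normalize([4.0, 3.0])
s_qx = similarity(q, x)
```
]]></preformat>
<p>Because both embeddings have unit length, the dot product equals the cosine similarity and is bounded between &#x02212;1 and 1.</p>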
</sec>
<sec>
<title>4.2.3. Recall&#x00040;k surrogate loss</title>
<p>The Recall&#x00040;k Surrogate loss is a differentiable approximation of the Recall&#x00040;k evaluation metric. For a query <italic>q</italic>, the Recall&#x00040;k metric is the ratio of positive (relevant) samples in the top-k retrieved samples to the total number of positive samples in the database, given by |<italic>P</italic><sub><italic>q</italic></sub>|. The metric focuses only on the top-k ranked samples and is one of the standard metrics used to evaluate retrieval benchmarks. Recall&#x00040;k cannot be used directly as a loss function because it requires two non-differentiable operations: ranking the database samples and counting the number of positives that appear in the top-k. The following text gives the mathematical expression of Recall&#x00040;k, explains why it is non-differentiable, and describes the differentiable approximation proposed by Patel et al. (<xref ref-type="bibr" rid="B32">2021</xref>).</p>
<p>Patel et al. (<xref ref-type="bibr" rid="B32">2021</xref>) denote Recall&#x00040;k by <inline-formula><mml:math id="M2"><mml:mrow><mml:msubsup><mml:mi>R</mml:mi><mml:mtext>&#x003A9;</mml:mtext><mml:mi>k</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></inline-formula> when computed for query <italic>q</italic> and database &#x003A9; and express it mathematically in terms of the ranks of samples in the database:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi>R</mml:mi><mml:mi>&#x003A9;</mml:mi><mml:mi>k</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:mrow></mml:munder><mml:mrow><mml:mi>H</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mi>&#x003A9;</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where the rank of sample <italic>x</italic> is denoted by <italic>r</italic><sub>&#x003A9;</sub>(<italic>q, x</italic>), which depends on the query sample <italic>q</italic> and the database &#x003A9;. <italic>H</italic>(.) is the Heaviside step function, which is 0 for negative values and otherwise 1. The rank <italic>r</italic><sub>&#x003A9;</sub>(<italic>q, x</italic>) of sample <italic>x</italic> is computed according to the similarity score, and it can be expressed mathematically as:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mtext>&#x003A9;</mml:mtext></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>q</mml:mi><mml:mo>,</mml:mo><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mtext>&#x003A9;</mml:mtext><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>q</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>H</italic>(.) is also the Heaviside step function applied on the difference of similarity scores. Therefore, Recall&#x00040;k from Equation (1) can also be directly expressed as a function of similarity scores as:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mi>R</mml:mi><mml:mtext>&#x003A9;</mml:mtext><mml:mi>k</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:mrow></mml:munder><mml:mrow><mml:mi>H</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mtext>&#x003A9;</mml:mtext><mml:mo>,</mml:mo><mml:mi>z</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mi>H</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>q</mml:mi><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>q</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The computation of Recall&#x00040;k in Equation (3) involves two Heaviside step functions, one to obtain the rank and the other to count the positives in the top-k retrieved samples. The derivative of the Heaviside step function is the Dirac delta function, which is zero everywhere except at the origin, where it is unbounded, so it provides no useful gradient signal. Hence, direct optimization of recall with back-propagation is not feasible. Patel et al. (<xref ref-type="bibr" rid="B32">2021</xref>) provide a smooth approximation of the Heaviside step function by the logistic function, a sigmoid function &#x003C3;<sub>&#x003C4;</sub>:<italic>R</italic>&#x02192;<italic>R</italic> controlled by temperature &#x003C4;:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Replacing the two Heaviside step functions with the sigmoid functions of appropriate temperatures, a smooth approximation of Recall&#x00040;k can be expressed as:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>R</mml:mi><mml:mo>&#x002DC;</mml:mo></mml:mover><mml:mi>&#x003A9;</mml:mi><mml:mi>k</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>q</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>q</mml:mi></mml:msub></mml:mrow></mml:munder><mml:mrow><mml:msub><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x003C4;</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo></mml:mrow></mml:mstyle><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>&#x003A9;</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>z</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>x</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:munder><mml:mrow><mml:msub><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:msub><mml:mi>&#x003C4;</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>q</mml:mi><mml:mi>z</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>q</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>q</mml:mi></mml:msub><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The Recall&#x00040;k Surrogate loss from Equation (5) is differentiable and is used for training the parameters &#x003B8; of the deep embedding model. In practice, the Recall&#x00040;k Surrogate loss is re-scaled to have values between 0 and 1, by dividing it by min(<italic>k</italic>, |<italic>P</italic><sub><italic>q</italic></sub>|) instead of |<italic>P</italic><sub><italic>q</italic></sub>|, and by clipping the values larger than <italic>k</italic> in the numerator. The single-query loss to be minimized in a mini-batch <italic>B</italic>, with size |<italic>B</italic>|, and query <italic>q</italic>&#x02208;<italic>B</italic> is given by:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>\</mml:mo><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The final loss is computed by averaging the loss across multiple values of <italic>k</italic> as:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>K</mml:mi><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msup><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In practice, we use the following values: <italic>K</italic> &#x0003D; {1, 2, 4, 8, 16}. All examples in the mini-batch are used as queries, and the average loss over all queries is minimized during training.</p>
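<p>The following sketch evaluates Equations (3) to (7) for a single query in plain Python, to show how replacing the Heaviside step with a sigmoid turns the metric into a smooth quantity. It is an illustration rather than the implementation of Patel et al.: the function names are hypothetical, the temperatures &#x003C4;<sub>1</sub> &#x0003D; 1 and &#x003C4;<sub>2</sub> &#x0003D; 0.01 are assumed values, and the re-scaling by min(<italic>k</italic>, |<italic>P</italic><sub><italic>q</italic></sub>|) and the clipping step are omitted for brevity.</p>
<preformat><![CDATA[
```python
import math


def heaviside(u):
    """Heaviside step: 0 for negative values, otherwise 1."""
    return 1.0 if u >= 0 else 0.0


def sigmoid(u, tau):
    """Logistic function with temperature tau (Equation 4), computed stably."""
    z = u / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def recall_at_k(scores, is_positive, k, count_step, rank_step):
    """Equation (3) with Heaviside steps, Equation (5) with sigmoids.

    scores[i] is the similarity s_qx between the query and database sample i.
    """
    total = 0.0
    for x, s_qx in enumerate(scores):
        if not is_positive[x]:
            continue
        # Rank minus one: (soft) count of other samples scored at least as high.
        others = sum(rank_step(s_qz - s_qx)
                     for z, s_qz in enumerate(scores) if z != x)
        total += count_step(k - 1 - others)
    return total / sum(is_positive)


def surrogate_loss(scores, is_positive, ks=(1, 2, 4, 8, 16),
                   tau1=1.0, tau2=0.01):
    """Equations (6) and (7) for one query; tau1 and tau2 are assumed values."""
    losses = []
    for k in ks:
        smooth = recall_at_k(scores, is_positive, k,
                             count_step=lambda u: sigmoid(u, tau1),
                             rank_step=lambda u: sigmoid(u, tau2))
        losses.append(1.0 - smooth)
    return sum(losses) / len(losses)


scores = [0.9, 0.8, 0.3, 0.1]            # similarities to four database samples
is_positive = [True, False, True, False]  # membership in P_q
exact_r1 = recall_at_k(scores, is_positive, 1, heaviside, heaviside)
loss = surrogate_loss(scores, is_positive)
```
]]></preformat>
<p>In this toy database, one of the two positives is ranked first, so the exact Recall&#x00040;1 is 0.5. Moving the second positive ahead of the negatives lowers the surrogate loss, which is the behavior gradient descent exploits.</p>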
</sec>
<sec>
<title>4.2.4. Training</title>
<p>The training is set up for 100 epochs using the AdamW optimizer (Loshchilov and Hutter, <xref ref-type="bibr" rid="B29">2019</xref>) with an initial learning rate of 0.0001, which decreases by a factor of 0.3 using a step decay. For data augmentation, images are resized to 256 &#x000D7; 256, and a random crop of 224 &#x000D7; 224 is taken, followed by a random horizontal flip with a probability of 0.5 and normalization by mean and standard deviation. The mini-batch is constructed <italic>via</italic> class-balanced sampling with 4 samples per class, and a large batch size of 4,000 is used. To fit such a large batch within GPU memory limits, each training step uses two feed-forward passes (Patel et al., <xref ref-type="bibr" rid="B32">2021</xref>). The first feed-forward pass is performed on the batch of 4,000 samples in chunks of 200 samples at a time. All embedding vectors are stored while the intermediate features are discarded from the GPU memory. Using the embedding vectors and the ground truth labels, the loss (Equation 7) and the gradients for each sample with respect to the embedding vectors are calculated. Finally, a second feed-forward pass is performed, also in chunks of 200 samples at a time, allowing the propagation of the gradients through the deep embedding model for the current chunk of 200 samples. At the end of the second feed-forward stage, the model&#x00027;s weights are updated.</p>
</sec>
<sec>
<title>4.2.5. Test-time</title>
<p>At inference, the test image is resized to 256 &#x000D7; 256, and a central crop of 224 &#x000D7; 224 with normalization is the input to the deep embedding model. A feed-forward pass is performed through all the training and testing samples, and the embedding vectors are stored. Each test sample is treated as a query for retrieval, and the ten closest samples from the training set are obtained. A majority vote determines the semantic class of the test sample.</p>
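<p>The retrieval-and-vote step can be sketched as follows, assuming the embeddings have already been computed and <italic>L</italic><sub>2</sub>-normalized. The function name and the toy data are hypothetical. The tie-breaking rule (the tied class whose neighbor is closest to the query wins) is a property of this sketch and is not specified in the text.</p>
<preformat><![CDATA[
```python
from collections import Counter


def knn_predict(query, train_embeddings, train_labels, k=10):
    """Majority vote over the k training samples most similar to the query."""
    sims = [sum(a * b for a, b in zip(query, emb)) for emb in train_embeddings]
    # Indices of training samples sorted by decreasing dot-product similarity.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy unit-length embeddings for two hypothetical species.
train_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
train_labels = ["species_a", "species_a", "species_b", "species_b"]
prediction = knn_predict([1.0, 0.0], train_embeddings, train_labels, k=3)
```
]]></preformat>
<p>The default <italic>k</italic> &#x0003D; 10 mirrors the ten closest samples used at test time; the toy call uses <italic>k</italic> &#x0003D; 3 only because the example database has four samples.</p>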
</sec>
</sec>
<sec>
<title>4.3. Class prior estimation</title>
<p>In Machine Learning, it is commonly assumed that the class prior probabilities are the same for the training data and the test data. However, plant species distributions change dramatically depending on various factors, e.g., seasonality, geographic location, weather, and the hour of the day. The problem of adjusting CNN outputs to a change in class prior probabilities was discussed in Sulc and Matas (<xref ref-type="bibr" rid="B42">2019</xref>), where it was proposed to recompute the posterior probabilities (predictions) <italic>p</italic>(<italic>c</italic><sub><italic>k</italic></sub>|<bold>x</bold><sub><italic>i</italic></sub>) by Equation (8).</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M10"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mi>p</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>|</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>|</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>|</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo 
stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mi>p</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy='false'>|</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac></mml:mrow></mml:mfrac><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>&#x0221D;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>|</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo 
stretchy='false'>)</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>c</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The subscript <italic>e</italic> denotes probabilities on the evaluation/test set. The posterior probabilities <italic>p</italic>(<italic>c</italic><sub><italic>k</italic></sub>|<bold>x</bold><sub><italic>i</italic></sub>) are estimated by the Convolutional Neural Network outputs since it was trained with the cross-entropy loss. For class priors <italic>p</italic>(<italic>c</italic><sub><italic>k</italic></sub>), we have an empirical observation&#x02014;the class frequency in the training set. The evaluation and test set priors <italic>p</italic><sub><italic>e</italic></sub>(<italic>c</italic><sub><italic>k</italic></sub>) are, however, unknown. To evaluate the impact of changing class priors, we compare three existing prior estimation algorithms&#x02014;the Expectation&#x02013;maximization algorithm (EM) of Saerens et al. (<xref ref-type="bibr" rid="B38">2002</xref>) and the recently proposed CM-L and SCM-L methods of Sipka et al. (<xref ref-type="bibr" rid="B39">2022</xref>).</p>
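<p>For intuition, the re-weighting in Equation (8) can be written in a few lines, assuming the test-time priors are already known or estimated. The function name and the numbers are hypothetical and serve only as a worked example.</p>
<preformat><![CDATA[
```python
def adjust_posteriors(posteriors, train_priors, test_priors):
    """Equation (8): re-weight posteriors by p_e(c_k) / p(c_k) and renormalize."""
    weighted = [p * pe / pt
                for p, pt, pe in zip(posteriors, train_priors, test_priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


# A classifier that is undecided between two classes ...
posteriors = [0.5, 0.5]
# ... trained on balanced data, evaluated where class 0 is much more common.
adjusted = adjust_posteriors(posteriors,
                             train_priors=[0.5, 0.5],
                             test_priors=[0.9, 0.1])
```
]]></preformat>
<p>In this example the adjusted prediction follows the test-time priors, because the network output carries no evidence for either class.</p>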
<sec>
<title>4.3.1. EM&#x02014;expectation maximization</title>
<p>In our ExpertLifeCLEF 2018 challenge submissions, we followed the proposition from Sulc and Matas (<xref ref-type="bibr" rid="B42">2019</xref>) to use the EM algorithm of Saerens et al. (<xref ref-type="bibr" rid="B38">2002</xref>) for the estimation of test set priors by maximization of the likelihood of the test observations. The E and M steps are described by Equation (9), where the superscripts (<italic>s</italic>) or (<italic>s</italic> &#x0002B; 1) denote the step of the EM algorithm.</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant='bold-italic'><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant='bold-italic'><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In our submissions, we estimated the class prior probabilities for the whole test set. However, one may also consider estimating different class priors for different locations, based on the GPS coordinates of the observations. Moreover, as discussed by Sulc and Matas (<xref ref-type="bibr" rid="B42">2019</xref>), this procedure can be used even when new test samples arrive sequentially.</p>
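<p>As an illustration only, the EM iteration of Equation (9) can be sketched in plain Python. The function name <monospace>em_prior_estimation</monospace>, the stopping rule (<monospace>n_iter</monospace>, <monospace>tol</monospace>), and the initialization of the estimated priors with the training priors are assumptions of this sketch, not details taken from our submissions.</p>
<preformat><![CDATA[
```python
def em_prior_estimation(posteriors, train_priors, n_iter=100, tol=1e-8):
    """Estimate test-set class priors with the EM updates of Equation (9).

    posteriors:   list of N rows, each holding K classifier outputs p(c_k | x_i).
    train_priors: list of K training-set priors p(c_k), all assumed to be > 0.
    """
    n, k = len(posteriors), len(train_priors)
    est = list(train_priors)  # assumed initialization: p_e^(0)(c_k) = p(c_k)
    for _ in range(n_iter):
        ratio = [est[j] / train_priors[j] for j in range(k)]
        new_est = [0.0] * k
        for row in posteriors:
            # E-step: re-weight the posterior by the prior ratio, then normalize.
            weighted = [row[j] * ratio[j] for j in range(k)]
            total = sum(weighted)
            # M-step (accumulated): average the adjusted posteriors over the test set.
            for j in range(k):
                new_est[j] += weighted[j] / total / n
        change = max(abs(new_est[j] - est[j]) for j in range(k))
        est = new_est
        if change < tol:  # hypothetical stopping rule
            break
    return est
```
]]></preformat>
<p>For example, with uniform training priors over two classes and a test set on which three of four samples are confidently assigned to the first class, the estimate moves to 0.75 and 0.25.</p>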
</sec>
<sec>
<title>4.3.2. CM-L&#x02014;confusion matrix based likelihood maximization</title>
<p>The prior estimate is based on maximizing the likelihood of the observed classifier decisions. The CM-L method uses the classifier&#x00027;s <italic>confusion matrix</italic> (CM) in the format <bold>C</bold><sub><italic>d</italic>|<italic>y</italic></sub>, where the value in the <italic>k</italic>-th column and <italic>i</italic>-th row is the probability <italic>p</italic>(<italic>D</italic> &#x0003D; <italic>i</italic>|<italic>Y</italic> &#x0003D; <italic>k</italic>) of the classifier deciding for class <italic>i</italic> when the true class is <italic>k</italic>. The new class priors <bold>P</bold> are then estimated by maximizing the log-likelihood with the following objective:</p>
<disp-formula id="E10"><label>(10a)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:mover accent="true"><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mstyle><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">arg&#x000A0;max</mml:mo></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>P</mml:mtext></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:mi>&#x02113;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>P</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">arg&#x000A0;max</mml:mo></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>P</mml:mtext></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mo>:</mml:mo></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>P</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E11"><label>(10b)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">s.t.:</mml:mtext></mml:mtd><mml:mtd><mml:mtext>&#x02003;</mml:mtext><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>;</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:mo>&#x02200;</mml:mo><mml:mi>k</mml:mi><mml:mo>:</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02265;</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>n</italic><sub><italic>k</italic></sub> is the number of the classifier&#x00027;s decisions for class <italic>k</italic> on the test set and <bold>C</bold><sub><italic>k</italic>, :</sub> is the <italic>k</italic>-th row of the confusion matrix.</p>
<p>The SCM-L method works analogously, but uses the so-called <italic>soft confusion matrix</italic> (SCM) <inline-formula><mml:math id="M15"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>C</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mo>|</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">soft</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup></mml:math></inline-formula> estimated from the classifier&#x00027;s soft predictions <bold>f</bold> as</p>
<disp-formula id="E12"><label>(11)</label><mml:math id="M16"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x00108;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo>:</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">soft</mml:mtext></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>:</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mstyle mathvariant="bold"><mml:mtext>f</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M17"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>&#x00108;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo>:</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">soft</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the <italic>k</italic>-th column of SCM. The probability <inline-formula><mml:math id="M18"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">E</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">soft</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> can be estimated by averaging predictions <bold>f</bold>(<bold>x</bold>) over the test set.</p>
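<p>The two ingredients above can be sketched in plain Python: the soft confusion matrix of Equation (11) and the maximization of objective (10a) under the constraints (10b). The function names are hypothetical, and the multiplicative EM-style update used as the optimizer is an assumption of this sketch (the text above does not prescribe a solver); it keeps the priors non-negative and summing to one, provided each column of the confusion matrix sums to one.</p>
<preformat><![CDATA[
```python
def soft_confusion_matrix(soft_preds, labels, num_classes):
    """Column k is the mean soft prediction f(x_i) over samples with y_i = k."""
    cm = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for f, y in zip(soft_preds, labels):
        counts[y] += 1
        for i in range(num_classes):
            cm[i][y] += f[i]
    for k in range(num_classes):
        if counts[k] > 0:  # classes without samples keep an all-zero column
            for i in range(num_classes):
                cm[i][k] /= counts[k]
    return cm


def estimate_priors(cm, decisions, n_iter=1000):
    """Maximize sum_k n_k * log(C[k, :] . P) over the probability simplex.

    cm[i][k] approximates p(D = i | Y = k); decisions[k] is n_k.
    Assumes C[k, :] . P stays positive for every class with n_k > 0.
    """
    num_classes = len(cm)
    total = float(sum(decisions))
    priors = [1.0 / num_classes] * num_classes  # assumed uniform start
    for _ in range(n_iter):
        # Decision distribution implied by the current priors: C[k, :] . P
        mix = [sum(cm[k][j] * priors[j] for j in range(num_classes))
               for k in range(num_classes)]
        priors = [
            priors[j] * sum(decisions[k] * cm[k][j] / mix[k]
                            for k in range(num_classes) if decisions[k] > 0) / total
            for j in range(num_classes)
        ]
    return priors
```
]]></preformat>
<p>With a confusion matrix whose columns are (0.9, 0.1) and (0.2, 0.8) and decision counts of 41 and 59, the estimate approaches the priors 0.3 and 0.7, which reproduce the observed decision frequencies exactly.</p>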
</sec>
</sec>
</sec>
<sec sec-type="results" id="s5">
<title>5. Results</title>
<p>First, we compare the state-of-the-art Convolutional Neural Networks and Vision Transformers in Section 5.1. Second, we evaluate the image retrieval approach to classification and compare it with the standard classifiers in Section 5.2. Finally, additional techniques for performance improvements are evaluated in Section 5.3.</p>
<sec>
<title>5.1. Image classification</title>
<sec>
<title>5.1.1. Combining several predictions per observation</title>
<p>LifeCLEF datasets include sets of images belonging to the same specimen observation. Typically, the images represent different organs of the specimen, e.g., flower or leaf. Such sets of images are connected by the ObservationID values provided in the metadata. The PlantCLEF 2017 test set contains 17,868 observations and 25,170 images. The ExpertLifeCLEF 2018 test set is smaller, with 2,072 observations and 6,892 images. Plant species prediction based on multiple images is intuitive; it is inspired by the process used for years by botanists. Four simple approaches to combining per-image predictions are evaluated. Each approach selects the class with:</p>
<list list-type="bullet">
<list-item><p><bold>Max softmax</bold>: maximum posterior probability estimate (softmax) over all images, i.e., follow the most confident prediction,</p></list-item>
<list-item><p><bold>Mean softmax</bold>: maximum average (over images) estimated posterior probability,</p></list-item>
<list-item><p><bold>Max logits</bold>: maximum activation value (logit) over all images,</p></list-item>
<list-item><p><bold>Mean logits</bold>: maximum average (over images) logit value.</p></list-item>
</list>
<p>The best results for species prediction combination were achieved by selecting the species with the maximum mean logit value. For the single ViT-Base/32 model and an image size of 224 &#x000D7; 224, the Mean logits approach outperformed Max softmax by 0.86% on PlantCLEF 2017 and 4.59% on ExpertLifeCLEF 2018. Overall, the accuracy is significantly higher for observations than for single images, in some cases increasing by more than 20%. Full results are shown in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
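<p>The four combination strategies can be sketched in plain Python as follows. The function and strategy names are hypothetical and chosen for illustration; the input is assumed to be one logit vector per image of the same observation.</p>
<preformat><![CDATA[
```python
import math

STRATEGIES = {"max_softmax", "mean_softmax", "max_logits", "mean_logits"}


def softmax(logits):
    """Numerically stable softmax of a single logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine(logits_per_image, strategy):
    """Return the predicted class index for one observation.

    logits_per_image: list of per-image logit vectors, all of length K.
    strategy: one of the names in STRATEGIES.
    """
    if strategy not in STRATEGIES:
        raise ValueError("unknown strategy: " + strategy)
    if strategy.endswith("softmax"):
        scores = [softmax(logits) for logits in logits_per_image]
    else:
        scores = logits_per_image
    per_class = list(zip(*scores))  # one tuple of per-image scores per class
    if strategy.startswith("max"):
        pooled = [max(values) for values in per_class]
    else:
        pooled = [sum(values) / len(values) for values in per_class]
    return max(range(len(pooled)), key=pooled.__getitem__)
```
]]></preformat>
<p>The strategies can disagree: for an observation with the two logit vectors (10, 9, 0) and (0, 2, 0), Max logits selects class 0, while the other three strategies select class 1.</p>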
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Classification accuracy on the PlantCLEF 2017 and the ExpertLifeCLEF 2018 datasets for different image prediction combination strategies.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Architecture</bold></th>
<th valign="top" align="center"><bold>Test set</bold></th>
<th valign="top" align="center"><bold>Image-wise</bold></th>
<th valign="top" align="center"><bold>Max Softmax</bold></th>
<th valign="top" align="center"><bold>Mean Softmax</bold></th>
<th valign="top" align="center"><bold>Max Logits</bold></th>
<th valign="top" align="center"><bold>Mean Logits</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">EfficientNetV2-S</td>
<td valign="top" align="center">2017</td>
<td valign="top" align="center">79.21</td>
<td valign="top" align="center">84.35</td>
<td valign="top" align="center">85.26</td>
<td valign="top" align="center">85.54</td>
<td valign="top" align="center"><bold>85.75</bold></td>
</tr>
<tr>
<td valign="top" align="left">EfficientNetV2-S</td>
<td valign="top" align="center">2018</td>
<td valign="top" align="center">53.08</td>
<td valign="top" align="center">67.28</td>
<td valign="top" align="center">70.32</td>
<td valign="top" align="center">72.25</td>
<td valign="top" align="center"><bold>74.13</bold></td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/32</td>
<td valign="top" align="center">2017</td>
<td valign="top" align="center">73.50</td>
<td valign="top" align="center">80.43</td>
<td valign="top" align="center">80.55</td>
<td valign="top" align="center">80.79</td>
<td valign="top" align="center"><bold>81.29</bold></td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/32</td>
<td valign="top" align="center">2018</td>
<td valign="top" align="center">49.36</td>
<td valign="top" align="center">66.94</td>
<td valign="top" align="center">66.84</td>
<td valign="top" align="center">68.87</td>
<td valign="top" align="center"><bold>71.53</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Convolutional neural networks:</bold> The comparison of former and recent state-of-the-art CNN architectures on the PlantCLEF 2017 and the ExpertLifeCLEF 2018 test sets shows behavior similar to that on other fine-grained datasets (Wah et al., <xref ref-type="bibr" rid="B49">2011</xref>; Van Horn et al., <xref ref-type="bibr" rid="B47">2018</xref>; Picek et al., <xref ref-type="bibr" rid="B34">2022</xref>). The best performing CNN on both datasets is EfficientNetV2-L, with 77.03% accuracy on ExpertLifeCLEF 2018 and 88.52% accuracy on PlantCLEF 2017. Other deep networks, including ResNeSt-269e and SE-ResNeXt-101, underperformed by a significant margin. The achieved scores are summarized in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Image classification accuracy for Deep Neural Network Classifiers on the PlantCLEF 2017 (right) and ExpertLifeCLEF 2018 (left) test sets.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left" colspan="2"></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>ExpertLifeCLEF 2018&#x02014;Accuracy [%]</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>PlantCLEF 2017&#x02014;Accuracy [%]</bold></th>
</tr>
<tr>
<th/>
</tr>
</thead>
<tbody>
 <tr>
<td valign="top" align="left"><bold>Architecture</bold></td>
<td valign="top" align="center"><bold>Input</bold></td>
<td valign="top" align="center"><bold>Images</bold></td>
<td valign="top" align="center"><bold>Observations</bold></td>
<td valign="top" align="center"><bold>Images</bold></td>
<td valign="top" align="center"><bold>Observations</bold></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">ResNet-50</td>
<td valign="top" align="center">224 &#x000D7; 224</td>
<td valign="top" align="center">40.03</td>
<td valign="top" align="center">56.32</td>
<td valign="top" align="center">68.00</td>
<td valign="top" align="center">74.57</td>
</tr>
<tr>
<td valign="top" align="left">Inception-v4</td>
<td valign="top" align="center">224 &#x000D7; 224</td>
<td valign="top" align="center">43.41</td>
<td valign="top" align="center">59.41</td>
<td valign="top" align="center">71.32</td>
<td valign="top" align="center">77.92</td>
</tr>
<tr>
<td valign="top" align="left">Inception-Resnet-V2</td>
<td valign="top" align="center">224 &#x000D7; 224</td>
<td valign="top" align="center">44.14</td>
<td valign="top" align="center">68.15</td>
<td valign="top" align="center">70.57</td>
<td valign="top" align="center">78.96</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/32</td>
<td valign="top" align="center">224 &#x000D7; 224</td>
<td valign="top" align="center">49.36</td>
<td valign="top" align="center">71.53</td>
<td valign="top" align="center">73.50</td>
<td valign="top" align="center">81.29</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/16</td>
<td valign="top" align="center">224 &#x000D7; 224</td>
<td valign="top" align="center">51.58</td>
<td valign="top" align="center">73.70</td>
<td valign="top" align="center">75.54</td>
<td valign="top" align="center">82.57</td>
</tr>
<tr>
<td valign="top" align="left">EfficientNetV2-S</td>
<td valign="top" align="center">224 &#x000D7; 224</td>
<td valign="top" align="center"><bold>53.08</bold></td>
<td valign="top" align="center"><bold>74.13</bold></td>
<td valign="top" align="center"><bold>79.21</bold></td>
<td valign="top" align="center"><bold>85.75</bold></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">ViT-Tiny/16</td>
<td valign="top" align="center">384 &#x000D7; 384</td>
<td valign="top" align="center">47.43</td>
<td valign="top" align="center">69.06</td>
<td valign="top" align="center">73.64</td>
<td valign="top" align="center">80.59</td>
</tr>
<tr>
<td valign="top" align="left">SE-ResNeXt-101</td>
<td valign="top" align="center">384 &#x000D7; 384</td>
<td valign="top" align="center">54.61</td>
<td valign="top" align="center">73.75</td>
<td valign="top" align="center">80.31</td>
<td valign="top" align="center">85.98</td>
</tr>
<tr>
<td valign="top" align="left">ResNeSt-269e</td>
<td valign="top" align="center">384 &#x000D7; 384</td>
<td valign="top" align="center">56.27</td>
<td valign="top" align="center">74.52</td>
<td valign="top" align="center">81.68</td>
<td valign="top" align="center">86.74</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/16</td>
<td valign="top" align="center">384 &#x000D7; 384</td>
<td valign="top" align="center">58.49</td>
<td valign="top" align="center">77.03</td>
<td valign="top" align="center">82.28</td>
<td valign="top" align="center">87.75</td>
</tr>
<tr>
<td valign="top" align="left">EfficientNetV2-L</td>
<td valign="top" align="center">384 &#x000D7; 384</td>
<td valign="top" align="center">59.90</td>
<td valign="top" align="center">77.03</td>
<td valign="top" align="center">84.15</td>
<td valign="top" align="center">88.52</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Large/16</td>
<td valign="top" align="center">384 &#x000D7; 384</td>
<td valign="top" align="center"><bold>67.03</bold></td>
<td valign="top" align="center"><bold>83.54</bold></td>
<td valign="top" align="center"><bold>86.87</bold></td>
<td valign="top" align="center"><bold>91.15</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Observation values calculated as Mean Logits.</p>
</table-wrap-foot>
</table-wrap>
<p><bold>Vision transformers:</bold> To assess the performance of ViT in the FGVC domain, multiple architectures were evaluated at two input resolutions (224 &#x000D7; 224 and 384 &#x000D7; 384) on two test sets (PlantCLEF 2017 and ExpertLifeCLEF 2018). More precisely, ViT-Base/16 and ViT-Base/32 are compared at the input size of 224 &#x000D7; 224, and ViT-Large/16, ViT-Base/16, and ViT-Tiny/16 are tested at the input size of 384 &#x000D7; 384.</p>
<p>In the 384 &#x000D7; 384 scenario, ViT-Large/16 outperformed the best CNN model (EfficientNetV2-L) by 2.63% points on PlantCLEF 2017 and by 6.51% points on ExpertLifeCLEF 2018, reducing the error by 22.91% and 28.34%, respectively. In the 224 &#x000D7; 224 scenario, the relative performance differed; EfficientNetV2-S outperformed all the models, including both Vision Transformers, on the PlantCLEF 2017 dataset. The comparison on the ExpertLifeCLEF 2018 dataset shows an insignificant performance difference between ViT-Base/16 and EfficientNetV2-S.</p>
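<p>The relative error reductions can be checked against the observation-level accuracies in <xref ref-type="table" rid="T3">Table 3</xref>. The sketch below uses a hypothetical helper and assumes that the reduction is the accuracy gain divided by the baseline error, applied to the 384 &#x000D7; 384 rows for EfficientNetV2-L and ViT-Large/16.</p>
<preformat><![CDATA[
```python
def error_reduction(acc_baseline, acc_new):
    """Relative reduction of the error rate in %, for accuracies given in %."""
    return 100.0 * (acc_new - acc_baseline) / (100.0 - acc_baseline)


# Observation-level accuracies of EfficientNetV2-L and ViT-Large/16 (Table 3).
print(round(error_reduction(88.52, 91.15), 2))  # PlantCLEF 2017: 22.91
print(round(error_reduction(77.03, 83.54), 2))  # ExpertLifeCLEF 2018: 28.34
```
]]></preformat>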
</sec>
</sec>
<sec>
<title>5.2. Classification vs. metric learning</title>
<p>This section compares two setups: training a softmax image classifier explicitly, as in the previous experiments, and training an image retrieval system that is subsequently used for nearest neighbor classification. The image resolution, pre-trained weights, and number of training epochs are kept the same across the two setups for a fair comparison. Even though we compare both methods under the same conditions, those conditions handicap the standard image classification approach, as no additional techniques are permitted.</p>
<p>Overall, the retrieval approach achieved superior performance in all measured scenarios. Notably, the ViT-Base/16 feature extractor architecture achieved a higher classification accuracy, with margins of 0.28, 4.13, and 10.25% on ExpertLifeCLEF 2018, PlantCLEF 2017, and iNat2018&#x02013;Plantae, respectively. Moreover, the macro-F1 margin is noticeably higher: 1.85% for ExpertLifeCLEF 2018 and 12.23% for iNat2018&#x02013;Plantae. Even though the standard classification approach performs better on classes with fewer samples (refer to <xref ref-type="fig" rid="F4">Figure 4</xref>), common species with high a priori probability are frequently predicted wrongly. This is primarily due to the high class imbalance preserved in the dataset, which is mimicked by the deep neural network optimized <italic>via</italic> the softmax cross-entropy loss. Thus, the standard image classification approach performs considerably worse in terms of the macro-F1 score. A full comparison of the classification and retrieval-based methods and their recognition scores is given in <xref ref-type="table" rid="T4">Table 4</xref>. Three architectures (ResNet-50, ViT-Base/32, and ViT-Base/16) are evaluated. The results show that, for all selected architectures, retrieval leads to better performance. Furthermore, in <xref ref-type="fig" rid="F5">Figure 5</xref>, we provide qualitative examples from the retrieval approach on the iNaturalist dataset. The Top5 predictions for randomly selected target images show that the retrieval-like approach allows better interpretability of the results.</p>
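<p>To make the retrieval-based classification step concrete, the following plain Python sketch assigns a label to a query embedding from its nearest training embeddings. The cosine similarity, the majority vote, the default <monospace>k=5</monospace>, and the function names are assumptions made for illustration and may differ from the evaluated setup.</p>
<preformat><![CDATA[
```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_classify(query, train_embeddings, train_labels, k=5):
    """Majority vote among the k training embeddings most similar to `query`.

    Returns the predicted label and the indices of the retrieved neighbors,
    ordered from most to least similar.
    """
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda i: cosine(query, train_embeddings[i]),
        reverse=True,
    )
    neighbors = ranked[:k]
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0], neighbors
```
]]></preformat>
<p>Returning the neighbor indices alongside the label mirrors the interpretability argument above: the retrieved training images can be shown to the user next to the prediction.</p>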
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Classification performance (F1 and Accuracy) shown as box plots for three backbone architectures and for the Classification and Retrieval approaches. Tested on the PlantCLEF 2017 test set with an input resolution of 224 &#x000D7; 224.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-787527-g0004.tif"/>
</fig>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Performance evaluation for Classification (C) and Retrieval (R) based methods.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left" colspan="2"></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>ExpertLifeCLEF 2018</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>PlantCLEF 2017</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>iNat2018&#x02013;Plantae</bold></th>
</tr>
<tr>
<th/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Architecture</bold></td>
<td valign="top" align="center"><bold>Method</bold></td>
<td valign="top" align="center"><bold>Acc</bold></td>
<td valign="top" align="center"><bold>Macro F1</bold></td>
<td valign="top" align="center"><bold>Acc</bold></td>
<td valign="top" align="center"><bold>Macro F1</bold></td>
<td valign="top" align="center"><bold>Acc</bold></td>
<td valign="top" align="center"><bold>Macro F1</bold></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">ResNet-50</td>
<td valign="top" align="center">C</td>
<td valign="top" align="center">59.87</td>
<td valign="top" align="center">55.11</td>
<td valign="top" align="center">77.89</td>
<td valign="top" align="center">54.48</td>
<td valign="top" align="center">57.73</td>
<td valign="top" align="center">52.69</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/32</td>
<td valign="top" align="center">C</td>
<td valign="top" align="center">65.21</td>
<td valign="top" align="center">60.29</td>
<td valign="top" align="center">80.68</td>
<td valign="top" align="center">59.18</td>
<td valign="top" align="center">57.24</td>
<td valign="top" align="center">53.17</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/16</td>
<td valign="top" align="center">C</td>
<td valign="top" align="center">71.71</td>
<td valign="top" align="center">67.35</td>
<td valign="top" align="center">84.48</td>
<td valign="top" align="center">65.40</td>
<td valign="top" align="center">67.42</td>
<td valign="top" align="center">64.51</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">ResNet-50</td>
<td valign="top" align="center">R</td>
<td valign="top" align="center">60.15</td>
<td valign="top" align="center">56.30</td>
<td valign="top" align="center">80.27</td>
<td valign="top" align="center">55.57</td>
<td valign="top" align="center">57.95</td>
<td valign="top" align="center">56.32</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/32</td>
<td valign="top" align="center">R</td>
<td valign="top" align="center">66.48</td>
<td valign="top" align="center">61.49</td>
<td valign="top" align="center">84.89</td>
<td valign="top" align="center">60.79</td>
<td valign="top" align="center">63.12</td>
<td valign="top" align="center">61.24</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Base/16</td>
<td valign="top" align="center">R</td>
<td valign="top" align="center"><bold>71.99</bold></td>
<td valign="top" align="center"><bold>69.20</bold></td>
<td valign="top" align="center"><bold>88.61</bold></td>
<td valign="top" align="center"><bold>66.39</bold></td>
<td valign="top" align="center"><bold>77.67</bold></td>
<td valign="top" align="center"><bold>76.74</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>All models were trained for 100 epochs with fixed image size (224 &#x000D7; 224). No test-time augmentations were used. The most confident image prediction is used for all images belonging to the same observation.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Qualitative examples from the retrieval approach on the iNaturalist dataset. The leftmost column shows samples from the test set followed by five nearest neighbors in the learned embedding space from the training set. The red box denotes the wrong species.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpls-13-787527-g0005.tif"/>
</fig>
</sec>
<sec>
<title>5.3. A fine-tuning cookbook</title>
<p>In this section, we evaluate several methods that have the potential to considerably increase performance for almost any deep neural network architecture. The evaluation considers different loss functions, learning rate schedulers, prior estimation methods, and augmentations. Furthermore, the impact of noisy data and the contribution of test-time augmentations are studied. We list both the helpful methods and those that worsen performance if utilized. The evaluation is carried out on the PlantCLEF 2017 and ExpertLifeCLEF 2018 datasets with the ViT-Base/32 architecture and an input size of 224 &#x000D7; 224, unless stated otherwise. All used methods are described below. The ablation study for relevant methods is summarized in <xref ref-type="table" rid="T5">Table 5</xref>.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Ablation study considering different techniques for ViT-Base/32 performance improvements.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="3"></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>Test 2018 - Acc [%]</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>Test 2017 - Acc [%]</bold></th>
</tr>
<tr>
<th/>
</tr>
</thead>
<tbody>
 <tr>
<td valign="top" align="center"><bold>TTA</bold></td>
<td valign="top" align="center"><bold>CCA</bold></td>
<td valign="top" align="center"><bold>RC</bold></td>
<td valign="top" align="center"><bold>Images</bold></td>
<td valign="top" align="center"><bold>Observations</bold></td>
<td valign="top" align="center"><bold>Images</bold></td>
<td valign="top" align="center"><bold>Observations</bold></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">49.59</td>
<td valign="top" align="center">71.62</td>
<td valign="top" align="center">73.59</td>
<td valign="top" align="center">81.29</td>
</tr>
<tr>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x0002B;2.51</td>
<td valign="top" align="center">&#x0002B;1.98</td>
<td valign="top" align="center">&#x0002B;5.38</td>
<td valign="top" align="center">&#x0002B;4.65</td>
</tr>
<tr>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x0002B;0.32</td>
<td valign="top" align="center">&#x0002B;1.06</td>
<td valign="top" align="center">&#x0002B;0.70</td>
<td valign="top" align="center">&#x0002B;0.80</td>
</tr>
<tr>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02013;0.48</td>
<td valign="top" align="center">&#x0002B;1.30</td>
<td valign="top" align="center">&#x0002B;3.82</td>
<td valign="top" align="center">&#x0002B;3.86</td>
</tr>
<tr>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02013;0.10</td>
<td valign="top" align="center">&#x0002B;1.93</td>
<td valign="top" align="center">&#x0002B;3.83</td>
<td valign="top" align="center">&#x0002B;3.89</td>
</tr>
<tr>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x0002B;2.44</td>
<td valign="top" align="center">&#x0002B;2.51</td>
<td valign="top" align="center">&#x0002B;5.22</td>
<td valign="top" align="center">&#x0002B;4.22</td>
</tr>
<tr>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x000D7;</td>
<td valign="top" align="center"><bold>&#x0002B;3.01</bold></td>
<td valign="top" align="center"><bold>&#x0002B;3.72</bold></td>
<td valign="top" align="center">&#x0002B;5.16</td>
<td valign="top" align="center">&#x0002B;4.38</td>
</tr>
<tr>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x0002B;2.83</td>
<td valign="top" align="center">&#x0002B;2.85</td>
<td valign="top" align="center"><bold>&#x0002B;5.68</bold></td>
<td valign="top" align="center"><bold>&#x0002B;4.67</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><bold>Cyclic cosine annealing:</bold> We compare three learning rate schedules: standard cosine annealing, a custom adaptive strategy where the learning rate is decayed by 10% if the validation loss is not reduced for two epochs, and Cyclic Cosine Annealing (CCA). The CCA is an alternative to standard learning rate scheduling approaches, e.g., Exponential, Linear, Step, and Cosine. The CCA schedule is divided into multiple cycles: the starting learning rate of each cycle decreases by 20%, and the learning rate within each cycle decreases <italic>via</italic> the standard cosine function. Such a learning rate schedule allows for escaping local minima and searching for better optima. Using the CCA instead of the standard approaches, we measured relative performance increases of &#x0002B;1.06 and &#x0002B;0.80% on the ExpertLifeCLEF 2018 and LifeCLEF 2017, respectively.</p>
<p><bold>Test-time augmentations:</bold> Test-time augmentation (TTA) is a procedure where several transformed versions of the original image, e.g., at different rotations or scales, are fed forward through the deep neural network. In our case, we use a simple procedure in which each test image is processed as a batch of 13 images:</p>
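<p>As an illustration, the CCA schedule described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function name <monospace>cyclic_cosine_lr</monospace>, the parameters <monospace>steps_per_cycle</monospace>, <monospace>base_lr</monospace>, and <monospace>min_lr</monospace>, and the reading that the 20% decrease applies to the starting learning rate of the previous cycle are all hypothetical.</p>
<preformat><![CDATA[
```python
import math


def cyclic_cosine_lr(step, steps_per_cycle, base_lr, cycle_decay=0.8, min_lr=0.0):
    """Learning rate at `step` for cosine annealing with decaying restarts.

    Assumption: each cycle starts at 80% of the previous cycle's starting
    learning rate (cycle_decay=0.8) and anneals toward min_lr.
    """
    cycle = step // steps_per_cycle                    # index of the current cycle
    pos = (step % steps_per_cycle) / steps_per_cycle   # position in the cycle, in [0, 1)
    start_lr = base_lr * cycle_decay ** cycle          # starting LR of this cycle
    # Standard cosine decay from start_lr toward min_lr within the cycle.
    return min_lr + 0.5 * (start_lr - min_lr) * (1.0 + math.cos(math.pi * pos))


if __name__ == "__main__":
    # Print the schedule for three cycles of ten steps each.
    for step in range(30):
        print(step, round(cyclic_cosine_lr(step, steps_per_cycle=10, base_lr=0.1), 5))
```
]]></preformat>
<p>Each restart raises the learning rate again, which is what lets the optimizer leave a local minimum, while the decaying starting value keeps later cycles progressively more conservative.</p>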
<list list-type="bullet">
<list-item><p>One original image (resized to 224 &#x000D7; 224 or 384 &#x000D7; 384),</p></list-item>
<list-item><p>Four central crops covering 90, 80, and 70% of the original image size,</p></list-item>
<list-item><p>Two top left corner crops covering 80 and 70% of the original image size,</p></list-item>
<list-item><p>Two top right corner crops covering 80 and 70% of the original image size,</p></list-item>
<list-item><p>Two bottom left corner crops covering 80 and 70% of the original image size,</p></list-item>
<list-item><p>Two bottom right corner crops covering 80 and 70% of the original image size.</p></list-item>
</list>
<p>The predictions from all 13 cropped/augmented images are then combined. The results in <xref ref-type="table" rid="T5">Table 5</xref> show that test-time augmentation improves the classification accuracy by up to 1.98 and 4.65% on the ExpertLifeCLEF 2018 and LifeCLEF 2017, respectively.</p>
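<p>The cropping and combination steps can be sketched as below. The sketch makes several assumptions that are not taken from the text: the crop percentage is read as a fraction of the side length, the per-view softmax outputs are combined by averaging, and the helper names <monospace>tta_crop_boxes</monospace> and <monospace>combine_predictions</monospace> are hypothetical. Only three central crop sizes are listed above, so the sketch produces 12 views; the remaining view is left unspecified.</p>
<preformat><![CDATA[
```python
import numpy as np


def tta_crop_boxes(width, height, central=(0.9, 0.8, 0.7), corner=(0.8, 0.7)):
    """Return (left, top, right, bottom) boxes for the test-time views.

    Assumption: a crop "covering f of the image size" keeps a fraction f
    of both the width and the height.
    """
    boxes = [(0, 0, width, height)]  # the original image
    for f in central:
        w, h = round(width * f), round(height * f)
        left, top = (width - w) // 2, (height - h) // 2
        boxes.append((left, top, left + w, top + h))
    for f in corner:
        w, h = round(width * f), round(height * f)
        boxes.append((0, 0, w, h))                             # top left
        boxes.append((width - w, 0, width, h))                 # top right
        boxes.append((0, height - h, w, height))               # bottom left
        boxes.append((width - w, height - h, width, height))   # bottom right
    return boxes


def combine_predictions(softmax_outputs):
    """Average per-view softmax vectors (views x classes) into one prediction."""
    return np.mean(np.asarray(softmax_outputs, dtype=float), axis=0)


if __name__ == "__main__":
    for box in tta_crop_boxes(384, 384):
        print(box)
```
]]></preformat>
<p>Each box would be cropped from the test image, resized to the network input size, and classified; the resulting softmax vectors are then passed to <monospace>combine_predictions</monospace>.</p>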
<p><bold>Random crop:</bold> Random cropping allows for learning a more detailed object representation, as the image is not resized to a smaller resolution. Furthermore, training with random crops has high synergy with the test-time augmentation process if crops of similar size are used for TTA. With random crop alone, we measured performance increases of &#x0002B;1.30 and &#x0002B;3.86% on the ExpertLifeCLEF 2018 and LifeCLEF 2017, respectively. Combining with TTA, the margin increased to &#x0002B;1.93 and &#x0002B;3.89%, respectively.</p>
<p><bold>Prior shift adaptation:</bold> The prior shift adaptation methods described in Sections 4.3.1 and 4.3.2 are compared in <xref ref-type="table" rid="T6">Table 6</xref>. Prior shift adaptation is applied to the prediction for each test-time augmentation, before the predictions are averaged over the augmentations and over the images of each observation. The results show that prior shift adaptation improves the recognition accuracy in all cases. The EM algorithm of Saerens et al. (<xref ref-type="bibr" rid="B38">2002</xref>) achieves the best result in three cases and the CM-L method of Sipka et al. (<xref ref-type="bibr" rid="B39">2022</xref>) in one case, but the differences among the three compared prior shift adaptation methods are very small.</p>
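<p>For illustration, the EM procedure of Saerens et al. (<xref ref-type="bibr" rid="B38">2002</xref>) can be sketched as follows: the E-step re-weights the classifier posteriors by the ratio of the current test prior estimate to the training prior, and the M-step sets the new test prior estimate to the mean of the re-weighted posteriors. The function name <monospace>em_prior_shift</monospace>, the iteration limit, and the stopping tolerance are assumptions of this sketch and do not describe the settings used in the experiments.</p>
<preformat><![CDATA[
```python
import numpy as np


def em_prior_shift(posteriors, train_priors, n_iter=100, tol=1e-8):
    """Estimate test-set class priors and adapt posteriors with EM.

    posteriors:   (N, K) softmax outputs of a classifier on the test set
    train_priors: (K,) class frequencies of the training set
    Returns (estimated test priors, adapted posteriors).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    train_priors = np.asarray(train_priors, dtype=float)
    test_priors = train_priors.copy()  # initialise with the training priors

    def adapt(priors):
        # E-step: re-weight by the prior ratio and renormalise each row.
        weighted = posteriors * (priors / train_priors)
        return weighted / weighted.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        new_priors = adapt(test_priors).mean(axis=0)  # M-step
        converged = np.abs(new_priors - test_priors).max() < tol
        test_priors = new_priors
        if converged:
            break
    return test_priors, adapt(test_priors)


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
    priors, adapted = em_prior_shift(probs, train_priors=[0.5, 0.5])
    print(priors)
    print(adapted)
```
]]></preformat>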
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Accuracy before and after prior shift adaptation with the EM algorithm (Saerens et al., <xref ref-type="bibr" rid="B38">2002</xref>) and the (S)CM-L methods (Sipka et al., <xref ref-type="bibr" rid="B39">2022</xref>) on the ExpertLifeCLEF 2018 and the PlantCLEF 2017 test sets.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Architecture</bold></th>
<th valign="top" align="left"><bold>Test set</bold></th>
<th valign="top" align="center"><bold>EM</bold></th>
<th valign="top" align="center"><bold>CM-L</bold></th>
<th valign="top" align="center"><bold>SCM-L</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">ViT-Large/16</td>
<td valign="top" align="left">PlantCLEF 2017</td>
<td valign="top" align="center">&#x0002B;1.17</td>
<td valign="top" align="center"><bold>&#x0002B;1.25</bold></td>
<td valign="top" align="center">&#x0002B;0.66</td>
</tr>
<tr>
<td valign="top" align="left">ViT-Large/16</td>
<td valign="top" align="left">ExpertLifeCLEF 2018</td>
<td valign="top" align="center"><bold>&#x0002B;2.21</bold></td>
<td valign="top" align="center">&#x0002B;1.83</td>
<td valign="top" align="center">&#x0002B;1.64</td>
</tr>
<tr>
<td valign="top" align="left">SE-ResNeXt-101</td>
<td valign="top" align="left">PlantCLEF 2017</td>
<td valign="top" align="center"><bold>&#x0002B;1.65</bold></td>
<td valign="top" align="center">&#x0002B;1.50</td>
<td valign="top" align="center">&#x0002B;1.07</td>
</tr>
<tr>
<td valign="top" align="left">SE-ResNeXt-101</td>
<td valign="top" align="left">ExpertLifeCLEF 2018</td>
<td valign="top" align="center"><bold>&#x0002B;3.81</bold></td>
<td valign="top" align="center">&#x0002B;3.28</td>
<td valign="top" align="center">&#x0002B;3.23</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>All results are using the fine-tuned models and Mean Softmax Accuracy for combining predictions belonging to the same observation. Input size 384 &#x000D7; 384.</p>
</table-wrap-foot>
</table-wrap>
<p><bold>Focal loss:</bold> Although it is mostly used in object detection, Focal Loss (Lin et al., <xref ref-type="bibr" rid="B28">2017</xref>) has the potential to focus the training process on more challenging and rare samples and could prevent the vast majority of images from dominating the optimization. As no considerable performance increase was measured for either the ViT or the CNN architectures on either dataset, we do not recommend using Focal Loss for plant recognition.</p>
<p><bold>Impact of the noisy data:</bold> Noisy data, i.e., data without human-verified labels, are commonly used to increase the number of rare species samples and balance the long-tailed class distribution. Even though Krause et al. (<xref ref-type="bibr" rid="B25">2016</xref>) showed the unreasonable effectiveness of noisy labels on small-scale FGVC datasets, their contribution in the &#x0201C;in the wild&#x0201D; scenario is not established. In the case of flora recognition, upsampling the minimum number of samples for each class (up to 10, 20, 30, and 40) did not improve the accuracy on either test set, i.e., the performance difference was statistically insignificant (see <xref ref-type="table" rid="T7">Table 7</xref>).</p>
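<p>To make the mechanism concrete, the multi-class form of the Focal Loss, FL(p<sub>t</sub>) = &#x02212;(1 &#x02212; p<sub>t</sub>)<sup>&#x003B3;</sup> log(p<sub>t</sub>), where p<sub>t</sub> is the predicted probability of the true class, can be sketched as below. The function name <monospace>focal_loss</monospace> is hypothetical, the optional class-balancing weight is omitted, and the value &#x003B3; = 2 is the default suggested by Lin et al. (<xref ref-type="bibr" rid="B28">2017</xref>), not necessarily the value used in our experiments.</p>
<preformat><![CDATA[
```python
import numpy as np


def focal_loss(probs, targets, gamma=2.0):
    """Mean focal loss for softmax outputs.

    probs:   (N, K) predicted class probabilities
    targets: (N,) integer class labels
    With gamma=0 this reduces to the standard cross-entropy loss.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    p_t = probs[np.arange(len(targets)), targets]  # probability of the true class
    # The factor (1 - p_t)^gamma down-weights well-classified (easy) samples.
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))


if __name__ == "__main__":
    easy, hard = [[0.9, 0.1]], [[0.2, 0.8]]
    print(focal_loss(easy, [0]), focal_loss(hard, [0]))
```
]]></preformat>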
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Impact of additional noisy data on classification performance.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>Test 2018 - Acc [%]</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;" colspan="2"><bold>Test 2017 - Acc [%]</bold></th>
</tr>
<tr>
<th/>
</tr>
</thead>
<tbody>
 <tr>
<td valign="top" align="left"><bold>Min. samples</bold></td>
<td valign="top" align="center"><bold>Images</bold></td>
<td valign="top" align="center"><bold>Observations</bold></td>
<td valign="top" align="center"><bold>Images</bold></td>
<td valign="top" align="center"><bold>Observations</bold></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">10</td>
<td valign="top" align="center">&#x0002B;0.17</td>
<td valign="top" align="center">&#x02013;0.58</td>
<td valign="top" align="center">&#x02013;0.20</td>
<td valign="top" align="center">&#x02013;0.49</td>
</tr>
<tr>
<td valign="top" align="left">20</td>
<td valign="top" align="center"><bold>&#x0002B;0.32</bold></td>
<td valign="top" align="center">&#x02013;0.53</td>
<td valign="top" align="center">&#x02013;0.33</td>
<td valign="top" align="center">&#x02013;0.38</td>
</tr>
<tr>
<td valign="top" align="left">30</td>
<td valign="top" align="center">&#x02013;0.13</td>
<td valign="top" align="center">&#x02013;0.24</td>
<td valign="top" align="center">&#x02013;0.44</td>
<td valign="top" align="center">&#x02013;0.66</td>
</tr>
<tr>
<td valign="top" align="left">40</td>
<td valign="top" align="center">&#x02013;0.10</td>
<td valign="top" align="center">&#x02013;1.25</td>
<td valign="top" align="center">&#x02013;0.60</td>
<td valign="top" align="center">&#x02013;0.82</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Baseline</td>
<td valign="top" align="center">49.77</td>
<td valign="top" align="center"><bold>68.24</bold></td>
<td valign="top" align="center"><bold>74.19</bold></td>
<td valign="top" align="center"><bold>81.16</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusion</title>
<p>The article assessed automatic plant identification as a fine-grained classification task on the largest available plant recognition datasets coming from the LifeCLEF and CVPR-FGVC workshops, counting up to 10,000 plant species.</p>
<p><bold>State-of-the-art classifiers:</bold> The comparison of deep neural network classifiers in Section 5.1 shows the improvement in classification accuracy achieved by recent CNN architectures. The state-of-the-art Vision Transformers achieve even higher recognition scores: the best model, ViT-Large/16, achieves recognition scores of 91.15% and 83.54% on the PlantCLEF 2017 and ExpertLifeCLEF 2018 test sets, respectively, before additional post-processing like test-time augmentations and prior shift adaptation.</p>
<p><bold>Prior shift adaptation:</bold> The prior shift in the datasets, i.e., the difference between the training and test data class distribution, is a significant and omnipresent phenomenon. We test existing prior shift adaptation methods and their impact on classification accuracy. The experiments with state-of-the-art methods for prior shift estimation (Saerens et al., <xref ref-type="bibr" rid="B38">2002</xref>; Sipka et al., <xref ref-type="bibr" rid="B39">2022</xref>), evaluated in <xref ref-type="table" rid="T6">Table 6</xref>, show that all three compared methods improve the classification accuracy in all cases. The differences among the three methods are rather small, with EM achieving slightly better results in 3 of 4 cases. Given its optimization speed, the EM algorithm is the preferred choice.</p>
<p><bold>Retrieval approach to fine-grained classification:</bold> Training an image retrieval system and subsequently performing a nearest neighbor classification is a competitive alternative, with better results than direct classification. The prediction obtained <italic>via</italic> a nearest neighbor search is more interpretable, as the samples contributing to the prediction can be visualized. Therefore, a retrieval-based approach is more suitable when humans are in the loop. On the other hand, the softmax predictions of a standard neural network classifier allow for simple post-processing procedures such as averaging and prior shift adaptation, which are yet to be explored for the retrieval approach, and which noticeably improve the final recognition accuracy of the standard classifiers.</p>
<p>Overall, image retrieval has clear advantages, e.g., it recovers relevant labeled nearest-neighbor samples, provides ranked class predictions, and allows users or experts to visually verify the species based on the k-nearest neighbors. Besides, the retrieval approach naturally supports open-set recognition problems, i.e., the ability to extend or modify the set of recognized classes after the training stage. The set of classes may change, e.g., as a result of modifications to biological taxonomy. New classes are introduced simply by adding training images with the new label, whereas in the standard approach, the classification head needs re-training. On the negative side, the retrieval approach requires, on top of running the deep net to extract the embedding, an efficient nearest neighbor search, which increases the overall complexity of the fine-grained recognition system.</p>
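<p>A minimal sketch of nearest neighbor classification in an embedding space illustrates these properties. It is written under assumptions that are not taken from our experiments: cosine similarity as the distance measure, a plain majority vote over the k neighbors, a brute-force search, and the hypothetical function name <monospace>knn_predict</monospace>.</p>
<preformat><![CDATA[
```python
from collections import Counter

import numpy as np


def knn_predict(query, gallery_embeddings, gallery_labels, k=5):
    """Classify one query embedding by majority vote over its k nearest neighbors.

    Returns the ranked (label, votes) list and the indices of the neighbors,
    which can be used to show the contributing gallery images to a user.
    """
    gallery = np.asarray(gallery_embeddings, dtype=float)
    q = np.asarray(query, dtype=float)
    # Cosine similarity via L2-normalised dot products (an assumption).
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    sims = gallery @ q
    nearest = np.argsort(-sims)[:k]  # indices of the k most similar samples
    votes = Counter(gallery_labels[i] for i in nearest)
    return votes.most_common(), nearest


if __name__ == "__main__":
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    labels = ["species_a", "species_a", "species_b"]
    print(knn_predict([1.0, 0.05], embeddings, labels, k=2))

    # A new class is added by appending labeled embeddings; nothing is re-trained.
    embeddings.append([-1.0, 0.0])
    labels.append("species_c")
    print(knn_predict([-1.0, 0.1], embeddings, labels, k=1))
```
]]></preformat>
<p>At the scale of the datasets considered here, the brute-force search would typically be replaced by an approximate nearest neighbor index, which is the source of the additional system complexity noted above.</p>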
<p>Contrary to our expectations, the error analysis in <xref ref-type="fig" rid="F4">Figure 4</xref> shows that the retrieval approach does not bring an improvement in classifying images from classes with few training samples. <xref ref-type="fig" rid="F5">Figure 5</xref> shows that retrieval has a very high accuracy for a higher number of species, but it also fails for a higher number of species.</p>
</sec>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The PlantCLEF datasets used in this study are publicly available in the repository of the <ext-link ext-link-type="uri" xlink:href="http://otmedia.lirmm.fr/LifeCLEF/">LifeCLEF</ext-link> challenge organizers. The test set labels were kindly provided by the challenge organizers (Go&#x000EB;au et al., <xref ref-type="bibr" rid="B12">2018</xref>). The iNaturalist dataset is publicly available at the competition GitHub page. All images used in the article are under a CC-BY licence.</p>
</sec>
<sec id="s8">
<title>Author contributions</title>
<p>LP, M&#x00160;, YP, and JM conceived the study and drafted the manuscript. LP, M&#x00160;, and YP implemented and conducted the machine learning experiments. All authors critically revised, reviewed, and approved the manuscript.</p>
</sec>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>LP was supported by the UWB project No. SGS-2022-017. LP and JM were supported by the Ministry of Environment of the Czech Republic project No. SS05010008. M&#x00160; and JM were supported by Toyota Motor Europe. JM and YP were supported by the Research Center for Informatics (project CZ.02.1.01/0.0/0.0/16_019/0000765 funded by OP VVV). YP was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS20/171/OHK3/3T/13, by Project StratDL in the realm of COMET K1 center Software Competence Center Hagenberg, and by an Amazon Research Award.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack><p>Computational resources were supplied by the project e-Infrastruktura CZ (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Belhumeur</surname> <given-names>P. N.</given-names></name> <name><surname>Chen</surname> <given-names>D.</given-names></name> <name><surname>Feiner</surname> <given-names>S.</given-names></name> <name><surname>Jacobs</surname> <given-names>D. W.</given-names></name> <name><surname>Kress</surname> <given-names>W. J.</given-names></name> <name><surname>Ling</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2008</year>). <article-title>&#x0201C;Searching the world&#x00027;s Herbaria: a system for visual identification of plant species,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2008</source> (<publisher-loc>Berlin; Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>116</fpage>&#x02013;<lpage>129</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-540-88693-8_9</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Hang</surname> <given-names>S. T.</given-names></name> <name><surname>Lasseck</surname> <given-names>M.</given-names></name> <name><surname>&#x00160;ulc</surname> <given-names>M.</given-names></name> <name><surname>Mal&#x000E9;cot</surname> <given-names>V.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>&#x0201C;Plant identification: experts vs. machines in the era of deep learning,&#x0201D;</article-title> in <source>Multimedia Tools and Applications for Environmental &#x00026; Biodiversity Informatics</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>131</fpage>&#x02013;<lpage>149</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-76445-0_8</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buslaev</surname> <given-names>A.</given-names></name> <name><surname>Iglovikov</surname> <given-names>V. I.</given-names></name> <name><surname>Khvedchenya</surname> <given-names>E.</given-names></name> <name><surname>Parinov</surname> <given-names>A.</given-names></name> <name><surname>Druzhinin</surname> <given-names>M.</given-names></name> <name><surname>Kalinin</surname> <given-names>A. A.</given-names></name></person-group> (<year>2020</year>). <article-title>Albumentations: fast and flexible image augmentations</article-title>. <source>Information</source> <volume>11</volume>, <fpage>125</fpage>. <pub-id pub-id-type="doi">10.3390/info11020125</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Caglayan</surname> <given-names>A.</given-names></name> <name><surname>Guclu</surname> <given-names>O.</given-names></name> <name><surname>Can</surname> <given-names>A. B.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;A plant recognition approach using shape and color features in leaf images,&#x0201D;</article-title> in <source>International Conference on Image Analysis and Processing</source> (<publisher-loc>Berlin; Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>161</fpage>&#x02013;<lpage>170</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-41184-7_17</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cui</surname> <given-names>Y.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Sun</surname> <given-names>C.</given-names></name> <name><surname>Howard</surname> <given-names>A.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Large scale fine-grained categorization and domain-specific transfer learning,&#x0201D; <italic>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</italic></article-title> (Salt Lake City, UT). <pub-id pub-id-type="doi">10.1109/CVPR.2018.00432</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dosovitskiy</surname> <given-names>A.</given-names></name> <name><surname>Beyer</surname> <given-names>L.</given-names></name> <name><surname>Kolesnikov</surname> <given-names>A.</given-names></name> <name><surname>Weissenborn</surname> <given-names>D.</given-names></name> <name><surname>Zhai</surname> <given-names>X.</given-names></name> <name><surname>Unterthiner</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;An image is worth 16x16 words: transformers for image recognition at scale,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source> (<publisher-loc>Vienna</publisher-loc>).</citation>
</ref>
<ref id="B7">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Garcin</surname> <given-names>C.</given-names></name> <name><surname>Joly</surname> <given-names>A.</given-names></name> <name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Lombardo</surname> <given-names>J.-C.</given-names></name> <name><surname>Affouard</surname> <given-names>A.</given-names></name> <name><surname>Chouet</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Pl&#x00040;ntNet-300K: a plant image dataset with high label ambiguity and a long-tailed distribution,&#x0201D;</article-title> in <source>NeurIPS 2021-35th Conference on Neural Information Processing Systems</source>, eds J. Vanschoren and S. Yeung. Available online at: <ext-link ext-link-type="uri" xlink:href="https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/7e7757b1e12abcb736ab9a754ffb617a-Paper-round2.pdf">https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/7e7757b1e12abcb736ab9a754ffb617a-Paper-round2.pdf</ext-link></citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gaston</surname> <given-names>K. J.</given-names></name> <name><surname>O&#x00027;Neill</surname> <given-names>M. A.</given-names></name></person-group> (<year>2004</year>). <article-title>Automated species identification: why not?</article-title> <source>Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci</source>. <volume>359</volume>, <fpage>655</fpage>&#x02013;<lpage>667</lpage>. <pub-id pub-id-type="doi">10.1098/rstb.2003.1442</pub-id><pub-id pub-id-type="pmid">15253351</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ghazi</surname> <given-names>M. M.</given-names></name> <name><surname>Yanikoglu</surname> <given-names>B.</given-names></name> <name><surname>Aptoula</surname> <given-names>E.</given-names></name></person-group> (<year>2017</year>). <article-title>Plant identification using deep neural networks via optimization of transfer learning parameters</article-title>. <source>Neurocomputing</source> <volume>235</volume>, <fpage>228</fpage>&#x02013;<lpage>235</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2017.01.018</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Joly</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Plant identification in an open-world (lifeclef 2016),&#x0201D; <italic>in CLEF Working Notes 2016</italic></article-title> (&#x000C9;vora).</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Joly</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Plant identification based on noisy web data: the amazing performance of deep learning (lifeclef 2017),&#x0201D;</article-title> in <source>CEUR Workshop Proceedings</source> (<publisher-loc>Dublin</publisher-loc>).</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Joly</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Overview of expertlifeclef 2018: how far automated identification systems are from the best experts?&#x0201D;</article-title> in <source>CLEF Working Notes 2018</source> (<publisher-loc>Avignon</publisher-loc>).</citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Joly</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Overview of lifeclef plant identification task 2019: diving into data deficient tropical countries,&#x0201D;</article-title> in <source>CLEF 2019-Conference and Labs of the Evaluation Forum</source> (<publisher-loc>Lugano</publisher-loc>: <publisher-name>CEUR</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>13</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Joly</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Overview of lifeclef plant identification task 2020,&#x0201D;</article-title> in <source>CLEF Task Overview 2020, CLEF: Conference and Labs of the Evaluation Forum</source> (<publisher-loc>Thessaloniki</publisher-loc>).</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Joly</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Overview of PlantCLEF 2021: cross-domain plant identification,&#x0201D;</article-title> in <source>Working Notes of CLEF 2021</source> - <italic>Conference and Labs of the Evaluation Forum</italic> (Bucharest).</citation>
</ref>
<ref id="B16">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <source>Deep Learning Book. MIT Press</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://www.deeplearningbook.org">http://www.deeplearningbook.org</ext-link></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). &#x0201C;Deep residual learning for image recognition,&#x0201D; <italic>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</italic> (Las Vegas, NV), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>J.</given-names></name> <name><surname>Shen</surname> <given-names>L.</given-names></name> <name><surname>Sun</surname> <given-names>G.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Squeeze-and-excitation networks,&#x0201D; <italic>in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</italic> (Salt Lake City, UT)</article-title>, <fpage>7132</fpage>&#x02013;<lpage>7141</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00745</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Joly</surname> <given-names>A.</given-names></name> <name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Botella</surname> <given-names>C.</given-names></name> <name><surname>Glotin</surname> <given-names>H.</given-names></name> <name><surname>Bonnet</surname> <given-names>P.</given-names></name> <name><surname>Planqu&#x000E9;</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>&#x0201C;Overview of lifeclef 2018: a large-scale evaluation of species identification and recommendation algorithms in the era of AI,&#x0201D; <italic>in Proceedings of CLEF 2018</italic></article-title> (Cham: Springer International Publishing), <fpage>247</fpage>&#x02013;<lpage>266</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-98932-7_24</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Joly</surname> <given-names>A.</given-names></name> <name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Botella</surname> <given-names>C.</given-names></name> <name><surname>Kahl</surname> <given-names>S.</given-names></name> <name><surname>Servajean</surname> <given-names>M.</given-names></name> <name><surname>Glotin</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;Overview of lifeclef 2019: identification of amazonian plants, south &#x00026; north American birds, and niche prediction,&#x0201D;</article-title> in <source>International Conference of the Cross-Language Evaluation Forum for European Languages</source> (<publisher-loc>Berlin; Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>387</fpage>&#x02013;<lpage>401</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-28577-7_29</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Joly</surname> <given-names>A.</given-names></name> <name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Kahl</surname> <given-names>S.</given-names></name> <name><surname>Deneu</surname> <given-names>B.</given-names></name> <name><surname>Servajean</surname> <given-names>M.</given-names></name> <name><surname>Cole</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Overview of lifeclef 2020: a system-oriented evaluation of automated species identification and species distribution prediction,&#x0201D;</article-title> in <source>International Conference of the Cross-Language Evaluation Forum for European Languages</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>342</fpage>&#x02013;<lpage>363</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-58219-7_23</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Joly</surname> <given-names>A.</given-names></name> <name><surname>Go&#x000EB;au</surname> <given-names>H.</given-names></name> <name><surname>Kahl</surname> <given-names>S.</given-names></name> <name><surname>Picek</surname> <given-names>L.</given-names></name> <name><surname>Lorieul</surname> <given-names>T.</given-names></name> <name><surname>Cole</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Overview of lifeclef 2021: an evaluation of machine-learning based species identification and species distribution prediction,&#x0201D;</article-title> in <source>International Conference of the Cross-Language Evaluation Forum for European Languages</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>371</fpage>&#x02013;<lpage>393</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-85251-1_24</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keaton</surname> <given-names>M. R.</given-names></name> <name><surname>Zaveri</surname> <given-names>R. J.</given-names></name> <name><surname>Kovur</surname> <given-names>M.</given-names></name> <name><surname>Henderson</surname> <given-names>C.</given-names></name> <name><surname>Adjeroh</surname> <given-names>D. A.</given-names></name> <name><surname>Doretto</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Fine-grained visual classification of plant species in the wild: object detection as a reinforced means of attention</article-title>. <source>arXiv preprint arXiv:2106.02141</source>. <pub-id pub-id-type="doi">10.48550/ARXIV.2106.02141</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Khosla</surname> <given-names>P.</given-names></name> <name><surname>Teterwak</surname> <given-names>P.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Sarna</surname> <given-names>A.</given-names></name> <name><surname>Tian</surname> <given-names>Y.</given-names></name> <name><surname>Isola</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;Supervised contrastive learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 33</source>, eds H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Curran Associates, Inc.), <fpage>18661</fpage>&#x02013;<lpage>18673</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://proceedings.neurips.cc/paper/2020/file/d89a66c7c80a29b1bdbab0f2a1a94af8-Paper.pdf">https://proceedings.neurips.cc/paper/2020/file/d89a66c7c80a29b1bdbab0f2a1a94af8-Paper.pdf</ext-link></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Sapp</surname> <given-names>B.</given-names></name> <name><surname>Howard</surname> <given-names>A.</given-names></name> <name><surname>Zhou</surname> <given-names>H.</given-names></name> <name><surname>Toshev</surname> <given-names>A.</given-names></name> <name><surname>Duerig</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;The unreasonable effectiveness of noisy data for fine-grained recognition,&#x0201D;</article-title> in <source>European Conference on Computer Vision</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>301</fpage>&#x02013;<lpage>320</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46487-9_19</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lasseck</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;<italic>Image-based plant species identification with deep convolutional neural networks</italic>,&#x0201D;</article-title> in <source>CLEF</source> (<publisher-loc>Dublin</publisher-loc>).</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>S. H.</given-names></name> <name><surname>Chan</surname> <given-names>C. S.</given-names></name> <name><surname>Remagnino</surname> <given-names>P.</given-names></name></person-group> (<year>2018</year>). <article-title>Multi-organ plant classification based on convolutional and recurrent neural networks</article-title>. <source>IEEE Trans. Image Process</source>. <volume>27</volume>, <fpage>4287</fpage>&#x02013;<lpage>4301</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2018.2836321</pub-id><pub-id pub-id-type="pmid">29870348</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Goyal</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Focal loss for dense object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Venice</publisher-loc>), <fpage>2980</fpage>&#x02013;<lpage>2988</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.324</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loshchilov</surname> <given-names>I.</given-names></name> <name><surname>Hutter</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;<italic>Decoupled weight decay regularization</italic>,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Malik</surname> <given-names>O. A.</given-names></name> <name><surname>Faisal</surname> <given-names>M.</given-names></name> <name><surname>Hussein</surname> <given-names>B. R.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Ensemble deep learning models for fine-grained plant species identification,&#x0201D;</article-title> in <source>2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)</source> (<publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/CSDE53843.2021.9718387</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Munisami</surname> <given-names>T.</given-names></name> <name><surname>Ramsurn</surname> <given-names>M.</given-names></name> <name><surname>Kishnah</surname> <given-names>S.</given-names></name> <name><surname>Pudaruth</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Plant leaf recognition using shape features and colour histogram with k-nearest neighbour classifiers</article-title>. <source>Proc. Comput. Sci</source>. <volume>58</volume>, <fpage>740</fpage>&#x02013;<lpage>747</lpage>. <pub-id pub-id-type="doi">10.1016/j.procs.2015.08.095</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Patel</surname> <given-names>Y.</given-names></name> <name><surname>Tolias</surname> <given-names>G.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Recall&#x00040;k surrogate loss with large batches and similarity mixup,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>New Orleans, LA</publisher-loc>), <fpage>7502</fpage>&#x02013;<lpage>7511</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Picek</surname> <given-names>L.</given-names></name> <name><surname>&#x00160;ulc</surname> <given-names>M.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Recognition of the amazonian flora by inception networks with test-time class prior estimation,&#x0201D;</article-title> in <source>CLEF (Working Notes)</source> (<publisher-loc>Lugano</publisher-loc>).</citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Picek</surname> <given-names>L.</given-names></name> <name><surname>&#x00160;ulc</surname> <given-names>M.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name> <name><surname>Jeppesen</surname> <given-names>T. S.</given-names></name> <name><surname>Heilmann-Clausen</surname> <given-names>J.</given-names></name> <name><surname>L&#x000E6;ss&#x000F8;e</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;Danish fungi 2020 - not just another image recognition dataset,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)</source> (<publisher-loc>Waikoloa</publisher-loc>), <fpage>1525</fpage>&#x02013;<lpage>1535</lpage>. <pub-id pub-id-type="doi">10.1109/WACV51458.2022.00334</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Polyak</surname> <given-names>B. T.</given-names></name> <name><surname>Juditsky</surname> <given-names>A. B.</given-names></name></person-group> (<year>1992</year>). <article-title>Acceleration of stochastic approximation by averaging</article-title>. <source>SIAM J. Control Opt</source>. <volume>30</volume>, <fpage>838</fpage>&#x02013;<lpage>855</lpage>. <pub-id pub-id-type="doi">10.1137/0330046</pub-id><pub-id pub-id-type="pmid">11735906</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Prasad</surname> <given-names>S.</given-names></name> <name><surname>Kudiri</surname> <given-names>K. M.</given-names></name> <name><surname>Tripathi</surname> <given-names>R.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Relative sub-image based features for leaf recognition using support vector machine,&#x0201D;</article-title> in <source>Proceedings of the 2011 International Conference on Communication, Computing &#x00026; Security</source> (<publisher-loc>Rourkela, Odisha</publisher-loc>), <fpage>343</fpage>&#x02013;<lpage>346</lpage>. <pub-id pub-id-type="doi">10.1145/1947940.1948012</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Priya</surname> <given-names>C. A.</given-names></name> <name><surname>Balasaravanan</surname> <given-names>T.</given-names></name> <name><surname>Thanamani</surname> <given-names>A. S.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;An efficient leaf recognition algorithm for plant classification using support vector machine,&#x0201D;</article-title> in <source>International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME-2012)</source> (<publisher-loc>Tamilnadu</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>428</fpage>&#x02013;<lpage>432</lpage>. <pub-id pub-id-type="doi">10.1109/ICPRIME.2012.6208384</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Saerens</surname> <given-names>M.</given-names></name> <name><surname>Latinne</surname> <given-names>P.</given-names></name> <name><surname>Decaestecker</surname> <given-names>C.</given-names></name></person-group> (<year>2002</year>). <article-title>Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure</article-title>. <source>Neural Comput</source>. <volume>14</volume>, <fpage>21</fpage>&#x02013;<lpage>41</lpage>. <pub-id pub-id-type="doi">10.1162/089976602753284446</pub-id><pub-id pub-id-type="pmid">11747533</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sipka</surname> <given-names>T.</given-names></name> <name><surname>&#x00160;ulc</surname> <given-names>M.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;The hitchhiker&#x00027;s guide to prior-shift adaptation,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source> (<publisher-name>IEEE</publisher-name>), <fpage>1516</fpage>&#x02013;<lpage>1524</lpage>. <pub-id pub-id-type="doi">10.1109/WACV51458.2022.00209</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>&#x00160;ulc</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <source>Fine-grained recognition of plants and fungi from images</source> (Ph.D. thesis). <publisher-name>Czech Technical University in Prague</publisher-name>, <publisher-loc>Prague, Czechia</publisher-loc>.</citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>&#x00160;ulc</surname> <given-names>M.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>Fine-grained recognition of plants from images</article-title>. <source>Plant Methods</source> <volume>13</volume>, <fpage>115</fpage>. <pub-id pub-id-type="doi">10.1186/s13007-017-0265-4</pub-id><pub-id pub-id-type="pmid">29299049</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>&#x00160;ulc</surname> <given-names>M.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Improving cnn classifiers by estimating test-time priors,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops</source> (<publisher-loc>Seoul</publisher-loc>). <pub-id pub-id-type="doi">10.1109/ICCVW.2019.00402</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>&#x00160;ulc</surname> <given-names>M.</given-names></name> <name><surname>Picek</surname> <given-names>L.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;<italic>Plant recognition by inception networks with test-time class prior estimation</italic>,&#x0201D;</article-title> in <source>CLEF (Working Notes)</source> (<publisher-loc>Avignon</publisher-loc>).</citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Szegedy</surname> <given-names>C.</given-names></name> <name><surname>Ioffe</surname> <given-names>S.</given-names></name> <name><surname>Vanhoucke</surname> <given-names>V.</given-names></name> <name><surname>Alemi</surname> <given-names>A. A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Inception-v4, inception-resnet and the impact of residual connections on learning,&#x0201D;</article-title> in <source>Thirty-first AAAI Conference on Artificial Intelligence</source> (<publisher-name>AAAI</publisher-name>).</citation>
</ref>
<ref id="B45">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>M.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Efficientnetv2: smaller models and faster training,&#x0201D;</article-title> in <source>Proceedings of the 38th International Conference on Machine Learning</source>, eds M. Meila and T. Zhang (PMLR), <fpage>10096</fpage>&#x02013;<lpage>10106</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://proceedings.mlr.press/v139/tan21a/tan21a.pdf">http://proceedings.mlr.press/v139/tan21a/tan21a.pdf</ext-link></citation>
</ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Touvron</surname> <given-names>H.</given-names></name> <name><surname>Sablayrolles</surname> <given-names>A.</given-names></name> <name><surname>Douze</surname> <given-names>M.</given-names></name> <name><surname>Cord</surname> <given-names>M.</given-names></name> <name><surname>J&#x000E9;gou</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Grafit: learning fine-grained image representations with coarse labels,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal</publisher-loc>), <fpage>874</fpage>&#x02013;<lpage>884</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00091</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Van Horn</surname> <given-names>G.</given-names></name> <name><surname>Mac Aodha</surname> <given-names>O.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Cui</surname> <given-names>Y.</given-names></name> <name><surname>Sun</surname> <given-names>C.</given-names></name> <name><surname>Shepard</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>&#x0201C;The inaturalist species classification and detection dataset,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>8769</fpage>&#x02013;<lpage>8778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00914</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Attention is all you need,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 30</source>, eds I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc.), <fpage>5998</fpage>&#x02013;<lpage>6008</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf</ext-link></citation>
</ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wah</surname> <given-names>C.</given-names></name> <name><surname>Branson</surname> <given-names>S.</given-names></name> <name><surname>Welinder</surname> <given-names>P.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <source>The Caltech-UCSD Birds-200-2011 Dataset</source>. Technical Report CNS-TR-2011-001, California Institute of Technology.</citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>W&#x000E4;ldchen</surname> <given-names>J.</given-names></name> <name><surname>M&#x000E4;der</surname> <given-names>P.</given-names></name></person-group> (<year>2018</year>). <article-title>Machine learning for image based species identification</article-title>. <source>Methods Ecol. Evol</source>. <volume>9</volume>, <fpage>2216</fpage>&#x02013;<lpage>2225</lpage>. <pub-id pub-id-type="doi">10.1111/2041-210X.13075</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Wightman</surname> <given-names>R.</given-names></name></person-group> (<year>2019</year>). <source>PyTorch Image Models</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://github.com/rwightman/pytorch-image-models">https://github.com/rwightman/pytorch-image-models</ext-link></citation>
</ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>D.</given-names></name> <name><surname>Han</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Sun</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Fu</surname> <given-names>H.</given-names></name></person-group> (<year>2019</year>). <article-title>Deep learning with taxonomic loss for plant identification</article-title>. <source>Comput. Intell. Neurosci</source>. <volume>2019</volume>, <fpage>2015017</fpage>. <pub-id pub-id-type="doi">10.1155/2019/2015017</pub-id><pub-id pub-id-type="pmid">31871441</pub-id></citation></ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>Q.</given-names></name> <name><surname>Zhou</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name></person-group> (<year>2006</year>). <article-title>Feature extraction and automatic recognition of plant leaf using artificial neural network</article-title>. <source>Adv. Artif. Intell</source>. <volume>3</volume>, <fpage>5</fpage>&#x02013;<lpage>12</lpage>.</citation>
</ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>S. G.</given-names></name> <name><surname>Bao</surname> <given-names>F. S.</given-names></name> <name><surname>Xu</surname> <given-names>E. Y.</given-names></name> <name><surname>Wang</surname> <given-names>Y.-X.</given-names></name> <name><surname>Chang</surname> <given-names>Y.-F.</given-names></name> <name><surname>Xiang</surname> <given-names>Q.-L.</given-names></name></person-group> (<year>2007</year>). <article-title>&#x0201C;A leaf recognition algorithm for plant classification using probabilistic neural network,&#x0201D;</article-title> in <source>2007 IEEE International Symposium on Signal Processing and Information Technology</source> (<publisher-name>IEEE</publisher-name>), <fpage>11</fpage>&#x02013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1109/ISSPIT.2007.4458016</pub-id></citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xie</surname> <given-names>S.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <name><surname>Tu</surname> <given-names>Z.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Aggregated residual transformations for deep neural networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu</publisher-loc>), <fpage>1492</fpage>&#x02013;<lpage>1500</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.634</pub-id></citation>
</ref>
<ref id="B56">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Zhu</surname> <given-names>Y.</given-names></name> <name><surname>Lin</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>&#x0201C;ResNest: split-attention networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</source> (<publisher-loc>New Orleans, LA</publisher-loc>), <fpage>2736</fpage>&#x02013;<lpage>2746</lpage>.</citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>H.</given-names></name> <name><surname>Fu</surname> <given-names>J.</given-names></name> <name><surname>Zha</surname> <given-names>Z.-J.</given-names></name> <name><surname>Luo</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>5012</fpage>&#x02013;<lpage>5021</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00515</pub-id></citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>We use the term class in its machine learning sense, where classes denote the categories to be recognized, not the taxonomic rank (<italic>classis</italic>); i.e., in this article a class corresponds to a species.</p></fn>
</fn-group>
</back>
</article>