<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2022.880729</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Building One-Shot Semi-Supervised (BOSS) Learning Up to Fully Supervised Performance</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Smith</surname> <given-names>Leslie N.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1346754/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Conovaloff</surname> <given-names>Adam</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>US Naval Research Laboratory</institution>, <addr-line>Washington, DC</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>NRC Postdoctoral Fellow, US Naval Research Laboratory</institution>, <addr-line>Washington, DC</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Dongpo Xu, Northeast Normal University, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Arindam Chaudhuri, Samsung R &#x00026; D Institute, India; Debasmit Das, Qualcomm, United States; Javad Hassannataj Joloudari, University of Birjand, Iran</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Leslie N. Smith <email>leslie.smith&#x00040;nrl.navy.mil</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence</p></fn></author-notes>
<pub-date pub-type="epub">
<day>02</day>
<month>06</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>880729</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>02</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>05</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Smith and Conovaloff.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Smith and Conovaloff</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Reaching the performance of fully supervised learning with unlabeled data and only one labeled sample per class might be ideal for deep learning applications. We demonstrate for the first time the potential for building one-shot semi-supervised (BOSS) learning on CIFAR-10 and SVHN up to test accuracies that are comparable to fully supervised learning. Our method combines class prototype refining, class balancing, and self-training. A good prototype choice is essential, and we propose a technique for obtaining iconic examples. In addition, we demonstrate that class balancing methods substantially improve accuracy in semi-supervised learning, to levels that allow self-training to reach fully supervised performance. Our experiments demonstrate the value of computing and analyzing test accuracies for every class, rather than only a total test accuracy. We show that our BOSS methodology can obtain total test accuracies of up to 95% with CIFAR-10 images and only one labeled sample per class (compared to 94.5% for fully supervised). Similarly, with SVHN images it obtains test accuracies of 97.8%, compared to 98.27% for fully supervised. Rigorous empirical evaluations provide evidence that labeling large datasets is not necessary for training deep neural networks. Our code is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/lnsmith54/BOSS">https://github.com/lnsmith54/BOSS</ext-link> to facilitate replication.</p></abstract>
<kwd-group>
<kwd>one-shot learning</kwd>
<kwd>semi-supervised learning</kwd>
<kwd>image classification</kwd>
<kwd>deep learning</kwd>
<kwd>computer vision</kwd>
</kwd-group>
<counts>
<fig-count count="1"/>
<table-count count="4"/>
<equation-count count="5"/>
<ref-count count="35"/>
<page-count count="9"/>
<word-count count="7546"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>In recent years, deep learning has achieved state-of-the-art performance for computer vision tasks such as image classification. However, a major barrier to the widespread adoption of deep neural networks for new applications is that training state-of-the-art deep networks typically requires thousands to millions of labeled samples to perform at high levels of accuracy and to generalize well.</p>
<p>Unfortunately, manual labeling is labor-intensive and might not be practical if labeling the data requires specialized expertise, such as in medical, defense, and scientific applications. In typical real-world scenarios for deep learning, one often has access to large amounts of unlabeled data but lacks the time or expertise to label the massive numbers of samples needed for training, validation, and testing. An ideal solution might be to achieve performance levels that are equivalent to fully supervised trained networks with only one manually labeled image per class.</p>
<p>In this paper, we investigate the potential for building one-shot semi-supervised (BOSS) learning up to performance comparable to fully supervised training. To date, one-shot semi-supervised learning has been little studied and is viewed as difficult. We build on the recent observation that one-shot semi-supervised learning is plagued by class imbalance problems (Smith and Conovaloff, <xref ref-type="bibr" rid="B22">2020</xref>). In our context, class imbalance refers to a trained network with near 100% accuracy on a subset of classes and poor performance on the other classes. We strongly advocate that practitioners of classification tasks evaluate and analyze test accuracies for every class, rather than only the average accuracy (Smith and Conovaloff, <xref ref-type="bibr" rid="B22">2020</xref>; Fu et al., <xref ref-type="bibr" rid="B8">2022</xref>). Moreover, we are the first to apply data imbalance methods to unlabeled data.</p>
<p>Specifically, we demonstrate that good prototypes are crucial for successful semi-supervised learning and propose a prototype refinement method for the poorly performing classes. Also, we make use of the state of the art in semi-supervised learning methods (i.e., FixMatch, Sohn et al., <xref ref-type="bibr" rid="B24">2020</xref>) in our experiments. To combat class imbalance, we tested several variations of methods found in the literature for data imbalance problems (Johnson and Khoshgoftaar, <xref ref-type="bibr" rid="B13">2019</xref>), which refer to situations where the number of training samples per class varies substantially. We are the first to demonstrate that these methods significantly boost the performance of one-shot semi-supervised learning. Combining these methods with self-training (Rosenberg et al., <xref ref-type="bibr" rid="B20">2005</xref>) makes it possible for CIFAR-10 and SVHN to attain performance comparable to that of fully supervised deep networks trained with 50 K and 73 K labeled training images, respectively.</p>
<p>Our contributions are:</p>
<list list-type="order">
<list-item><p>We rigorously demonstrate for the first time the potential for one-shot semi-supervised learning to reach test accuracies with CIFAR-10 and SVHN that are comparable to fully supervised learning.</p></list-item>
<list-item><p>We propose the concept of class balancing on unlabeled data and investigate its value for one-shot semi-supervised learning. We introduce a novel measure of minority and majority classes and propose four class balancing methods that improve the performance of semi-supervised learning.</p></list-item>
<list-item><p>We investigate the causes of poor performance and hyper-parameter sensitivity. We hypothesize two causes and demonstrate solutions that improve performance.</p></list-item>
</list>
</sec>
<sec id="s2">
<title>2. Related Work</title>
<sec>
<title>2.1. Semi-Supervised Learning</title>
<p>Semi-supervised learning is a hybrid between supervised and unsupervised learning, which combines the benefits of both and is better suited to real-world scenarios where unlabeled data is abundant. As with supervised learning, semi-supervised learning defines a task (e.g., classification) from labeled data, but it typically requires far fewer labeled samples. In addition, semi-supervised learning leverages feature learning from unlabeled data to avoid overfitting the limited labeled samples. Semi-supervised learning is a large and mature field, and there are several surveys and books on semi-supervised learning methods (Zhu, <xref ref-type="bibr" rid="B35">2005</xref>; Chapelle et al., <xref ref-type="bibr" rid="B4">2009</xref>; Zhu and Goldberg, <xref ref-type="bibr" rid="B34">2009</xref>; Van Engelen and Hoos, <xref ref-type="bibr" rid="B27">2020</xref>) for the interested reader. In this section we mention only the most relevant recent methods.</p>
<p>Recently there have been a series of papers on semi-supervised learning from Google Research, including MixMatch (Berthelot et al., <xref ref-type="bibr" rid="B3">2019b</xref>), ReMixMatch (Berthelot et al., <xref ref-type="bibr" rid="B2">2019a</xref>), and FixMatch (Sohn et al., <xref ref-type="bibr" rid="B24">2020</xref>). MixMatch combines consistency regularization with data augmentation (Sajjadi et al., <xref ref-type="bibr" rid="B21">2016</xref>), entropy minimization (i.e., sharpening) (Grandvalet and Bengio, <xref ref-type="bibr" rid="B10">2005</xref>), and mixup (Zhang et al., <xref ref-type="bibr" rid="B33">2017</xref>). ReMixMatch improved on MixMatch by incorporating distribution alignment and augmentation anchors. Augmentation anchors are similar to pseudo-labeling. FixMatch is the most recent and demonstrated state-of-the-art semi-supervised learning performance. In addition, the FixMatch paper has a discussion on one-shot semi-supervised learning with CIFAR-10.</p>
<p>The FixMatch algorithm (Sohn et al., <xref ref-type="bibr" rid="B24">2020</xref>) is primarily a combination of consistency regularization (Sajjadi et al., <xref ref-type="bibr" rid="B21">2016</xref>; Zhai et al., <xref ref-type="bibr" rid="B32">2019</xref>) and pseudo-labeling (Lee, <xref ref-type="bibr" rid="B16">2013</xref>). Consistency regularization utilizes unlabeled data by relying on the assumption that the model should output the same predictions when fed perturbed versions of an image as when fed the original image. Consistency regularization has recently become a popular technique in unsupervised, self-supervised, and semi-supervised learning (Zhai et al., <xref ref-type="bibr" rid="B32">2019</xref>; Van Engelen and Hoos, <xref ref-type="bibr" rid="B27">2020</xref>). Several researchers have observed that strong data augmentation should not be used when inferring pseudo-labels for the unlabeled data but should be employed for consistency regularization (Xie et al., <xref ref-type="bibr" rid="B30">2019</xref>; Sohn et al., <xref ref-type="bibr" rid="B24">2020</xref>). Pseudo-labeling is based on the idea that one can use the model to obtain artificial labels for unlabeled data by retaining pseudo-labels only for samples whose predicted probability is above a predefined threshold.</p>
<p>A recent survey of semi-supervised learning (Van Engelen and Hoos, <xref ref-type="bibr" rid="B27">2020</xref>) provides a taxonomy of classification algorithms. One of the methods in semi-supervised learning is self-training (Rosenberg et al., <xref ref-type="bibr" rid="B20">2005</xref>; Triguero et al., <xref ref-type="bibr" rid="B26">2015</xref>), where a classifier is iteratively trained on labeled data plus high-confidence pseudo-labeled data from previous iterations. In our experiments we found that self-training provided a final boost to make the performance comparable to supervised training with the full labeled training dataset.</p>
<p>Unlike this paper, recent papers on semi-supervised learning, such as SimPLE (Hu et al., <xref ref-type="bibr" rid="B12">2021</xref>) and CoMatch (Li et al., <xref ref-type="bibr" rid="B17">2021</xref>), do not show results for one-shot semi-supervised learning. The SimPLE method uses at least 1,000 labels for CIFAR-10 and SVHN. On the other hand, CoMatch provides experiments on CIFAR-10 with as few as 20 labels, but the reported performance is significantly lower than the performance obtained with the full labeled training dataset. There is one recent paper (Lucas et al., <xref ref-type="bibr" rid="B18">2021</xref>) that reports results for one-shot semi-supervised learning on CIFAR-10 and CIFAR-100. They too compare their results to FixMatch. Unlike our work, the performance they report is much lower than the fully supervised performance.</p>
</sec>
<sec>
<title>2.2. Class Imbalance</title>
<p>Smith and Conovaloff (<xref ref-type="bibr" rid="B22">2020</xref>) demonstrated that in one-shot semi-supervised learning there is large variation in class performance, with some classes achieving near 100% test accuracy while other classes remain near 0%. That is, strong classes starve the weak classes, which is analogous to the class imbalance problem (Johnson and Khoshgoftaar, <xref ref-type="bibr" rid="B13">2019</xref>). This observation suggests an opportunity to improve the overall performance by actively improving the performance of the weak classes.</p>
<p>We borrowed techniques from the literature on training with imbalanced data (Sun et al., <xref ref-type="bibr" rid="B25">2007</xref>; Wang and Yao, <xref ref-type="bibr" rid="B29">2012</xref>; Johnson and Khoshgoftaar, <xref ref-type="bibr" rid="B13">2019</xref>) (i.e., some classes having many more training samples than other classes) to experiment with several methods for improving the performance of the weak classes with unlabeled data. However, with unlabeled data there are no labels to define the ground truth as to which classes are minority and which are majority. In this paper, we propose using the pseudo-labels as a surrogate for the ground truth when counting the examples in each class. Our experiments demonstrate that combining the counting of the pseudo-labels and methods for handling data imbalance substantially improves performance.</p>
<p>Methods for handling class imbalance can be grouped into two categories: data-level and algorithm-level methods. Data-level techniques (Wang and Yao, <xref ref-type="bibr" rid="B29">2012</xref>) reduce the level of imbalance by undersampling the majority classes and oversampling the minority classes. Algorithm-level techniques (Sun et al., <xref ref-type="bibr" rid="B25">2007</xref>) are commonly implemented with smaller loss factor weights for the training samples belonging to the majority classes and larger weights for the training samples belonging to the minority classes. In our experiments we tested variations of both types of methods and a hybrid of the two.</p>
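<p>As a minimal sketch of an algorithm-level technique, the following computes inverse-frequency loss weights, so that samples from majority classes receive smaller weights and samples from minority classes receive larger ones. The function name <monospace>inverse_frequency_weights</monospace> and the toy labels are illustrative assumptions; this is one common weighting scheme and not necessarily the variant tested in our experiments.</p>
<preformat>
```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a per-class loss weight inversely proportional to class frequency.

    Hypothetical helper for illustration: the weights are scaled so that the
    weighted sample count equals the unweighted sample count.
    """
    counts = Counter(labels)
    num_classes = len(counts)
    total = len(labels)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


# Toy imbalanced dataset: 8 samples of one class and 2 of another.
labels = ["cat"] * 8 + ["dog"] * 2
weights = inverse_frequency_weights(labels)
# The majority class gets a weight below 1, the minority class a weight above 1.
```
</preformat>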
</sec>
<sec>
<title>2.3. Meta-Learning</title>
<p>Our scenario superficially bears similarity to few-shot meta learning (Koch et al., <xref ref-type="bibr" rid="B14">2015</xref>; Vinyals et al., <xref ref-type="bibr" rid="B28">2016</xref>; Finn et al., <xref ref-type="bibr" rid="B7">2017</xref>; Snell et al., <xref ref-type="bibr" rid="B23">2017</xref>), which is a highly active area of research. The majority of the work in this area relies on a large labeled dataset with similar data statistics but this can be an onerous requirement for new applications. While there are some recent efforts in unsupervised pre-training for few-shot meta learning (Hsu et al., <xref ref-type="bibr" rid="B11">2018</xref>; Antoniou and Storkey, <xref ref-type="bibr" rid="B1">2019</xref>), our experiments with these methods demonstrated their inability to adequately perform in one-shot learning to bootstrap our process. Specifically, unsupervised one-shot learning with only five classes obtained a test accuracy of about 50% on high confidence samples and the accuracy dropped sharply when increasing the number of classes.</p>
</sec>
</sec>
<sec id="s3">
<title>3. BOSS Methodology</title>
<sec>
<title>3.1. FixMatch</title>
<p>Since we build on FixMatch (Sohn et al., <xref ref-type="bibr" rid="B24">2020</xref>), we briefly describe the algorithm and adopt the formalism used in the original paper. For an <italic>N</italic>-class classification problem, let us define &#x003C7; &#x0003D; {(<italic>x</italic><sub><italic>b</italic></sub>, <italic>y</italic><sub><italic>b</italic></sub>):<italic>b</italic>&#x02208;(1, &#x02026;, <italic>B</italic>)} as a batch of <italic>B</italic> labeled examples, where <italic>x</italic><sub><italic>b</italic></sub> are the training examples and <italic>y</italic><sub><italic>b</italic></sub> are their labels. We also define <inline-formula><mml:math id="M1"><mml:mrow><mml:mi mathvariant='script'>U</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x0007B;</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>b</mml:mi></mml:msub><mml:mo>:</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>...</mml:mn><mml:mo>,</mml:mo><mml:mi>&#x003BC;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:math></inline-formula> as a batch of &#x003BC; unlabeled examples where &#x003BC; &#x0003D; <italic>r</italic><sub><italic>u</italic></sub><italic>B</italic> and <italic>r</italic><sub><italic>u</italic></sub> is a hyperparameter that determines the ratio of <inline-formula><mml:math id="M2"><mml:mrow><mml:mi mathvariant='script'>U</mml:mi></mml:mrow></mml:math></inline-formula> to &#x003C7;. Let <italic>p</italic><sub><italic>m</italic></sub>(<italic>y</italic>|<italic>x</italic>) be the predicted class distribution produced by the model for input <italic>x</italic>. We denote the cross-entropy between two probability distributions <italic>p</italic> and <italic>q</italic> as <italic>H</italic>(<italic>p, q</italic>).</p>
<p>The loss function for FixMatch consists of two terms: a supervised loss <italic>L</italic><sub><italic>s</italic></sub> applied to labeled data and an unsupervised loss <italic>L</italic><sub><italic>u</italic></sub> for the unlabeled data. <italic>L</italic><sub><italic>s</italic></sub> is the cross-entropy loss on weakly augmented labeled examples:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>y</mml:mi><mml:mo>|</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1;(<italic>x</italic><sub><italic>b</italic></sub>) represents weak data augmentation on labeled sample <italic>x</italic><sub><italic>b</italic></sub>.</p>
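<p>Equation (1) can be illustrated with a small, framework-free sketch. It assumes that the model outputs for the weakly augmented labeled batch are already available as lists of class probabilities; the function names and the toy numbers are hypothetical and are not taken from the released code.</p>
<preformat>
```python
import math


def cross_entropy(target_index, predicted):
    """H(y, p) for a one-hot target: minus the log-probability of the true class."""
    return -math.log(predicted[target_index])


def supervised_loss(labels, predictions):
    """Mean cross-entropy over a labeled batch of size B, following Equation (1).

    predictions[b] stands for p_m(y | alpha(x_b)), i.e., the model output on a
    weakly augmented labeled sample (assumed to be computed elsewhere).
    """
    batch_size = len(labels)
    return sum(cross_entropy(y, p) for y, p in zip(labels, predictions)) / batch_size


# Toy batch with B = 2 labeled samples and N = 3 classes.
labels = [0, 2]
predictions = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
loss_s = supervised_loss(labels, predictions)
```
</preformat>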
<p>For the unsupervised loss, the algorithm computes the label based on weakly augmented versions of the image as <italic>q</italic><sub><italic>b</italic></sub> &#x0003D; <italic>p</italic><sub><italic>m</italic></sub>[<italic>y</italic>|&#x003B1;(<italic>u</italic><sub><italic>b</italic></sub>)]. It is essential that the label is computed on weakly augmented versions of the unlabeled training samples and not on strongly augmented versions. The pseudo-label is computed as <inline-formula><mml:math id="M4"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo class="qopname">arg</mml:mo><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and the unlabeled loss is given as:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02265;</mml:mo><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>y</mml:mi><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant='script'>A</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M6"><mml:mrow><mml:mi mathvariant='script'>A</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> represents applying strong augmentation to sample <italic>u</italic><sub><italic>b</italic></sub> and &#x003C4; is a scalar confidence threshold that is used to include only high confidence terms. The total loss is given by <italic>L</italic> &#x0003D; <italic>L</italic><sub><italic>s</italic></sub>&#x0002B;&#x003BB;<sub><italic>u</italic></sub><italic>L</italic><sub><italic>u</italic></sub> where &#x003BB;<sub><italic>u</italic></sub> is a scalar hyper-parameter. Additional details on the FixMatch algorithm are available in the original paper (Sohn et al., <xref ref-type="bibr" rid="B24">2020</xref>).</p>
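<p>The unlabeled loss of Equation (2) and the total loss can be sketched in the same style. This is a minimal illustration under stated assumptions: the weak and strong predictions are given as plain probability lists, and the threshold of 0.95 and the toy values are chosen only for the example.</p>
<preformat>
```python
import math


def unsupervised_loss(weak_predictions, strong_predictions, tau):
    """Confidence-masked cross-entropy over mu unlabeled samples, as in Equation (2).

    weak_predictions[b] stands for q_b = p_m(y | alpha(u_b)) and
    strong_predictions[b] for p_m(y | A(u_b)); both are assumed precomputed.
    """
    mu = len(weak_predictions)
    total = 0.0
    for q_b, p_strong in zip(weak_predictions, strong_predictions):
        confidence = max(q_b)
        if confidence >= tau:  # indicator 1(max(q_b) >= tau)
            pseudo_label = q_b.index(confidence)  # arg max of q_b
            total += -math.log(p_strong[pseudo_label])
    return total / mu


def total_loss(loss_s, loss_u, lambda_u):
    """L = L_s + lambda_u * L_u."""
    return loss_s + lambda_u * loss_u


# Two unlabeled samples: only the first one passes the confidence threshold.
weak = [[0.97, 0.02, 0.01], [0.50, 0.30, 0.20]]
strong = [[0.60, 0.30, 0.10], [0.20, 0.50, 0.30]]
loss_u = unsupervised_loss(weak, strong, tau=0.95)
```
</preformat>
<p>Note that the sum is still divided by the full batch size, so samples below the threshold contribute zero rather than being dropped from the average.</p>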
</sec>
<sec>
<title>3.2. Prototype Refining</title>
<p>Previous work by Sohn et al. on one-shot semi-supervised learning relied on the dataset labels to randomly choose an example for each class. The authors demonstrated that the choice of these samples significantly affected the performance of their algorithm. Specifically, they ordered the CIFAR-10 training data by how representative they were of their class by utilizing fully supervised trained models and found that using more prototypical examples achieved a median accuracy of 78% while the use of poorly representative samples failed to converge at all. The authors acknowledged that their method for finding prototypes was not practical. In contrast, we now present a practical approach for choosing an iconic prototype for each class.</p>
<p>In real-world scenarios, one&#x00027;s data is initially all unlabeled but it is not overly burdensome for an expert to manually sift through some of their dataset to find one iconic example of each class. In choosing iconic images of each class, the labeler&#x00027;s goal is to pick images that represent the class objects well, while minimizing the amount of background distractors in the image. While the labeler is choosing the most iconic examples to be class prototypes for one-shot training of the network, it is beneficial to designate the less representative examples as part of a validation or test dataset. In our own experiments with labeled datasets CIFAR-10 and SVHN, we did not rely on the training labels but reviewed a small fraction of the training data to manually choose class prototypes.</p>
<p>In addition, we propose a simple iterative technique for improving the choice of prototypes, because good prototypes are important to good performance. After choosing prototypes, the next step is to make a training run and examine the class accuracies. For any class with poor accuracy relative to the other classes, it is likely that a better prototype can be chosen. We recommend returning to the unlabeled or test datasets to find replacement prototypes for only the poorly performing classes. In our experiments we found doing this even once to be beneficial.</p>
<p>One might argue that prototype refining is as much work as labeling several examples per class and that using more training samples will make it easier to train the model. From a purely practical perspective, labeling 5 or 10 examples per class is not substantially more effort than labeling only one iconic example per class and refining the prototypes. While in practice one may want to start with more than one example for ease of training, there are scientific, educational, and algorithmic benefits to studying one-shot semi-supervised learning, which we discuss in our <xref ref-type="supplementary-material" rid="SM1">Appendix</xref>. Also, non-representative examples can be included in a labeled test or validation dataset for use in evaluating the quality of the training.</p>
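<p>The step of examining class accuracies after a training run can be sketched as follows. The sketch assumes a small labeled validation or test set is available, and the selection rule (flagging classes whose accuracy falls more than a fixed margin below the mean) is a hypothetical choice made only for illustration, since in practice the poorly performing classes are identified by inspection.</p>
<preformat>
```python
def per_class_accuracy(true_labels, predicted_labels, num_classes):
    """Return the accuracy of each class on a labeled evaluation set."""
    correct = [0] * num_classes
    seen = [0] * num_classes
    for t, p in zip(true_labels, predicted_labels):
        seen[t] += 1
        if t == p:
            correct[t] += 1
    # Classes with no evaluation samples are reported as 0.0.
    return [c / s if s else 0.0 for c, s in zip(correct, seen)]


def classes_to_refine(accuracies, margin=0.2):
    """Flag classes whose accuracy is more than `margin` below the mean.

    The margin of 0.2 is an assumed, illustrative value.
    """
    mean_accuracy = sum(accuracies) / len(accuracies)
    return [n for n, acc in enumerate(accuracies) if mean_accuracy - acc > margin]


# Toy evaluation set with three classes; class 2 is mostly misclassified.
true_labels = [0, 0, 1, 1, 2, 2, 2, 2]
predicted_labels = [0, 0, 1, 1, 0, 1, 2, 0]
accuracies = per_class_accuracy(true_labels, predicted_labels, num_classes=3)
weak_classes = classes_to_refine(accuracies)
```
</preformat>
<p>Only the flagged classes would then receive a replacement prototype before the next training run.</p>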
</sec>
<sec>
<title>3.3. Class Balancing</title>
<p>We believe class imbalance is an important factor in training neural networks, not only in one-shot semi-supervised learning but also for small to mid-sized datasets. A network with random weights typically outputs a single class label for every sample (i.e., randomly initialized networks do not generate random predictions). Hence, all networks start their training with elements of the class imbalance problem, but the presence of large, balanced training data allows the network to overcome this problem. Since class imbalance is always present when training deep networks, class balancing methods might always be valuable, particularly when training on one-shot, few-shot, or small labeled datasets; we leave further investigation of this for future work.</p>
<p>Unlike the data imbalance domain, the ground truth imbalance proportions are unknown with unlabeled datasets. Our innovation here is to use the model generated pseudo-labels as a surrogate for class counting and estimating class imbalance ratios (i.e., determining majority and minority classes). Specifically, as the algorithm computes the pseudo-labels for all of the unlabeled training samples, it counts the number that fall within each class, which we designate as <inline-formula><mml:math id="M7"><mml:mrow><mml:mi mathvariant='script'>C</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>:</mml:mo><mml:mi>n</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:mi>N</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> where <italic>N</italic> is the number of classes. We assume a similar number of unlabeled samples in each class so the number of pseudo-labels in each class should also be similar.</p>
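<p>The pseudo-label counting described above can be sketched as follows. The sketch assumes that the model predictions on the weakly augmented unlabeled samples are available as probability lists; the function name and the toy predictions are hypothetical, and the relative count is shown only as one simple way to compare classes.</p>
<preformat>
```python
def count_pseudo_labels(weak_predictions, num_classes):
    """Count how many unlabeled samples receive each class as pseudo-label.

    Returns the list C = [c_1, ..., c_N], stored with zero-based class indices.
    """
    counts = [0] * num_classes
    for q_b in weak_predictions:
        pseudo_label = q_b.index(max(q_b))  # arg max of q_b
        counts[pseudo_label] += 1
    return counts


# Five unlabeled samples and N = 3 classes; the model favors class 0.
weak_predictions = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
]
counts = count_pseudo_labels(weak_predictions, num_classes=3)

# Relative counts: 1.0 marks the current majority class, and smaller
# values mark classes that are under-represented among the pseudo-labels.
relative_counts = [c / max(counts) for c in counts]
```
</preformat>
<p>Because the true class proportions are assumed to be similar, a class with a low relative count is treated as a minority class even though no ground truth labels are consulted.</p>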
<p>Our first class balancing method is based on oversampling minority classes. Our algorithm reduces the pseudo-labeling thresholds for minority classes to include more examples of the minority classes in the training. Formally, in pseudo-labeling the following unsupervised loss function is used for the unlabeled data in place of Equation (2):</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02265;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>y</mml:mi><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant='script'>A</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M10"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo class="qopname">arg</mml:mo><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and &#x003C4;<sub><italic>n</italic></sub> is the class dependent threshold for inclusion in the unlabeled loss <italic>L</italic><sub><italic>u</italic></sub>. We define the class dependent thresholds as:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>-</mml:mo><mml:mi>&#x00394;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant='script'>C</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>c</italic><sub><italic>n</italic></sub> is the number of pseudo-labeled samples in class <inline-formula><mml:math id="M12"><mml:mi>n</mml:mi><mml:mo>,</mml:mo><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant='script'>C</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is the maximum count over all the classes, and &#x00394; is a scalar hyper-parameter (&#x003C4;&#x0003E;&#x00394;&#x0003E;0) that controls how much the threshold is lowered for minority classes. Hence, the most frequent class uses a threshold of &#x003C4;, while minority classes use lower thresholds, down to &#x003C4;&#x02212;&#x00394;.</p>
<p>The next two class balancing methods are variations on loss function class weightings. In the FixMatch algorithm, all unlabeled samples above the threshold are included in Equation (3) with the same weight. Our second class balancing algorithm instead divides each sample&#x00027;s loss term by its class count:</p>
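<p>Equations (3) and (4) can be illustrated with a minimal Python sketch. It is not taken from the released BOSS code: the function names, the example counts, and the values &#x003C4; = 0.95 and &#x00394; = 0.25 are assumptions chosen only for illustration, as is the fallback used when no pseudo-labels exist yet.</p>
<preformat>
```python
def class_thresholds(counts, tau, delta):
    """Eq. (4): tau_n = tau - delta * (1 - c_n / max(C))."""
    if not tau > delta > 0:
        raise ValueError("Eq. (4) assumes tau > delta > 0")
    largest = max(counts)
    if largest == 0:
        # Assumption: with no pseudo-labels yet, use the base threshold.
        return [tau for _ in counts]
    return [tau - delta * (1.0 - c / largest) for c in counts]


def confident_mask(probs, thresholds):
    """Indicator of Eq. (3): keep sample b when max(q_b) >= tau_n,
    where n = argmax(q_b) is the pseudo-label of the sample."""
    mask = []
    for q in probs:
        n = max(range(len(q)), key=q.__getitem__)
        mask.append(q[n] >= thresholds[n])
    return mask


# Hypothetical pseudo-label counts for three classes.
counts = [100, 50, 0]
thresholds = class_thresholds(counts, tau=0.95, delta=0.25)

# Hypothetical model predictions q_b for three unlabeled samples.
probs = [[0.90, 0.05, 0.05],
         [0.10, 0.85, 0.05],
         [0.20, 0.05, 0.75]]
mask = confident_mask(probs, thresholds)
```
</preformat>
<p>With these illustrative numbers the thresholds are approximately 0.95, 0.825, and 0.70, so the first sample (predicted as the majority class with confidence 0.90) is excluded, while the two minority-class samples are included despite their lower confidence.</p>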
<disp-formula id="E5"><label>(5)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>Z</mml:mi><mml:mi>&#x003BC;</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02265;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where the loss terms are divided by <italic>c</italic><sub><italic>n</italic></sub> and <italic>Z</italic> is a normalizing factor that makes <italic>L</italic><sub><italic>u</italic></sub> the same magnitude as without this weighting scheme (this allows the unlabeled loss weighting &#x003BB;<sub><italic>u</italic></sub> to remain the same).</p>
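<p>The class-weighted loss of Equation (5) can be sketched as follows. The text does not give a closed form for <italic>Z</italic>, so the sketch assumes one plausible choice: <italic>Z</italic> is the mean raw weight 1/<italic>c</italic><sub><italic>n</italic></sub> over the samples that pass the threshold, so that the rescaled weights average to one. The function and variable names are hypothetical and the inputs are toy values.</p>
<preformat>
```python
def class_counts(labels, num_classes, mask=None):
    """Count pseudo-labels per class; if a mask is given, count only the
    samples that passed their threshold."""
    counts = [0] * num_classes
    for i, n in enumerate(labels):
        if mask is None or mask[i]:
            counts[n] += 1
    return counts


def weighted_unlabeled_loss(losses, labels, mask, counts):
    """Sketch of Eq. (5): (1 / (Z * mu)) * sum_b mask_b * H_b / c_n.

    losses: per-sample cross-entropy values H(q_hat_b, q_b)
    labels: pseudo-label n = argmax(q_b) of each sample
    mask:   indicator from Eq. (3), True when the sample is included
    counts: class counts c_n
    """
    mu = len(losses)
    raw = [1.0 / counts[n] if keep else 0.0 for n, keep in zip(labels, mask)]
    kept = sum(1 for keep in mask if keep)
    if kept == 0:
        return 0.0
    # Assumed normaliser: Z is the mean raw weight of the included samples.
    z = sum(raw) / kept
    return sum(w * h for w, h in zip(raw, losses)) / (z * mu)


# Toy batch: three samples pseudo-labeled as class 0, one as class 1.
labels = [0, 0, 0, 1]
mask = [True, True, True, True]
counts = class_counts(labels, num_classes=2)
loss = weighted_unlabeled_loss([0.0, 0.0, 0.0, 2.0], labels, mask, counts)
```
</preformat>
<p>In this toy batch the unweighted mean loss would be 0.5, whereas the weighted loss is 1.0, because the single minority-class sample carries as much total weight as the three majority-class samples combined. When every sample has the same loss, the weighted and unweighted values coincide under this choice of <italic>Z</italic>.</p>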
<p>Our third class balancing algorithm is identical to the previous method except it uses an alternate class count &#x00109;<sub><italic>u</italic></sub> in Equation (5). Here we define &#x00109;<sub><italic>u</italic></sub> using only the high confidence pseudo-labeled samples (i.e., samples that are above the threshold). The intuition of this third method is that each of the classes should contribute equally to the loss <italic>L</italic><sub><italic>u</italic></sub> (i.e., each sample&#x00027;s loss is divided by the number of samples of that class included in <italic>L</italic><sub><italic>u</italic></sub>). In practice, this method&#x00027;s weights might be an order of magnitude larger than the previous method&#x00027;s weights, which might contribute to training instability, so we compare both methods in Section 4.2.</p>
<p>Our fourth class balancing algorithm is a hybrid of the data and algorithmic methods. Specifically, it is a combination of our class balancing methods 1 and 3. Our experiments with this hybrid method demonstrate the benefits of combining the class balancing methods.</p>
</sec>
<sec>
<title>3.4. Self-Training Iterations</title>
<p>Labeled and unlabeled data play different roles in semi-supervised learning. Here we propose self-training iterations where the pseudo-labels of the highest confidence unlabeled training samples are combined with labeled samples in a new iteration. Increasing the number of labeled samples per class improves performance, and substantially reduces training instability and performance variability. Although some of these pseudo-labels might be wrong, we rely on the observation that the training of deep networks is robust to small amounts of labeling noise. Hence, we aimed to achieve 90% accuracy from the first iteration of semi-supervised learning with the class balancing methods.</p>
<p>Self-training in BOSS adds to the testing stage a computation of the model predictions on all of the unlabeled training data. These are sorted from the highest prediction probabilities down and the dataset is saved. After the original training run, the labeled data can be combined with a number of the highest prediction samples from each class and a subsequent self-training iteration run can use the larger labeled dataset for retraining a new network. We experimented with labeling 5, 10, 20, and 40 of the top predictions per class and the results are reported in Section 4.3.</p>
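<p>The selection step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the released implementation: the function names are hypothetical, samples are identified by their index in the unlabeled set, and the predictions are toy values for a two-class problem.</p>
<preformat>
```python
def select_top_k_per_class(probs, k):
    """Return {class: [sample indices]} with the k most confident
    pseudo-labeled samples of each class, highest confidence first."""
    by_class = {}
    for idx, q in enumerate(probs):
        n = max(range(len(q)), key=q.__getitem__)
        by_class.setdefault(n, []).append((q[n], idx))
    selected = {}
    for n, scored in by_class.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected[n] = [idx for _, idx in scored[:k]]
    return selected


def extend_labeled_set(prototypes, selected):
    """Append the selected pseudo-labels to the labeled prototypes.
    prototypes is a list of (sample id, class) pairs; the result has the
    same form."""
    extra = [(idx, n) for n, idxs in sorted(selected.items()) for idx in idxs]
    return list(prototypes) + extra


# Toy predictions on five unlabeled samples (two classes).
probs = [[0.90, 0.10],
         [0.60, 0.40],
         [0.80, 0.20],
         [0.30, 0.70],
         [0.05, 0.95]]
selected = select_top_k_per_class(probs, k=2)

# Hypothetical ids for the two original prototypes.
labeled = extend_labeled_set([("proto-0", 0), ("proto-1", 1)], selected)
```
</preformat>
<p>Here the two most confident samples are kept for each class, and the resulting list of labeled samples would be used to retrain a new network in the next self-training iteration.</p>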
</sec>
</sec>
<sec id="s4">
<title>4. Experiments</title>
<p>In this section, we demonstrate that the BOSS algorithms can achieve performance comparable to fully-supervised training of CIFAR-10<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> (Krizhevsky and Hinton, <xref ref-type="bibr" rid="B15">2009</xref>) and SVHN<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref> (Netzer et al., <xref ref-type="bibr" rid="B19">2011</xref>). We compare our results to FixMatch<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref> (Sohn et al., <xref ref-type="bibr" rid="B24">2020</xref>) and demonstrate the value of our approach. Our experiments use a Wide ResNet-28-2 (Zagoruyko and Komodakis, <xref ref-type="bibr" rid="B31">2016</xref>) that matches the FixMatch reported results and we used the same cosine learning rate schedule described by Sohn et al. (<xref ref-type="bibr" rid="B24">2020</xref>). We repeated our experiments with a ShakeNet model (Gastaldi, <xref ref-type="bibr" rid="B9">2017</xref>) and obtained similar results that led to the same insights and conclusions. Our hyper-parameters were in a small range and the specifics are provided in the <xref ref-type="supplementary-material" rid="SM1">Appendix</xref>. For data and data augmentation, we used the default augmentation in FixMatch but additional experiments (not shown) did show that using RandAugment (Cubuk et al., <xref ref-type="bibr" rid="B5">2019</xref>) for strong data augmentation provides a slight improvement. Our runs with fully supervised learning of the Wide ResNet-28-2 model produced a test accuracy of 94.9&#x000B1;0.3% for CIFAR-10 (Krizhevsky and Hinton, <xref ref-type="bibr" rid="B15">2009</xref>) and a test accuracy of 98.26&#x000B1;0.04% for SVHN (Netzer et al., <xref ref-type="bibr" rid="B19">2011</xref>), which we use as our basis of comparison.
Our code is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/lnsmith54/BOSS">https://github.com/lnsmith54/BOSS</ext-link> to facilitate replication and for use with future real-world applications.</p>
<sec>
<title>4.1. Choosing Prototypes and Prototype Refining</title>
<p>For our experiments with CIFAR-10, we manually reviewed the first few hundred images and chose five sets of prototypes that we will refer to as class prototype sets 1&#x02013;5. However, the practitioner need only create one set of class prototypes and can perform prototype refining, as we describe below.</p>
<p><xref ref-type="table" rid="T1">Table 1</xref> presents the averaged (over two runs) test accuracies for each class, computed from FixMatch on the CIFAR-10 test dataset for each of the prototype sets 1&#x02013;5. This table illustrates that a good choice of prototypes (i.e., set 3) can lead to good performance in most of the classes, which enables a good overall performance. <xref ref-type="table" rid="T1">Table 1</xref> also shows that for other sets the class accuracies can be quite high for some classes while low for other classes. Hence, the poor performance of some classes implies that the choice of prototypes for these classes in those sets can be improved. In prototype refining, one simply reviews the class accuracies to find which prototypes should be replaced.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Class accuracies.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="center"><bold>Set</bold></th>
<th valign="top" align="center"><bold>Airplane</bold></th>
<th valign="top" align="center"><bold>Auto</bold></th>
<th valign="top" align="center"><bold>Bird</bold></th>
<th valign="top" align="center"><bold>Cat</bold></th>
<th valign="top" align="center"><bold>Deer</bold></th>
<th valign="top" align="center"><bold>Dog</bold></th>
<th valign="top" align="center"><bold>Frog</bold></th>
<th valign="top" align="center"><bold>Horse</bold></th>
<th valign="top" align="center"><bold>Ship</bold></th>
<th valign="top" align="center"><bold>Truck</bold></th>
<th valign="top" align="center"><bold>Mean</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">1</td>
<td valign="top" align="center">29</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">71</td>
<td valign="top" align="center">89</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">16</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">79</td>
</tr>
<tr>
<td valign="top" align="center">2</td>
<td valign="top" align="center">28</td>
<td valign="top" align="center">99</td>
<td valign="top" align="center">70</td>
<td valign="top" align="center">43</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">89</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">72</td>
</tr>
<tr>
<td valign="top" align="center">3</td>
<td valign="top" align="center">96</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">63</td>
<td valign="top" align="center">20</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">96</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">87</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">86</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">29</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">65</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">96</td>
<td valign="top" align="center">32</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">96</td>
<td valign="top" align="center">72</td>
</tr>
<tr>
<td valign="top" align="center">5</td>
<td valign="top" align="center">28</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">70</td>
<td valign="top" align="center">46</td>
<td valign="top" align="center">96</td>
<td valign="top" align="center">48</td>
<td valign="top" align="center">53</td>
<td valign="top" align="center">76</td>
<td valign="top" align="center">96</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">72</td>
</tr>
<tr>
<td valign="top" align="center">6</td>
<td valign="top" align="center">80</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">71</td>
<td valign="top" align="center">52</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">92</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">87</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">97</td>
<td valign="top" align="center">82</td>
</tr>
<tr>
<td valign="top" align="center">7</td>
<td valign="top" align="center">28</td>
<td valign="top" align="center">99</td>
<td valign="top" align="center">75</td>
<td valign="top" align="center">54</td>
<td valign="top" align="center">95</td>
<td valign="top" align="center">86</td>
<td valign="top" align="center">95</td>
<td valign="top" align="center">86</td>
<td valign="top" align="center">96</td>
<td valign="top" align="center">94</td>
<td valign="top" align="center">83</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>One-shot semi-supervised average (of 2 runs) class accuracies for CIFAR-10 test data with the FixMatch model, that was trained on sets of manually chosen prototypes for each class. Prototype set 6 was modified from set 2 and prototype set 7 was modified from set 4 (i.e., prototype refining)</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>We demonstrate prototype refining with two examples. The airplane and truck class accuracies in set 2 are poor so we replaced these two prototypes and name this set 6. In set 4, the cat and dog classes are performing poorly so we replaced these two prototypes and name this set 7. <xref ref-type="table" rid="T1">Table 1</xref> shows the class accuracies for sets 6 and 7 and these results are better than the original sets; that is, prototype refining of these two sets raised the overall test accuracies from 72 up to 82&#x02013;83%.</p>
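<p>The review step in prototype refining can be supported by a small helper that ranks the classes by accuracy. The sketch below is only an illustration: the helper name and the choice to flag the two weakest classes are assumptions, and the accuracies are the set 2 values from <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<preformat>
```python
CLASSES = ["airplane", "auto", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

# Per-class test accuracies (%) for prototype set 2, copied from Table 1.
SET_2 = [28, 99, 70, 43, 97, 89, 98, 97, 98, 0]


def weakest_classes(accuracies, names, how_many):
    """Return the names of the how_many classes with the lowest accuracy,
    weakest first."""
    ranked = sorted(zip(accuracies, names))
    return [name for _, name in ranked[:how_many]]


candidates = weakest_classes(SET_2, CLASSES, how_many=2)
```
</preformat>
<p>For set 2 this flags the truck and airplane classes, the two prototypes that were replaced to form set 6. The final decision on which prototypes to replace remains a manual judgment.</p>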
</sec>
<sec>
<title>4.2. Class Balancing</title>
<p>In this section, we report the results from FixMatch and demonstrate substantial improvements with the class balancing methods in BOSS. <xref ref-type="table" rid="T2">Table 2</xref> presents our main results for CIFAR-10, which illustrate the benefits from prototype refining, class balancing, and one self-training iteration. The first five rows in the table list the results for the five sets of class prototypes (i.e., 1 prototype per class) for CIFAR-10. Rows for sets 6 and 7 provide the results for prototype refining of the original sets 2 and 4, respectively. The FixMatch column shows results (i.e., average and standard deviation over four runs) for the original FixMatch code on the prototype sets.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Main results.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>BOSS balance method</bold></th>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>Self-training</bold></th>
</tr>
<tr>
<th valign="top" align="center"><bold>Set</bold></th>
<th valign="top" align="center"><bold>FixMatch</bold></th>
<th valign="top" align="center"><bold>1</bold></th>
<th valign="top" align="center"><bold>2</bold></th>
<th valign="top" align="center"><bold>3</bold></th>
<th valign="top" align="center"><bold>4</bold></th>
<th valign="top" align="center"><bold>&#x0002B;5</bold></th>
<th valign="top" align="center"><bold>&#x0002B;10</bold></th>
<th valign="top" align="center"><bold>&#x0002B;20</bold></th>
<th valign="top" align="center"><bold>&#x0002B;40</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">1</td>
<td valign="top" align="center">79 &#x000B1; 1</td>
<td valign="top" align="center"><bold>91.4 &#x000B1; 2</bold></td>
<td valign="top" align="center">90 &#x000B1; 5</td>
<td valign="top" align="center">84 &#x000B1; 6</td>
<td valign="top" align="center">88 &#x000B1; 2</td>
<td valign="top" align="center">94.8 &#x000B1; 0.1</td>
<td valign="top" align="center">95.2 &#x000B1; 0.1</td>
<td valign="top" align="center">95.2 &#x000B1; 0.1</td>
<td valign="top" align="center">95.2 &#x000B1; 0.1</td>
</tr>
<tr>
<td valign="top" align="center">2</td>
<td valign="top" align="center">74 &#x000B1; 5</td>
<td valign="top" align="center"><bold>91.8 &#x000B1; 1</bold></td>
<td valign="top" align="center">90 &#x000B1; 3</td>
<td valign="top" align="center">88 &#x000B1; 2</td>
<td valign="top" align="center">80 &#x000B1; 14</td>
<td valign="top" align="center">93.6 &#x000B1; 0.2</td>
<td valign="top" align="center">95.1 &#x000B1; 0.1</td>
<td valign="top" align="center">95.1 &#x000B1; 0.3</td>
<td valign="top" align="center">95.1 &#x000B1; 0.2</td>
</tr>
<tr>
<td valign="top" align="center">3</td>
<td valign="top" align="center">86 &#x000B1; 1</td>
<td valign="top" align="center">92.8 &#x000B1; 0.2</td>
<td valign="top" align="center">91 &#x000B1; 2</td>
<td valign="top" align="center">91 &#x000B1; 3</td>
<td valign="top" align="center"><bold>92.8 &#x000B1; 0.1</bold></td>
<td valign="top" align="center">94.6 &#x000B1; 0.5</td>
<td valign="top" align="center">94.8 &#x000B1; 0.5</td>
<td valign="top" align="center">94.9 &#x000B1; 0.1</td>
<td valign="top" align="center">95.2 &#x000B1; 0.1</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">74 &#x000B1; 8</td>
<td valign="top" align="center">77.7 &#x000B1; 0.3</td>
<td valign="top" align="center">81 &#x000B1; 6</td>
<td valign="top" align="center">81 &#x000B1; 8</td>
<td valign="top" align="center"><bold>90 &#x000B1; 7</bold></td>
<td valign="top" align="center">94.9 &#x000B1; 0.1</td>
<td valign="top" align="center">94.9 &#x000B1; 0.4</td>
<td valign="top" align="center">94.9 &#x000B1; 0.5</td>
<td valign="top" align="center">95.1 &#x000B1; 0.3</td>
</tr>
<tr>
<td valign="top" align="center">5</td>
<td valign="top" align="center">69 &#x000B1; 7</td>
<td valign="top" align="center">86 &#x000B1; 7</td>
<td valign="top" align="center">89 &#x000B1; 6</td>
<td valign="top" align="center">83 &#x000B1; 10</td>
<td valign="top" align="center"><bold>90 &#x000B1; 3</bold></td>
<td valign="top" align="center">89.6 &#x000B1; 0.3</td>
<td valign="top" align="center">95.2 &#x000B1; 0.1</td>
<td valign="top" align="center">95.2 &#x000B1; 0.2</td>
<td valign="top" align="center">95.2 &#x000B1; 0.1</td>
</tr>
<tr>
<td valign="top" align="center">6</td>
<td valign="top" align="center">82 &#x000B1; 0.6</td>
<td valign="top" align="center">91.5 &#x000B1; 1</td>
<td valign="top" align="center">92 &#x000B1; 0.7</td>
<td valign="top" align="center">91.8 &#x000B1; 1</td>
<td valign="top" align="center"><bold>92 &#x000B1; 1</bold></td>
<td valign="top" align="center">94.6 &#x000B1; 0.1</td>
<td valign="top" align="center">95.1 &#x000B1; 0.2</td>
<td valign="top" align="center">94.7 &#x000B1; 0.1</td>
<td valign="top" align="center">94.9 &#x000B1; 0.1</td>
</tr>
<tr>
<td valign="top" align="center">7</td>
<td valign="top" align="center">78 &#x000B1; 0.1</td>
<td valign="top" align="center">91.7 &#x000B1; 0.3</td>
<td valign="top" align="center">92.3 &#x000B1; 0.8</td>
<td valign="top" align="center">91.1 &#x000B1; 2.5</td>
<td valign="top" align="center"><bold>93 &#x000B1; 0.3</bold></td>
<td valign="top" align="center">94.9 &#x000B1; 0.1</td>
<td valign="top" align="center">94.7 &#x000B1; 0.2</td>
<td valign="top" align="center">94.9 &#x000B1; 0.1</td>
<td valign="top" align="center">95.1 &#x000B1; 0.1</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>BOSS methods are compared using five sets of class prototypes (i.e., 1 prototype per class) for CIFAR-10, plus two sets from prototype refining. The FixMatch column shows test accuracies (average and standard deviation of 4 runs) for the original FixMatch code on the prototype sets. The next four columns give the accuracy results for the class balance methods (see text for a description of class balance methods). Results for the PyTorch reimplementation of FixMatch and modified with the BOSS methods are shown in brackets [.]. The self-training iteration was performed with the top pseudo-labels from the run shown in bold and the results are in the next four columns</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>The next four columns present the BOSS results with class balancing methods. As described in Section 3.3, class balance method 1 represents oversampling of minority classes, balance methods 2 and 3 are two forms of class-based loss weightings, and balance method 4 is a hybrid that combines balance methods 1 and 3. The use of class balancing significantly improves on the original FixMatch results, with increases of up to 20 absolute percentage points. Generally, the hybrid class balance method 4 is best, except when instabilities hurt the performance. The performance is generally in the 90% range with good performance across all the classes, which enables the self-training iteration to raise the accuracies to a level comparable to the test accuracy from supervised training on the full labeled training dataset.</p>
<p><xref ref-type="table" rid="T2">Table 2</xref> indicates that good class prototypes (i.e., sets 3, 6, and 7) result in test accuracies near 90% and low variance between runs. However, when some of the class prototypes are inferior, some of the training runs exhibit instabilities that cause lower averaged accuracies and higher variance. We provide a discussion in Section 4.5 on the cause of these instabilities and on how to improve these results.</p>
</sec>
<sec>
<title>4.3. Self-Training Iterations</title>
<p>The final four columns of <xref ref-type="table" rid="T2">Table 2</xref> list the results of performing one self-training iteration. The self-training was initialized with the original single labeled sample per class, plus the most confident pseudo-labeled examples from the BOSS training run that is highlighted in bold. For example, the &#x0201C;&#x0002B;5&#x0201D; column means that five pseudo-labeled examples per class were combined with the original labeled prototypes to make a set with a total of 60 labeled examples. These self-training results demonstrate that one-shot semi-supervised learning can reach performance comparable to the results from fully supervised training (i.e., 94.9%), often by adding as few as five samples per class. However, we expect that in practice, self-training by adding more samples per class will prove more reliable.</p>
</sec>
<sec>
<title>4.4. SVHN</title>
<p>SVHN is obtained from house numbers in Google Street View images and is used for recognizing digits (i.e., 0&#x02013;9) in natural scene images. Visual review of the images shows that the training samples are of poor quality (i.e., blurry) and often contain distractors (i.e., multiple digits in an image). Because of the quality issue, we needed to review several hundred unlabeled training samples in order to find the four class prototype sets that are reported in <xref ref-type="table" rid="T3">Table 3</xref>. Even though the SVHN training images are of poorer quality than the CIFAR-10 training images, one-shot semi-supervised learning with FixMatch on sets of prototypes produced higher test accuracies than with CIFAR-10. <xref ref-type="table" rid="T3">Table 3</xref> presents results for the SVHN dataset that are equivalent to those reported in <xref ref-type="table" rid="T2">Table 2</xref> for CIFAR-10. Since the results for FixMatch are all above 89%, we did not perform prototype refining on any of these sets. However, here too the class balancing methods increase the test accuracies above the FixMatch results. With these four class prototype sets, class balance method 1 produces the best results. The test accuracies from balance method 1 are &#x0007E;1% lower than the fully supervised results of 98.26&#x000B1;0.04%. The improvements from self-training were small and the best results fell about 0.5% below the results of fully supervised training. We believe the differences between CIFAR-10 and SVHN are related to the natures of the datasets.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>SVHN.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>BOSS balance method</bold></th>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>Self-training</bold></th>
</tr>
<tr>
<th valign="top" align="center"><bold>Set</bold></th>
<th valign="top" align="center"><bold>FixMatch</bold></th>
<th valign="top" align="center"><bold>1</bold></th>
<th valign="top" align="center"><bold>2</bold></th>
<th valign="top" align="center"><bold>3</bold></th>
<th valign="top" align="center"><bold>4</bold></th>
<th valign="top" align="center"><bold>&#x0002B;5</bold></th>
<th valign="top" align="center"><bold>&#x0002B;10</bold></th>
<th valign="top" align="center"><bold>&#x0002B;20</bold></th>
<th valign="top" align="center"><bold>&#x0002B;40</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">1</td>
<td valign="top" align="center">95.9 &#x000B1; 3</td>
<td valign="top" align="center"><bold>97.4 &#x000B1; 0.2</bold> </td>
<td valign="top" align="center">96.4 &#x000B1; 0.9</td>
<td valign="top" align="center">95.7 &#x000B1; 1.6</td>
<td valign="top" align="center">96.8 &#x000B1; 0.1</td>
<td valign="top" align="center">97.9</td>
<td valign="top" align="center">97.9</td>
<td valign="top" align="center">97.9</td>
<td valign="top" align="center">97.8</td>
</tr>
<tr>
<td valign="top" align="center">2</td>
<td valign="top" align="center">91.5 &#x000B1; 3</td>
<td valign="top" align="center"><bold>97.4 &#x000B1; 0.1</bold></td>
<td valign="top" align="center">97.1 &#x000B1; 0.1</td>
<td valign="top" align="center">97.1 &#x000B1; 0.1</td>
<td valign="top" align="center">95.6 &#x000B1; 0.1</td>
<td valign="top" align="center">94.1</td>
<td valign="top" align="center">97.9</td>
<td valign="top" align="center">97.6</td>
<td valign="top" align="center">97.7</td>
</tr>
<tr>
<td valign="top" align="center">3</td>
<td valign="top" align="center">93.9 &#x000B1; 0.1</td>
<td valign="top" align="center"><bold>97.3 &#x000B1; 0.3</bold></td>
<td valign="top" align="center">97.2 &#x000B1; 0.2</td>
<td valign="top" align="center">92 &#x000B1; 7</td>
<td valign="top" align="center">91.3 &#x000B1; 0.3</td>
<td valign="top" align="center">97.8</td>
<td valign="top" align="center">97.9</td>
<td valign="top" align="center">97.8</td>
<td valign="top" align="center">97.9</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">89.2 &#x000B1; 12</td>
<td valign="top" align="center"><bold>96.5 &#x000B1; 0.6</bold></td>
<td valign="top" align="center">90 &#x000B1; 10</td>
<td valign="top" align="center">89 &#x000B1; 11</td>
<td valign="top" align="center">83 &#x000B1; 16</td>
<td valign="top" align="center">97.6</td>
<td valign="top" align="center">96.7</td>
<td valign="top" align="center">97.0</td>
<td valign="top" align="center">98.0</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>BOSS methods are compared using four sets of class prototypes (i.e., 1 prototype per class) for SVHN. The FixMatch column shows results for the original FixMatch code on the prototype sets. The next four columns give the accuracy results for the class balance methods. Results are an average of test accuracies for four runs. The self-training iteration was performed on the results from the class balancing shown in bold</italic>.</p>
</table-wrap-foot>
</table-wrap></sec>
<sec>
<title>4.5. Investigation of Training Instabilities</title>
<p>In our experiments we observed that the performance of one-shot semi-supervised learning is highly sensitive to the choice of hyper-parameters and class prototype sets, which motivated us to investigate this matter in greater depth. Good choices for the prototypes, together with prototype refining, significantly reduced the instabilities and the variability of the results (i.e., few instabilities were encountered for CIFAR-10 prototype sets 3, 6, and 7, so the final accuracies were higher and the standard deviations of the results were lower). In sets where the performance was inferior, there was always at least one class that performed poorly. The hyper-parameter values likewise made a significant difference in the results.</p>
<p>We investigated the cases of poor performance and discovered two distinct situations. <xref ref-type="fig" rid="F1">Figure 1</xref> provides examples of the test accuracy during training for both. The blue curve shows a training run in which the network reaches a final test accuracy of only 77%. We hypothesize that in this situation the network is stuck in a poor local minimum caused by poor prototype choices, and that the result can be improved with prototype refining or hyper-parameter fine tuning. The red curve in <xref ref-type="fig" rid="F1">Figure 1</xref> is an example of the other case: training is dominated by instabilities (i.e., the model suddenly diverges during training) and the final test accuracy is 65%. We found that identifying which scenario is occurring is important when tuning the hyper-parameters, because the two scenarios call for opposite adjustments.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>An example of training to a poor local minimum (blue) and training with instabilities (red). Both end with poor test accuracies but for different reasons.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-880729-g0001.tif"/>
</fig>
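<p>As a rough illustration of how the two situations in <xref ref-type="fig" rid="F1">Figure 1</xref> might be told apart automatically, the following Python sketch flags a run as unstable when the test accuracy drops sharply between consecutive evaluations, and as a poor local minimum when the curve has no such drop but ends below a target accuracy. The function name <monospace>diagnose_curve</monospace>, the 10-point drop threshold, and the use of 85% as the target accuracy are assumptions made for this sketch; they are not part of the BOSS code.</p>
<preformat><![CDATA[
```python
from typing import Sequence


def diagnose_curve(
    test_acc: Sequence[float],
    drop_threshold: float = 10.0,  # assumed: a drop this large counts as a divergence
    target_acc: float = 85.0,      # assumed: final accuracy below this counts as poor
) -> str:
    """Label a test-accuracy curve (in percent) as 'instabilities', 'local_min', or 'ok'.

    Both thresholds are illustrative assumptions, not values prescribed by the paper.
    """
    if len(test_acc) < 2:
        raise ValueError("need at least two accuracy values")

    # Largest decrease between two consecutive evaluations.
    largest_drop = max(prev - cur for prev, cur in zip(test_acc, test_acc[1:]))

    if largest_drop >= drop_threshold:
        # A sudden collapse in accuracy: the training diverged at some point.
        return "instabilities"
    if test_acc[-1] < target_acc:
        # No collapse, but the run ended at a low accuracy.
        return "local_min"
    return "ok"
```
]]></preformat>
<p>For example, a curve that rises to 85% and then falls to 40% would be labeled as instabilities, whereas a curve that climbs steadily to 77% would be labeled as a poor local minimum.</p>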
<p>Our experiments with training instabilities (i.e., the red curve) suggested that they can be caused by too much class balancing. We hypothesize that when the model struggles to classify some of the classes, the class balancing methods can force the pseudo-labeling to mislabel samples in order to produce the appearance of class balance. In these cases, it is better to reduce the amount of class balancing by using a smaller value of &#x00394; for class balance methods 1 and 4, and a smaller value of &#x003BB;<sub><italic>u</italic></sub> for class balance methods 2 and 3. In addition, we observed that decreasing the weight decay (WD) and the learning rate (LR) improves performance when there are instabilities.</p>
<p>On the other hand, if the inferior performance is due to a poor local minimum (i.e., the blue curve), one can either improve the class prototypes (i.e., prototype refining) or increase the amount of class balancing. This is the opposite of what one should do for instabilities; that is, one can use a larger value of &#x00394; for class balance methods 1 and 4, use a larger value of &#x003BB;<sub><italic>u</italic></sub> for class balance methods 2 and 3, or increase the weight decay (WD) and the learning rate (LR). We also observed that it helps to increase &#x003C4; if there are instabilities and to decrease &#x003C4; in the poor local minimum situation.</p>
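<p>The tuning guidance above can be summarized as a small lookup, sketched here in Python. The function name <monospace>tuning_directions</monospace> and the string labels are hypothetical. The sketch encodes only the direction of each change described in the text and leaves the magnitudes to the practitioner; because the text presents these adjustments as alternatives, they need not all be applied at once.</p>
<preformat><![CDATA[
```python
from typing import Dict


def tuning_directions(diagnosis: str, balance_method: int) -> Dict[str, str]:
    """Return the direction in which to move each hyper-parameter.

    diagnosis: 'instabilities' (red curve) or 'local_min' (blue curve).
    balance_method: BOSS class balance method, 1 through 4.
    All names and labels here are hypothetical, chosen for illustration.
    """
    if diagnosis == "instabilities":
        # Too much class balancing: back off, and lower WD and LR.
        direction, tau_direction = "decrease", "increase"
    elif diagnosis == "local_min":
        # The opposite case: strengthen class balancing, raise WD and LR.
        direction, tau_direction = "increase", "decrease"
    else:
        raise ValueError("diagnosis must be 'instabilities' or 'local_min'")

    if balance_method in (1, 4):
        balance_knob = "delta"      # change in confidence threshold for minority classes
    elif balance_method in (2, 3):
        balance_knob = "lambda_u"   # unlabeled loss multiplicative factor
    else:
        raise ValueError("balance_method must be 1, 2, 3, or 4")

    return {
        balance_knob: direction,
        "weight_decay": direction,
        "learning_rate": direction,
        "tau": tau_direction,       # confidence threshold moves the other way
    }
```
]]></preformat>
<p>For instance, calling the function for instabilities with class balance method 4 suggests decreasing &#x00394;, WD, and LR and increasing &#x003C4;.</p>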
<p><xref ref-type="table" rid="T4">Table 4</xref> demonstrates how to improve the results presented in <xref ref-type="table" rid="T2">Table 2</xref> (for consistency we used the same hyper-parameter values for all of the class balance runs shown in <xref ref-type="table" rid="T2">Table 2</xref>). <xref ref-type="table" rid="T4">Table 4</xref> contains the results of hyper-parameter fine tuning for cases where we earlier reported test accuracies below 85%. We list the class prototype set (Set), the BOSS class balancing method (Balance), weight decay (WD), initial learning rate (LR), the change in the confidence threshold for minority classes (&#x00394;), the unlabeled loss multiplicative factor (&#x003BB;<sub><italic>u</italic></sub>), the confidence threshold (&#x003C4;), and the final test accuracy. The description column either indicates whether the training curve displays instabilities (i.e., the red curve in <xref ref-type="fig" rid="F1">Figure 1</xref>) or a poor local minimum (i.e., the blue curve), or names the hyper-parameters that were tuned to improve the performance.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Illustration of the sensitivity to the hyper-parameters WD, LR, &#x00394;, &#x003BB;<sub><italic>u</italic></sub>, and &#x003C4;. See the text for guidance on how to tune these hyper-parameters for situations with inferior performance due to instabilities or poor local minima.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="center"><bold>Set</bold></th>
<th valign="top" align="center"><bold>Balance</bold></th>
<th valign="top" align="left"><bold>Description</bold></th>
<th valign="top" align="center"><bold>WD</bold></th>
<th valign="top" align="center"><bold>LR</bold></th>
<th valign="top" align="center"><bold>&#x00394;</bold></th>
<th valign="top" align="center"><bold>&#x003BB;<sub><italic>u</italic></sub></bold></th>
<th valign="top" align="center"><bold>&#x003C4;</bold></th>
<th valign="top" align="center"><bold>Accuracy (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">1</td>
<td valign="top" align="center">3</td>
<td valign="top" align="left">Instabilities</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">84 &#x000B1; 6</td>
</tr>
<tr>
<td valign="top" align="center">1</td>
<td valign="top" align="center">3</td>
<td valign="top" align="left">Decrease &#x003BB;<sub><italic>u</italic></sub></td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.5</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">87 &#x000B1; 1</td>
</tr>
<tr>
<td valign="top" align="center">2</td>
<td valign="top" align="center">4</td>
<td valign="top" align="left">Instabilities</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0.25</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">80 &#x000B1; 14</td>
</tr>
<tr>
<td valign="top" align="center">2</td>
<td valign="top" align="center">4</td>
<td valign="top" align="left">Decrease &#x00394;, WD, LR</td>
<td valign="top" align="center">6 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.04</td>
<td valign="top" align="center">0.1</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">94.5 &#x000B1; 0.1</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">1</td>
<td valign="top" align="left">Local min</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0.25</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">77.5 &#x000B1; 0.1</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">1</td>
<td valign="top" align="left">Increase &#x00394;, &#x003C4;</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0.3</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">93.2 &#x000B1; 0.2</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">2</td>
<td valign="top" align="left">Local min</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">81 &#x000B1; 6</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">2</td>
<td valign="top" align="left">Increase &#x003BB;<sub><italic>u</italic></sub></td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">92 &#x000B1; 2</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">3</td>
<td valign="top" align="left">Local min</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">81 &#x000B1; 8</td>
</tr>
<tr>
<td valign="top" align="center">4</td>
<td valign="top" align="center">3</td>
<td valign="top" align="left">Increase &#x003BB;<sub><italic>u</italic></sub></td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">88 &#x000B1; 3</td>
</tr>
<tr>
<td valign="top" align="center">5</td>
<td valign="top" align="center">1</td>
<td valign="top" align="left">Instabilities</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0.25</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">86 &#x000B1; 7</td>
</tr>
<tr>
<td valign="top" align="center">5</td>
<td valign="top" align="center">1</td>
<td valign="top" align="left">Decrease &#x00394;</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0.1</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">90.7 &#x000B1; 0.1</td>
</tr>
<tr>
<td valign="top" align="center">5</td>
<td valign="top" align="center">2</td>
<td valign="top" align="left">Instabilities</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">89 &#x000B1; 6</td>
</tr>
<tr>
<td valign="top" align="center">5</td>
<td valign="top" align="center">2</td>
<td valign="top" align="left">Decrease &#x003BB;<sub><italic>u</italic></sub></td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">91.7 &#x000B1; 1</td>
</tr>
<tr>
<td valign="top" align="center">5</td>
<td valign="top" align="center">3</td>
<td valign="top" align="left">Instabilities</td>
<td valign="top" align="center">8 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.06</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">83 &#x000B1; 10</td>
</tr>
<tr>
<td valign="top" align="center">5</td>
<td valign="top" align="center">3</td>
<td valign="top" align="left">Decrease WD, LR</td>
<td valign="top" align="center">6 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="center">0.04</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="center">93.5 &#x000B1; 2</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The examples in <xref ref-type="table" rid="T4">Table 4</xref> show improved results both for instabilities and for poor local minima. The examples include modifying &#x00394;, &#x003BB;<sub><italic>u</italic></sub>, weight decay, learning rate, and &#x003C4;. In most cases the final accuracies improve substantially with small changes in the hyper-parameter values. This demonstrates the sensitivity of one-shot semi-supervised learning to hyper-parameter values.</p>
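<p>The size of these improvements can be computed directly from <xref ref-type="table" rid="T4">Table 4</xref>. The following Python sketch copies the mean test accuracy of each baseline run and of its tuned counterpart from the table (ignoring the standard deviations) and computes the gain per configuration. The variable and function names are hypothetical and serve only this illustration.</p>
<preformat><![CDATA[
```python
from typing import Dict, Tuple

# (prototype set, class balance method): (baseline accuracy %, tuned accuracy %),
# mean values copied from Table 4; standard deviations are omitted.
TABLE4: Dict[Tuple[int, int], Tuple[float, float]] = {
    (1, 3): (84.0, 87.0),
    (2, 4): (80.0, 94.5),
    (4, 1): (77.5, 93.2),
    (4, 2): (81.0, 92.0),
    (4, 3): (81.0, 88.0),
    (5, 1): (86.0, 90.7),
    (5, 2): (89.0, 91.7),
    (5, 3): (83.0, 93.5),
}


def accuracy_gains(
    table: Dict[Tuple[int, int], Tuple[float, float]]
) -> Dict[Tuple[int, int], float]:
    """Gain in percentage points from hyper-parameter tuning, per configuration."""
    return {key: tuned - base for key, (base, tuned) in table.items()}


gains = accuracy_gains(TABLE4)
mean_gain = sum(gains.values()) / len(gains)
largest = max(gains, key=gains.get)    # configuration with the biggest gain
smallest = min(gains, key=gains.get)   # configuration with the smallest gain

print(f"mean gain: {mean_gain:.1f} points")
print(f"largest gain: set {largest[0]}, method {largest[1]}")
print(f"smallest gain: set {smallest[0]}, method {smallest[1]}")
```
]]></preformat>
<p>With these values the gains range from 2.7 to 15.7 percentage points, with an average of about 8.6 points; the largest gain is for prototype set 4 with class balance method 1.</p>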
<p>While this sensitivity can be challenging in practice, it also creates new opportunities. For example, researchers often propose new network architectures, loss functions, and optimization functions that are tested in the fully supervised regime, where small performance gains are used to claim a new state-of-the-art. If these algorithms were instead tested in one-shot semi-supervised learning, more substantial differences in performance would better differentiate the methods. Along these lines, we also advocate the use of one-shot semi-supervised learning with AutoML and neural architecture search (NAS) (Elsken et al., <xref ref-type="bibr" rid="B6">2018</xref>) to find optimal hyper-parameters and architectures.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusions</title>
<p>The BOSS methodology relies on simple concepts: choosing iconic training samples with minimal background distractors, employing class balancing techniques, and self-training with the highest confidence pseudo-labeled samples. Our experiments in Section 4 demonstrate the potential of training a network with only one sample per class and we have confirmed the importance of class balancing methods. While our methods have limitations (as discussed in the <xref ref-type="supplementary-material" rid="SM1">Appendix</xref>), this paper breaks new ground in one-shot semi-supervised learning and attains high performance. BOSS brings one-shot and few-shot semi-supervised learning closer to reality.</p>
<p>We proposed the novel concept of class balancing on unlabeled data. We introduced a novel way to measure class imbalance with unlabeled data and proposed four class balancing methods that improve the performance of semi-supervised learning. In addition, we investigated hyper-parameter sensitivity and the causes of weak performance (i.e., training instabilities and poor local minima), for which we proposed two opposite sets of solutions.</p>
<p>Our work provides researchers with the following observations and insights:</p>
<list list-type="order">
<list-item><p>There is evidence that labeling a large number of samples might not be necessary for training deep neural networks to high levels of performance.</p></list-item>
<list-item><p>All networks have a class imbalance problem to some degree. Examining class accuracies relative to each other provides insights into the network&#x00027;s training.</p></list-item>
<list-item><p>Each training sample can affect the training. One-shot semi-supervised learning provides a mechanism to study the atomic impact of a single sample. This opens up the opportunity to investigate the factors in a sample that help or hurt training performance.</p></list-item>
</list>
<p>Training neural networks for image classification with only one labeled example per class remains a barely studied field. Future work includes applying the BOSS methodology to more complex image classification datasets, such as ImageNet and STL-10, which, as far as we know, has not been investigated. While we do not expect to reach the same test accuracies as with fully supervised training, we anticipate that substantial gains are possible. Our work lays the foundation for one-shot learning and opens the door to future research.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="sec" rid="s9">Supplementary Material</xref>; further inquiries can be directed to the corresponding author/s.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>AC tested the final software to verify the reproducibility of the results in this article and reviewed the content of this manuscript. Everything else related to this investigation and manuscript was done by LS. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s8">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack><p>The authors wish to acknowledge that this work was supported by the US Naval Research Laboratory&#x00027;s Base program.</p>
</ack><sec sec-type="supplementary-material" id="s9">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frai.2022.880729/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frai.2022.880729/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/></sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Antoniou</surname> <given-names>A.</given-names></name> <name><surname>Storkey</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Assume, augment and learn: unsupervised few-shot meta-learning via random labels and data augmentation</article-title>. <source>arXiv [Preprint]</source>. arXiv:1902.09884. <pub-id pub-id-type="doi">10.48550/arXiv.1902.09884</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berthelot</surname> <given-names>D.</given-names></name> <name><surname>Carlini</surname> <given-names>N.</given-names></name> <name><surname>Cubuk</surname> <given-names>E. D.</given-names></name> <name><surname>Kurakin</surname> <given-names>A.</given-names></name> <name><surname>Sohn</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2019a</year>). <article-title>Remixmatch: semi-supervised learning with distribution alignment and augmentation anchoring</article-title>. <source>arXiv [Preprint]</source>. arXiv:1911.09785. <pub-id pub-id-type="doi">10.48550/arXiv.1911.09785</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berthelot</surname> <given-names>D.</given-names></name> <name><surname>Carlini</surname> <given-names>N.</given-names></name> <name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Papernot</surname> <given-names>N.</given-names></name> <name><surname>Oliver</surname> <given-names>A.</given-names></name> <name><surname>Raffel</surname> <given-names>C. A.</given-names></name></person-group> (<year>2019b</year>). <article-title>&#x0201C;Mixmatch: a holistic approach to semi-supervised learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <volume>Vancouver</volume>, <fpage>5050</fpage>&#x02013;<lpage>5060</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chapelle</surname> <given-names>O.</given-names></name> <name><surname>Scholkopf</surname> <given-names>B.</given-names></name> <name><surname>Zien</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]</article-title>. <source>IEEE Trans. Neural Netw</source>. <volume>20</volume>, <fpage>542</fpage>&#x02013;<lpage>542</lpage>. <pub-id pub-id-type="doi">10.1109/TNN.2009.2015974</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cubuk</surname> <given-names>E. D.</given-names></name> <name><surname>Zoph</surname> <given-names>B.</given-names></name> <name><surname>Shlens</surname> <given-names>J.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2019</year>). <article-title>Randaugment: practical data augmentation with no separate search</article-title>. <source>arXiv [Preprint]</source>. arXiv:1909.13719. <pub-id pub-id-type="doi">10.1109/CVPRW50498.2020.00359</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elsken</surname> <given-names>T.</given-names></name> <name><surname>Metzen</surname> <given-names>J. H.</given-names></name> <name><surname>Hutter</surname> <given-names>F.</given-names></name></person-group> (<year>2018</year>). <article-title>Neural architecture search: a survey</article-title>. <source>arXiv [Preprint]</source>. arXiv:1808.05377. <pub-id pub-id-type="doi">10.1007/978-3-030-05318-5_3</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Finn</surname> <given-names>C.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Model-agnostic meta-learning for fast adaptation of deep networks,&#x0201D;</article-title> in <source>Proceedings of the 34th International Conference on Machine Learning</source>, Sydney Australia, Vol. <volume>70</volume>, <fpage>1126</fpage>&#x02013;<lpage>1135</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fu</surname> <given-names>M.</given-names></name> <name><surname>Cao</surname> <given-names>Y.-H.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name></person-group> (<year>2022</year>). <article-title>Worst case matters for few-shot recognition</article-title>. <source>arXiv [Preprint]</source>. arXiv:2203.06574. <pub-id pub-id-type="doi">10.48550/arXiv.2203.06574</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gastaldi</surname> <given-names>X</given-names></name></person-group>. (<year>2017</year>). <article-title>Shake-shake regularization</article-title>. <source>arXiv [Preprint]</source>. arXiv:1705.07485. <pub-id pub-id-type="doi">10.48550/arXiv.1705.07485</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grandvalet</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x0201C;Semi-supervised learning by entropy minimization,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <volume>Vancouver</volume>, <fpage>529</fpage>&#x02013;<lpage>536</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hsu</surname> <given-names>K.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name> <name><surname>Finn</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>Unsupervised learning via meta-learning</article-title>. <source>arXiv [Preprint]</source>. arXiv:1810.02334. <pub-id pub-id-type="doi">10.48550/arXiv.1810.02334</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>Z.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name> <name><surname>Nevatia</surname> <given-names>R.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Simple: similar pseudo label exploitation for semi-supervised classification,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, Nashville, <volume>TN</volume>, <fpage>15099</fpage>&#x02013;<lpage>15108</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR46437.2021.01485</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>J. M.</given-names></name> <name><surname>Khoshgoftaar</surname> <given-names>T. M.</given-names></name></person-group> (<year>2019</year>). <article-title>Survey on deep learning with class imbalance</article-title>. <source>J. Big Data</source> <volume>6</volume>:<fpage>27</fpage>. <pub-id pub-id-type="doi">10.1186/s40537-019-0192-5</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Koch</surname> <given-names>G.</given-names></name> <name><surname>Zemel</surname> <given-names>R.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Siamese neural networks for one-shot image recognition,&#x0201D;</article-title> in <source>ICML Deep Learning Workshop, Vol. 2</source> (<publisher-loc>Lille</publisher-loc>).</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2009</year>). <source>Learning multiple layers of features from tiny images (Technical Report)</source>. University of Toronto.</citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>D.-H</given-names></name></person-group>. (<year>2013</year>). <article-title>&#x0201C;Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks,&#x0201D;</article-title> in <source>Workshop on Challenges in Representation Learning, ICML, Atlanta, GA</source>, Vol. <volume>3</volume>, <fpage>2</fpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Xiong</surname> <given-names>C.</given-names></name> <name><surname>Hoi</surname> <given-names>S. C.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Comatch: semi-supervised learning with contrastive graph regularization,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, Montreal, <volume>CA</volume>, <fpage>9475</fpage>&#x02013;<lpage>9484</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00934</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lucas</surname> <given-names>T.</given-names></name> <name><surname>Weinzaepfel</surname> <given-names>P.</given-names></name> <name><surname>Rogez</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Barely-supervised learning: semi-supervised learning with very few labeled images</article-title>. <source>arXiv [Preprint]</source>. arXiv:2112.12004.</citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Netzer</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>T.</given-names></name> <name><surname>Coates</surname> <given-names>A.</given-names></name> <name><surname>Bissacco</surname> <given-names>A.</given-names></name> <name><surname>Wu</surname> <given-names>B.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Reading digits in natural images with unsupervised feature learning,&#x0201D;</article-title> in <source>NIPS Workshop on Deep Learning and Unsupervised Feature Learning</source>, vol. <volume>2011</volume> (Granada).</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rosenberg</surname> <given-names>C.</given-names></name> <name><surname>Hebert</surname> <given-names>M.</given-names></name> <name><surname>Schneiderman</surname> <given-names>H.</given-names></name></person-group> (<year>2005</year>). <article-title>Semi-supervised self-training of object detection models</article-title>. <source>WACV/Motion</source> <fpage>2</fpage>. <pub-id pub-id-type="doi">10.1109/ACVMOT.2005.107</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sajjadi</surname> <given-names>M.</given-names></name> <name><surname>Javanmardi</surname> <given-names>M.</given-names></name> <name><surname>Tasdizen</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Regularization with stochastic transformations and perturbations for deep semi-supervised learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <volume>Barcelona</volume>, <fpage>1163</fpage>&#x02013;<lpage>1171</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Smith</surname> <given-names>L. N.</given-names></name> <name><surname>Conovaloff</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Empirical perspectives on one-shot semi-supervised learning</article-title>. <source>arXiv [Preprint]</source>. arXiv:2004.04141. <pub-id pub-id-type="doi">10.48550/arXiv.2004.04141</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Snell</surname> <given-names>J.</given-names></name> <name><surname>Swersky</surname> <given-names>K.</given-names></name> <name><surname>Zemel</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Prototypical networks for few-shot learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, Long Beach, <volume>CA</volume>, <fpage>4077</fpage>&#x02013;<lpage>4087</lpage>.<pub-id pub-id-type="pmid">34495842</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sohn</surname> <given-names>K.</given-names></name> <name><surname>Berthelot</surname> <given-names>D.</given-names></name> <name><surname>Carlini</surname> <given-names>N.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Raffel</surname> <given-names>C. A.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>FixMatch: simplifying semi-supervised learning with consistency and confidence</article-title>. <source>Adv. Neural Inform. Process. Syst</source>. <volume>33</volume>, <fpage>596</fpage>&#x02013;<lpage>608</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.2001.07685</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>Y.</given-names></name> <name><surname>Kamel</surname> <given-names>M. S.</given-names></name> <name><surname>Wong</surname> <given-names>A. K.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name></person-group> (<year>2007</year>). <article-title>Cost-sensitive boosting for classification of imbalanced data</article-title>. <source>Pattern Recogn</source>. <volume>40</volume>, <fpage>3358</fpage>&#x02013;<lpage>3378</lpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2007.04.009</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Triguero</surname> <given-names>I.</given-names></name> <name><surname>Garc&#x000ED;a</surname> <given-names>S.</given-names></name> <name><surname>Herrera</surname> <given-names>F.</given-names></name></person-group> (<year>2015</year>). <article-title>Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study</article-title>. <source>Knowl. Inform. Syst</source>. <volume>42</volume>, <fpage>245</fpage>&#x02013;<lpage>284</lpage>. <pub-id pub-id-type="doi">10.1007/s10115-013-0706-y</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Van Engelen</surname> <given-names>J. E.</given-names></name> <name><surname>Hoos</surname> <given-names>H. H.</given-names></name></person-group> (<year>2020</year>). <article-title>A survey on semi-supervised learning</article-title>. <source>Mach. Learn</source>. <volume>109</volume>, <fpage>373</fpage>&#x02013;<lpage>440</lpage>. <pub-id pub-id-type="doi">10.1007/s10994-019-05855-6</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Blundell</surname> <given-names>C.</given-names></name> <name><surname>Lillicrap</surname> <given-names>T.</given-names></name> <name><surname>Kavukcuoglu</surname> <given-names>K.</given-names></name> <name><surname>Wierstra</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Matching networks for one shot learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <volume>Barcelona</volume>, <fpage>3630</fpage>&#x02013;<lpage>3638</lpage>.</citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Yao</surname> <given-names>X.</given-names></name></person-group> (<year>2012</year>). <article-title>Multiclass imbalance problems: analysis and potential solutions</article-title>. <source>IEEE Trans. Syst. Man Cybern. Part B</source> <volume>42</volume>, <fpage>1119</fpage>&#x02013;<lpage>1130</lpage>. <pub-id pub-id-type="doi">10.1109/TSMCB.2012.2187280</pub-id><pub-id pub-id-type="pmid">22438514</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xie</surname> <given-names>Q.</given-names></name> <name><surname>Hovy</surname> <given-names>E.</given-names></name> <name><surname>Luong</surname> <given-names>M.-T.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2019</year>). <article-title>Self-training with noisy student improves ImageNet classification</article-title>. <source>arXiv [Preprint]</source>. arXiv:1911.04252. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.01070</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zagoruyko</surname> <given-names>S.</given-names></name> <name><surname>Komodakis</surname> <given-names>N.</given-names></name></person-group> (<year>2016</year>). <article-title>Wide residual networks</article-title>. <source>arXiv [Preprint]</source>. arXiv:1605.07146. <pub-id pub-id-type="doi">10.5244/C.30.87</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhai</surname> <given-names>X.</given-names></name> <name><surname>Oliver</surname> <given-names>A.</given-names></name> <name><surname>Kolesnikov</surname> <given-names>A.</given-names></name> <name><surname>Beyer</surname> <given-names>L.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;S4L: self-supervised semi-supervised learning,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source>, <volume>Venice</volume>, <fpage>1476</fpage>&#x02013;<lpage>1485</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00156</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Cisse</surname> <given-names>M.</given-names></name> <name><surname>Dauphin</surname> <given-names>Y. N.</given-names></name> <name><surname>Lopez-Paz</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>mixup: beyond empirical risk minimization</article-title>. <source>arXiv [Preprint]</source>. arXiv:1710.09412. <pub-id pub-id-type="doi">10.48550/arXiv.1710.09412</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>X.</given-names></name> <name><surname>Goldberg</surname> <given-names>A. B.</given-names></name></person-group> (<year>2009</year>). <article-title>Introduction to semi-supervised learning</article-title>. <source>Synthes. Lect. Artif. Intell. Mach. Learn</source>. <volume>3</volume>, <fpage>1</fpage>&#x02013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.2200/S00196ED1V01Y200906AIM006</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>X. J.</given-names></name></person-group> (<year>2005</year>). <source>Semi-Supervised Learning Literature Survey</source>. Technical report, University of Wisconsin-Madison Department of Computer Sciences.</citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>Available at <ext-link ext-link-type="uri" xlink:href="https://www.cs.toronto.edu/&#x0007E;kriz/cifar.html">https://www.cs.toronto.edu/&#x0007E;kriz/cifar.html</ext-link>.</p></fn>
<fn id="fn0002"><p><sup>2</sup>Available at <ext-link ext-link-type="uri" xlink:href="http://ufldl.stanford.edu/housenumbers/">http://ufldl.stanford.edu/housenumbers/</ext-link>.</p></fn>
<fn id="fn0003"><p><sup>3</sup>We gratefully acknowledge the use of the code provided by the FixMatch authors at <ext-link ext-link-type="uri" xlink:href="https://github.com/google-research/fixmatch">https://github.com/google-research/fixmatch</ext-link>.</p></fn>
</fn-group>
</back>
</article>