<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2022.859610</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Generative Adversarial Training for Supervised and Semi-supervised Learning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Xianmin</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1577564/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Li</surname> <given-names>Jing</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1645474/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>Qi</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhao</surname> <given-names>Wenpeng</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Zuoyong</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Wenhao</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Institute of Artificial Intelligence and Blockchain, Guangzhou University</institution>, <addr-line>Guangzhou</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University</institution>, <addr-line>Fuzhou</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences</institution>, <addr-line>Beijing</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Song Deng, Nanjing University of Posts and Telecommunications, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Yi He, Old Dominion University, United States; Lina Yao, University of New South Wales, Australia</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jing Li <email>lijing&#x00040;gzhu.edu.cn</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Original Research Article, a section of the journal Frontiers in Neurorobotics</p></fn></author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>03</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>859610</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>01</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>02</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Wang, Li, Liu, Zhao, Li and Wang.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Wang, Li, Liu, Zhao, Li and Wang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Neural networks have played critical roles in many research fields. The recently proposed adversarial training (AT) can improve the generalization ability of neural networks by adding intentional perturbations in the training process, but it sometimes still fails to generate worst-case perturbations, which limits the improvement. Instead of designing a specific smoothness function and seeking an approximate solution, as existing AT methods do, we propose a new training methodology, named Generative AT (GAT) in this article, for supervised and semi-supervised learning. The key idea of GAT is to formulate the learning task as a minimax game, in which the perturbation generator aims to yield the worst-case perturbations that maximize the deviation of the output distribution, while the target classifier aims to minimize both the impact of these perturbations and the prediction error. To solve this minimax optimization problem, a new adversarial loss function is constructed based on the cross-entropy measure. As a result, the smoothness and confidence of the model are both greatly improved. Moreover, we develop a trajectory-preserving-based alternating update strategy to enable the stable training of GAT. Numerous experiments conducted on benchmark datasets demonstrate that the proposed GAT significantly outperforms the state-of-the-art AT methods on supervised and semi-supervised learning tasks, especially when the number of labeled examples is rather small in semi-supervised learning.</p></abstract>
<kwd-group>
<kwd>neural networks</kwd>
<kwd>adversarial training</kwd>
<kwd>generative AT</kwd>
<kwd>worst-case perturbations</kwd>
<kwd>smoothness function</kwd>
<kwd>trajectory-preserving-based alternating update strategy</kwd>
</kwd-group>
<contract-num rid="cn001">62002076</contract-num>
<contract-num rid="cn001">62072127</contract-num>
<contract-num rid="cn002">201904010493</contract-num>
<contract-num rid="cn002">202002030131</contract-num>
<contract-num rid="cn003">2019A1515110213</contract-num>
<contract-num rid="cn004">2020A1515010423</contract-num>
<contract-sponsor id="cn001">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor>
<contract-sponsor id="cn002">Guangzhou Municipal Science and Technology Project<named-content content-type="fundref-id">10.13039/501100010256</named-content></contract-sponsor>
<contract-sponsor id="cn003">Natural Science Foundation of Guangdong Province for Distinguished Young Scholars<named-content content-type="fundref-id">10.13039/501100018540</named-content></contract-sponsor>
<contract-sponsor id="cn004">Natural Science Foundation of Guangdong Province<named-content content-type="fundref-id">10.13039/501100003453</named-content></contract-sponsor>
<counts>
<fig-count count="5"/>
<table-count count="3"/>
<equation-count count="17"/>
<ref-count count="36"/>
<page-count count="10"/>
<word-count count="7065"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Neural networks have brought profound changes to various fields, such as intelligent driving (Feng et al., <xref ref-type="bibr" rid="B9">2021</xref>), neuro-inspired computing (Zhang et al., <xref ref-type="bibr" rid="B35">2020</xref>; Deng et al., <xref ref-type="bibr" rid="B6">2021b</xref>), smart health (Khan et al., <xref ref-type="bibr" rid="B14">2021</xref>), and human-computer interaction (Deng et al., <xref ref-type="bibr" rid="B5">2021a</xref>; Pustejovsky and Krishnaswamy, <xref ref-type="bibr" rid="B23">2021</xref>; Fang et al., <xref ref-type="bibr" rid="B8">2022</xref>). However, in practical classification and regression applications (Wu et al., <xref ref-type="bibr" rid="B30">2021a</xref>), the number of training examples is finite, so the error rate measured on the training examples may deviate considerably from the one measured on the test examples. This gap is the overfitting problem (Wu et al., <xref ref-type="bibr" rid="B32">2021b</xref>), which greatly impairs the generalization performance of neural networks. To prevent neural networks from overfitting, one popular approach is to augment the loss function with a regularization term, which encourages the model to be less dependent on the empirical risk over the finite training examples. Based on Bayesian theory, this regularization term can be interpreted as a prior distribution reflecting the preconceived notion of the model (Bishop and Nasser, <xref ref-type="bibr" rid="B1">2006</xref>; Wu et al., <xref ref-type="bibr" rid="B31">2020</xref>). Accordingly, the prior distribution of a model is usually assumed to be smooth. That is to say, the outputs of a naturally occurring system tend to be smooth with respect to the spatial or temporal inputs (Wahba, <xref ref-type="bibr" rid="B28">1990</xref>). This assumption implies that data points close to each other are highly likely to receive the same predictions. 
Unfortunately, recent studies show that most neural networks misclassify some data points that differ only slightly from correctly classified data points (Goodfellow et al., <xref ref-type="bibr" rid="B11">2014b</xref>; Strauss et al., <xref ref-type="bibr" rid="B25">2017</xref>; Yuan et al., <xref ref-type="bibr" rid="B33">2019</xref>). These misclassified data points are called adversarial examples, which are crafted by adding imperceptible perturbations to natural examples in the input space.</p>
<p>To overcome the vulnerability of neural networks to small but malicious perturbations, adversarial training (AT) has been proposed (Goodfellow et al., <xref ref-type="bibr" rid="B11">2014b</xref>; Wang et al., <xref ref-type="bibr" rid="B29">2019</xref>; Cui et al., <xref ref-type="bibr" rid="B3">2021</xref>; Zhang et al., <xref ref-type="bibr" rid="B34">2022</xref>). AT aims to smooth the model outputs by penalizing the deviations caused by the adversarial perturbations. The major challenge of AT is how to accurately estimate the perturbations that alter the output distribution around the input data points. To this end, several perturbation-based methods have been proposed, each solving an internal optimization problem at the current state of the model. For instance, random AT (RAT) (Zheng et al., <xref ref-type="bibr" rid="B36">2016</xref>) improves the model smoothness by adding randomly generated perturbations to the input data. These perturbed data points are encouraged to produce the same predictions as their corresponding unperturbed versions. Since the perturbations around the input appear in random directions, RAT is referred to as an isotropic smoothing approach. However, isotropic smoothing has been shown to leave the model particularly sensitive to adversarial examples (Szegedy et al., <xref ref-type="bibr" rid="B26">2013</xref>; Goodfellow et al., <xref ref-type="bibr" rid="B11">2014b</xref>). Based on this consideration, Goodfellow et al. (<xref ref-type="bibr" rid="B11">2014b</xref>) proposed standard AT (SAT). SAT is an anisotropic method that smooths the output distribution by making the model robust against perturbations in a specific direction. This specific direction in the input space is called the adversarial direction, in which the output of the model is the most sensitive. 
To identify the perturbations in the adversarial direction, SAT first formulates an objective function based on the differences between the prediction and correct labels and then solves this function with an efficient Frank-Wolfe optimizer. SAT requires the use of labels when calculating the adversarial perturbations. Hence, SAT cannot be applied to the regime of semi-supervised learning. Virtual AT (VAT) (Miyato et al., <xref ref-type="bibr" rid="B21">2018</xref>) extends the notion of SAT in the sense that it defines the adversarial direction without label information, and thus can be applied to both supervised and semi-supervised learning tasks. We observe that in order to generate the adversarial perturbations, the existing AT methods explicitly define a smoothness function to regularize the neural networks. This leads to two limitations. First, it is extremely difficult to find a universal smoothness function due to the various output patterns and distance metrics. Second, there is no analytical solution to such a box-constrained function. Consequently, a numerical method is generally used to seek an approximate solution, which greatly affects the performance of identifying the worst-case adversarial perturbations.</p>
<p>Different from previous methodologies, we propose a novel AT methodology, named generative AT (GAT) in this article, to improve the smoothness of the output distribution of neural networks for supervised and semi-supervised learning tasks. The objective of the proposed GAT is to train the target classifier such that it not only achieves the minimum prediction error but also has the best robustness against adversarial perturbations. To this end, we formalize the regularization process as a minimax game. Specifically, we use the cross-entropy measure to construct a new <italic>adversarial loss</italic> function. Moreover, we develop an effective alternating update strategy to optimize the challenging non-convex problem. The experimental results on benchmark datasets show that the proposed GAT reaches the empirical equilibrium point and achieves state-of-the-art performance.</p>
<p>The main contributions of this article are summarized as follows:</p>
<list list-type="bullet">
<list-item><p>We formulate the regularization of the learning task as a minimax game according to the outputs of the target classifier on the natural example and on its adversarial version derived by a perturbation generator. As the game approaches the empirical equilibrium, the target classifier achieves the best performance.</p></list-item>
<list-item><p>A new <italic>adversarial loss</italic> function is constructed based on the <italic>cross entropy</italic> method, which not only accurately reflects the deviation caused by the perturbation but also efficiently assesses the confidence of network output.</p></list-item>
<list-item><p>An effective alternating update strategy based on trajectory preserving is proposed to stabilize the minimax optimization.</p></list-item>
<list-item><p>The proposed GAT regularizes the model without label information; hence, it can be applied to both supervised and semi-supervised learning tasks.</p></list-item>
</list>
<p>It is worth emphasizing that our method differs from the generative-model-based AT methods (Kingma et al., <xref ref-type="bibr" rid="B15">2014</xref>; Maal&#x000F8;e et al., <xref ref-type="bibr" rid="B19">2016</xref>; Salimans et al., <xref ref-type="bibr" rid="B24">2016</xref>; Dai et al., <xref ref-type="bibr" rid="B4">2017</xref>). This family of methods is considered to be an improvement of the Generative Adversarial Network (GAN), in the sense that the target classifier in their frameworks is an extension of the GAN&#x00027;s discriminator, which serves to distinguish the natural examples from the generated ones. In our method, the discriminator is not the target classifier; instead, it is manually designed according to the outputs of the target classifier over the natural example and its adversarial version.</p>
</sec>
<sec id="s2">
<title>2. Problem Setting and Related Works</title>
<p>Without loss of generality, we consider the classification tasks in a semi-supervised setting. Let <inline-formula><mml:math id="M1"><mml:mi>x</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> be the input vector with <italic>I</italic>-dimension and <inline-formula><mml:math id="M2"><mml:mi>y</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> be the one-hot vector of labels with <italic>K</italic> categories. <inline-formula><mml:math id="M3"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">|</mml:mo><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math 
id="M4"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">|</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> denote the labeled and unlabeled datasets, respectively, where <italic>N</italic><sup><italic>l</italic></sup> and <italic>N</italic><sup><italic>ul</italic></sup> are the numbers of labeled and unlabeled examples. AT regularizes the neural network such that both the natural and perturbed examples yield the intended predictions. That is, we aim to learn a mapping &#x1D53D;:<italic>X</italic> &#x02192; [0, 1]<sup><italic>K</italic></sup> parameterized by &#x003B8; &#x02208; &#x00398; by solving the following optimization problem</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo class="qopname">min</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The symbol <inline-formula><mml:math id="M6"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in Equation 1 represents the <italic>supervised loss</italic> over the labeled dataset, which can be expanded as</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>&#x0007E;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mo>&#x00393;</mml:mo><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M8"><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the output distribution vector of the neural network on the input <italic>x</italic><sup><italic>l</italic></sup> given the model parameter &#x003B8;, and <italic>y</italic><sup><italic>l</italic></sup> is the one-hot vector of the true label for <italic>x</italic><sup><italic>l</italic></sup>. The operator &#x00393;(&#x000B7;, &#x000B7;) denotes the distance measure used to evaluate the similarity of two distributions. A common choice of &#x00393; for the supervised loss <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the <italic>cross entropy</italic>. <inline-formula><mml:math id="M10"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the <italic>adversarial loss</italic>, which serves as a regularization term for promoting the smoothness of the model. The <italic>adversarial loss</italic> plays an important role in enhancing the generalization performance when the number of labeled examples is small relative to the total number of training examples (i.e., <italic>N</italic><sup><italic>l</italic></sup> &#x0226A; <italic>N</italic><sup><italic>ul</italic></sup>&#x0002B;<italic>N</italic><sup><italic>l</italic></sup>). 
&#x003BB; is a non-negative value that controls the relative balance between the <italic>supervised loss</italic> and the <italic>adversarial loss</italic>.</p>
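<p>As a minimal sketch of how Equations 1 and 2 fit together, the following Python snippet computes the empirical supervised loss with cross entropy as &#x00393; and adds a weighted adversarial term. The function names, the small constant <monospace>eps</monospace> used to avoid taking the logarithm of zero, and the toy numbers are illustrative assumptions, not part of the original method; the adversarial loss is simply passed in as a precomputed scalar.</p>
<preformat preformat-type="code">
```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def supervised_loss(labels, predictions):
    """Empirical form of Equation 2: mean cross entropy over labeled examples."""
    losses = [cross_entropy(y, f) for y, f in zip(labels, predictions)]
    return sum(losses) / len(losses)


def total_objective(sup_loss, adv_loss, lam):
    """Equation 1: supervised loss plus lambda times the adversarial loss."""
    if not lam >= 0:
        raise ValueError("lambda must be non-negative")
    return sup_loss + lam * adv_loss


# Toy data (assumed for illustration): two labeled examples, K = 2 classes.
labels = [[1.0, 0.0], [0.0, 1.0]]
predictions = [[0.9, 0.1], [0.2, 0.8]]

l_s = supervised_loss(labels, predictions)
objective = total_objective(l_s, adv_loss=0.05, lam=1.0)
print(l_s, objective)
```
</preformat>
<p>In this sketch, setting <monospace>lam</monospace> to zero recovers plain supervised training, while larger values give the smoothness regularizer more weight relative to the prediction error.</p>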
<p>Many approaches are presented to construct <inline-formula><mml:math id="M11"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> based on the smoothness assumption, which can be generally represented in a framework as</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="double-struck"><mml:mtext>E</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x00393;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>x</italic> is sampled from the dataset <inline-formula><mml:math id="M13"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>, which consists of both labeled and unlabeled examples. <inline-formula><mml:math id="M14"><mml:mo>&#x00393;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is termed the smoothness function, which comprises a teacher model <bold>F</bold><sub>&#x003B8;</sub>(<italic>x</italic>; &#x003BE;) and a student model <inline-formula><mml:math id="M15"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. The teacher model is parameterized with parameter &#x003B8; and perturbation &#x003BE;, while the student model is parameterized with parameter &#x003B8;&#x02032; and perturbation &#x003BE;&#x02032;. The goal of <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is to improve the model&#x00027;s smoothness by forcing the student model to follow the teacher model. That is to say, the output distributions yielded by <inline-formula><mml:math id="M17"><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:math></inline-formula> are supposed to be consistent with the outputs derived by <bold>F</bold>. To this end, the teacher model, the student model, and the similarity measure must be carefully crafted to formulate an appropriate smoothness function against the perturbation of the input and the variance of the parameters. Based on the implementations of this smoothness function, some typical AT approaches can be explicitly defined.</p>
<p><bold>Random Adversarial Training:</bold> In RAT, random noise is introduced into the student model but not the teacher model, and the parameters of the student model are shared with the teacher model. Moreover, the <italic>L</italic><sub>2</sub> distance is used to measure the similarity of the output distributions derived by <inline-formula><mml:math id="M18"><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:math></inline-formula> and <bold>F</bold> over all training examples. That is, &#x003B8;&#x02032; &#x0003D; &#x003B8;, <inline-formula><mml:math id="M19"><mml:msup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, &#x003BE; &#x0003D; 0, and <inline-formula><mml:math id="M20"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>&#x022C3;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> for Equation 3.</p>
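<p>The RAT instantiation of Equation 3 described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear-softmax <monospace>classifier</monospace> is a hypothetical stand-in for <bold>F</bold><sub>&#x003B8;</sub>, the squared <italic>L</italic><sub>2</sub> distance is used as the similarity measure, and <monospace>noise_scale</monospace> is an added knob that rescales the standard-normal perturbation.</p>
<preformat preformat-type="code">
```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier(x, weights):
    """Hypothetical linear-softmax classifier; one weight row per class."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])


def squared_l2(p, q):
    """Squared L2 distance between two output distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def rat_loss(inputs, weights, noise_scale=0.1, rng=None):
    """Mean distance between clean (teacher) and noisy (student) outputs."""
    if rng is None:
        rng = random.Random(0)
    total = 0.0
    for x in inputs:
        teacher = classifier(x, weights)  # xi = 0: unperturbed input
        noisy_x = [v + noise_scale * rng.gauss(0.0, 1.0) for v in x]
        student = classifier(noisy_x, weights)  # theta' = theta, xi' ~ N(0, 1)
        total += squared_l2(teacher, student)
    return total / len(inputs)


# Toy setup (assumed): two classes, two-dimensional inputs, labels not needed.
weights = [[1.0, -1.0], [-0.5, 0.5]]
inputs = [[0.2, 0.4], [1.0, -1.0]]
print(rat_loss(inputs, weights, noise_scale=0.5))
```
</preformat>
<p>Because no labels enter <monospace>rat_loss</monospace>, the same function can be evaluated on labeled and unlabeled inputs alike, which matches the choice of the union of both datasets in the text.</p>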
<p><bold>Adversarial Training With</bold> <bold>&#x003A0;-Model:</bold> In contrast to RAT, the &#x003A0;-model introduces random noise to both the teacher model and the student model, i.e., <inline-formula><mml:math id="M21"><mml:msup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>&#x003BE;</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. The rationale is that the prediction for a natural example may itself be an outlier, so it is reasonable to make two noisy predictions learn from each other. In this case, optimizing the smoothness function for the &#x003A0;-model is equivalent to minimizing the prediction variance of the classifier (Luo et al., <xref ref-type="bibr" rid="B18">2018</xref>).</p>
<p><bold>Standard Adversarial Training:</bold> Instead of adding random noises to the teacher/student model, the perturbation adopted in SAT is some imperceptible noise that is carefully designed to fool the neural network. The <italic>adversarial loss</italic> <inline-formula><mml:math id="M22"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> of SAT can be written as</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="double-struck"><mml:mtext>E</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>&#x0007E;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mtext class="textrm" mathvariant="normal">KL</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>||</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mo>.</mml:mo><mml:mi>t</mml:mi><mml:mo>.</mml:mo><mml:mtext>&#x000A0;&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mo class="qopname">arg</mml:mo><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi><mml:mo>;</mml:mo><mml:mo>||</mml:mo><mml:mi>&#x003BE;</mml:mi><mml:mo>||</mml:mo><mml:mo>&#x02264;</mml:mo><mml:mi>&#x003B5;</mml:mi></mml:mrow></mml:munder><mml:mtext class="textrm" mathvariant="normal">KL</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>||</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where the operator KL(&#x000B7;||&#x000B7;) denotes the <italic>Kullback-Leibler (K-L) divergence</italic>, which is used as the similarity measure. &#x003BE;<sub><italic>adv</italic></sub> denotes the adversarial perturbation that is added to <italic>x</italic><sup><italic>l</italic></sup> so that the output distribution of the student model deviates as much as possible from <italic>y</italic><sup><italic>l</italic></sup>. &#x003B5; is a predefined constant that controls the perturbation strength. Note that the teacher model, in this case, degenerates into the one-hot vector of the true label. Generally, the exact adversarial direction &#x003BE;<sub><italic>adv</italic></sub> cannot be obtained in closed form. Hence, a linear approximation of the objective function is used to approximate the adversarial perturbation. For the &#x02113;<sub>&#x0221E;</sub> norm, the adversarial perturbation &#x003BE;<sub><italic>adv</italic></sub> can be efficiently approximated by using the well-known fast gradient sign method (FGSM) (Madry et al., <xref ref-type="bibr" rid="B20">2017</xref>). That is,</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02248;</mml:mo><mml:mi>&#x003B5;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mtext class="textrm" mathvariant="normal">sign</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mtext class="textrm" mathvariant="normal">KL</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>||</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>;</mml:mo><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Some alternative variants, such as the iterative gradient sign method (IGSM) (Tram&#x000E8;r et al., <xref ref-type="bibr" rid="B27">2017</xref>) and the momentum IGSM (M-IGSM) (Dong et al., <xref ref-type="bibr" rid="B7">2018</xref>), are also available to solve the objective function. By adding adversarial perturbations to the student model, SAT obtains better generalization performance than RAT and the &#x003A0;-model. Unfortunately, SAT can only be applied to supervised learning tasks, since it has to use labeled examples to compute the <italic>adversarial loss</italic>.</p>
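<p>As a rough illustration of Equation 5, the following sketch computes an FGSM perturbation for a hypothetical linear-softmax classifier that stands in for the student model. The weights <italic>W</italic>, bias <italic>b</italic>, input <italic>x</italic>, label, and &#x003B5; &#x0003D; 0.1 are illustrative values that do not come from this paper, and the gradient is written out analytically for this toy model rather than obtained by automatic differentiation.</p>
<preformat><![CDATA[
```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    # Hypothetical linear-softmax stand-in for the classifier.
    logits = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    return softmax(logits)

def loss(W, b, x, label):
    # With a one-hot target, KL(y || p) reduces to -log p[label].
    return -math.log(predict(W, b, x)[label])

def fgsm(W, b, x, label, eps):
    p = predict(W, b, x)
    # Gradient of the loss w.r.t. the logits is p - onehot(label).
    g_logits = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    # Chain rule through logits = W x + b gives the input gradient.
    g_x = [sum(W[k][j] * g_logits[k] for k in range(len(W))) for j in range(len(x))]
    sign = lambda v: (v > 0) - (v < 0)
    # Equation 5: eps times the sign of the input gradient.
    return [eps * sign(g) for g in g_x]

# Illustrative values only (assumptions, not taken from the paper).
W = [[1.0, -2.0], [-1.0, 0.5]]
b = [0.0, 0.0]
x = [0.5, -0.3]
label = 0
eps = 0.1

xi = fgsm(W, b, x, label, eps)
x_adv = [a + d for a, d in zip(x, xi)]
clean_loss = loss(W, b, x, label)
adv_loss = loss(W, b, x_adv, label)
```
]]></preformat>
<p>Under these assumptions, every coordinate of the perturbation has magnitude at most &#x003B5;, and because the toy loss is convex in the input, the loss on the perturbed example is larger than the loss on the natural example.</p>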
<p><bold>Virtual Adversarial Training:</bold> Different from SAT, the key idea of VAT is to define the <italic>adversarial loss</italic> based on the output distribution inferred on the unlabeled examples. In this regard, the <italic>adversarial loss</italic> <inline-formula><mml:math id="M25"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> of VAT can be written as</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M26"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="double-struck"><mml:mtext>E</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msup><mml:mo>&#x0222A;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mtext class="textrm" mathvariant="normal">KL</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>||</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mo>.</mml:mo><mml:mi>t</mml:mi><mml:mo>.</mml:mo><mml:mtext>&#x000A0;&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mo class="qopname">arg</mml:mo><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi><mml:mo>;</mml:mo><mml:mo>||</mml:mo><mml:mi>&#x003BE;</mml:mi><mml:mo>||</mml:mo><mml:mo>&#x02264;</mml:mo><mml:mi>&#x003B5;</mml:mi></mml:mrow></mml:munder><mml:mtext class="textrm" mathvariant="normal">KL</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>||</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>To obtain the adversarial perturbation &#x003BE;<sub><italic>adv</italic></sub>, Miyato et al. (<xref ref-type="bibr" rid="B21">2018</xref>) proposed to approximate the objective function with a second-order Taylor expansion at &#x003BE; &#x0003D; 0. That is,</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M27"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>d</mml:mi><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02248;</mml:mo><mml:munder><mml:mrow><mml:mo class="qopname">arg</mml:mo><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi><mml:mo>;</mml:mo><mml:mo>||</mml:mo><mml:mi>&#x003BE;</mml:mi><mml:mo>||</mml:mo><mml:mo>&#x02264;</mml:mo><mml:mi>&#x003B5;</mml:mi></mml:mrow></mml:munder><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>&#x003BE;</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>H</italic> is the Hessian matrix defined by <inline-formula><mml:math id="M28"><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x02207;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow></mml:msub><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">KL</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>;</mml:mo><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. This quadratic optimization is an eigenvalue problem that can be solved using the power iteration algorithm. Since VAT acquires the adversarial perturbation without label information, this method is applicable to both supervised and semi-supervised learning.</p>
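<p>To make the eigenvalue view of Equation 7 concrete, the sketch below runs power iteration on a small, explicitly given symmetric matrix and scales the resulting unit vector by &#x003B5;. The 2&#x000D7;2 matrix, the starting vector, the number of steps, and &#x003B5; &#x0003D; 0.5 are hypothetical, and the &#x02113;<sub>2</sub> norm is assumed for the constraint. Practical VAT implementations typically avoid forming <italic>H</italic> and approximate Hessian-vector products instead, which this toy example does not attempt.</p>
<preformat><![CDATA[
```python
import math

def matvec(H, v):
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def power_iteration(H, v0, steps=50):
    # Repeatedly apply H and renormalize; the iterate converges to the
    # dominant eigenvector when v0 has a component along it.
    v = normalize(v0)
    for _ in range(steps):
        v = normalize(matvec(H, v))
    return v

# Toy symmetric positive semi-definite matrix (assumption); eigenvalues 3 and 1.
H = [[2.0, 1.0], [1.0, 2.0]]
eps = 0.5

u = power_iteration(H, [1.0, 0.0])
# Under an l2 constraint, the maximizer of 0.5 * xi^T H xi is eps * u.
xi_adv = [eps * x for x in u]
quad = 0.5 * sum(a * b for a, b in zip(xi_adv, matvec(H, xi_adv)))
```
]]></preformat>
<p>For this matrix the iterate converges to the direction (1, 1)/&#x0221A;2, and the objective value equals 0.5 &#x000B7; &#x003B5;<sup>2</sup> &#x000B7; 3, i.e., half of &#x003B5;<sup>2</sup> times the largest eigenvalue.</p>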
</sec>
<sec id="s3">
<title>3. The Proposed Method</title>
<p>Adversarial training methods regularize the neural network by forcing the output distribution to be robust against adversarial examples. To obtain intentional perturbations, existing AT methods must explicitly define a smoothness function from which the perturbations are computed. Because the smoothness function is non-convex, the approximations used by existing AT methods usually fail to generate the worst-case perturbation. To tackle this problem, we propose a novel AT framework termed GAT for improving the smoothness of the neural network, where the worst-case perturbation of the input is generated by a generator. In the following sections, we construct our framework by answering two central questions: (1) how to formulate the loss function with the perturbation generator and target classifier and (2) how to effectively optimize this loss function during the training process.</p>
<sec>
<title>3.1. GAT Loss Based on Minimax Game</title>
<p>In our framework, two neural networks are considered, i.e., the target classifier <bold>T</bold><sub>&#x003B8;</sub>(<italic>x</italic>) parameterized by &#x003B8; and the perturbation generator <bold>G</bold><sub>&#x003C6;</sub>(<italic>x</italic>) parameterized by &#x003C6;. The target classifier is the model that we ultimately want to obtain. The perturbation generator is an auto-encoder-like neural network. Specifically, the perturbation generator can be defined as a mapping <bold>G</bold><sub>&#x003C6;</sub>:<inline-formula><mml:math id="M29"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow></mml:math></inline-formula>, which takes a natural example in <inline-formula><mml:math id="M30"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow></mml:math></inline-formula> and transforms it into an imperceptible perturbation in the same space <inline-formula><mml:math id="M31"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow></mml:math></inline-formula>. For the &#x02113;<sub>&#x0221E;</sub> norm, the imperceptibility constraint can be represented as</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M32"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo>&#x02200;</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mo>||</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>G</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo>||</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x0221E;</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:mi>&#x003B5;</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B5; is the perturbation bound that controls the adversarial strength. To implement the constraint in Equation 8, the activation function of the last layer in <bold>G</bold><sub>&#x003C6;</sub> is defined as &#x003B5;&#x000B7;<italic>tanh</italic>(&#x000B7;). The generated perturbation is then added to the corresponding natural example to form an adversarial example.</p>
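<p>The effect of the &#x003B5;&#x000B7;<italic>tanh</italic>(&#x000B7;) output activation can be checked with a few lines of code. In this minimal sketch, the list <italic>raw</italic> plays the role of the pre-activation output of the generator&#x00027;s last layer; its values, the example <italic>x</italic>, and &#x003B5; &#x0003D; 8/255 are hypothetical and are not settings reported in this paper.</p>
<preformat><![CDATA[
```python
import math

def bounded_perturbation(pre_activation, eps):
    # tanh maps every real number into [-1, 1], so eps * tanh(z)
    # always satisfies the l-infinity constraint of Equation 8.
    return [eps * math.tanh(z) for z in pre_activation]

# Hypothetical values for illustration only.
eps = 8 / 255
raw = [-50.0, -1.0, 0.0, 0.3, 50.0]   # stand-in for the last-layer pre-activations
x = [0.2, 0.4, 0.6, 0.8, 1.0]         # stand-in for a natural example

xi = bounded_perturbation(raw, eps)
x_adv = [a + d for a, d in zip(x, xi)]  # adversarial example G(x) + x
linf = max(abs(d) for d in xi)
```
]]></preformat>
<p>However large the pre-activations become, the &#x02113;<sub>&#x0221E;</sub> norm of the perturbation never exceeds &#x003B5;, so no explicit projection or clipping of the perturbation is needed in this sketch.</p>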
<p>The goal of <bold>G</bold><sub>&#x003C6;</sub> is to find the perturbation that moves the output of the target classifier furthest from its current prediction, while the goal of <bold>T</bold><sub>&#x003B8;</sub>(<italic>x</italic>) is to minimize both the prediction error on the natural example and the deviation caused by such a perturbation. This problem can be formulated as a minimax game whose loss function is</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M33"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:munder><mml:munder><mml:mrow><mml:mi>max</mml:mi></mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:munder><mml:mtext>&#x000A0;&#x000A0;</mml:mtext><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>l</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mi>l</mml:mi></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>~</mml:mo><mml:msup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>l</mml:mi></mml:msup></mml:mrow></mml:msub><mml:msub><mml:mo>&#x00393;</mml:mo><mml:mi>S</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mi>l</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>l</mml:mi></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;</mml:mtext><mml:mo>+</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mo>~</mml:mo><mml:msup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>l</mml:mi></mml:msup><mml:mo>&#x0222A;</mml:mo><mml:msup><mml:mi mathvariant='script'>D</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:msub><mml:mo>&#x00393;</mml:mo><mml:mi>R</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>G</mml:mtext></mml:mstyle><mml:mi>&#x003C6;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mi>x</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Equation 9 is referred to as the GAT loss, which comprises a <italic>supervised loss</italic> <inline-formula><mml:math id="M34"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and an <italic>adversarial loss</italic> <inline-formula><mml:math id="M35"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. <inline-formula><mml:math id="M36"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is determined by labeled examples, while <inline-formula><mml:math id="M37"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is independent of the labels and serves as a regularization term that smooths the model. The parameter &#x003BB; controls the balance between <inline-formula><mml:math id="M38"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M39"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. In the maximization and minimization loops of the minimax game, &#x003C6; and &#x003B8; are the respective parameters to be optimized. Since <inline-formula><mml:math id="M40"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is defined over the whole data set, our method is applicable to semi-supervised learning. Note that for the <italic>adversarial loss</italic>, the target classifier <bold>T</bold><sub>&#x003B8;</sub>(<italic>x</italic>) acts as the teacher model, while the composite function <bold>T</bold><sub>&#x003B8;</sub>(<bold>G</bold><sub>&#x003C6;</sub>(<italic>x</italic>)&#x0002B;<italic>x</italic>) serves as the student model.</p>
<p>In addition, the operators &#x00393;<sub><italic>S</italic></sub>(&#x000B7;, &#x000B7;) and &#x00393;<sub><italic>R</italic></sub>(&#x000B7;, &#x000B7;) are the similarity measures for <inline-formula><mml:math id="M41"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M42"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, respectively. Here, &#x00393;<sub><italic>R</italic></sub> is crucial for the construction of the <italic>adversarial loss</italic>. Instead of using <italic>K-L divergence</italic> to define the <italic>adversarial loss</italic> as VAT and SAT do, we use <italic>cross entropy</italic> to formulate the <italic>adversarial loss</italic> function. This choice has two benefits. First, <italic>cross entropy</italic> overcomes the zero-avoiding problem inherent to the <italic>K-L divergence</italic> (Bishop and Nasser, <xref ref-type="bibr" rid="B1">2006</xref>). Second, since <italic>cross entropy</italic> can be represented as the sum of <italic>K-L divergence</italic> and <italic>information entropy</italic>, <inline-formula><mml:math id="M43"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> reflects not only the deviation of the output distributions but also the confidence of the prediction of the target classifier. In particular, by substituting &#x00393;<sub><italic>R</italic></sub> with <italic>cross entropy</italic> in Equation 9, <inline-formula><mml:math id="M44"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in the GAT loss can be rewritten as</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M45"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">CE</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>T</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>T</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>G</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext class="textrm" mathvariant="normal">KL</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>T</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>||</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>T</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>G</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mtext class="textrm" mathvariant="normal">H</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>T</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where the operators CE(&#x000B7;, &#x000B7;) and H(&#x000B7;) denote <italic>cross entropy</italic> and <italic>information entropy</italic>, respectively. In Equation 10, KL(<bold>T</bold><sub>&#x003B8;</sub>(<italic>x</italic>)||<bold>T</bold><sub>&#x003B8;</sub>(<bold>G</bold><sub>&#x003C6;</sub>(<italic>x</italic>)&#x0002B;<italic>x</italic>)) is termed the smoothness term, which reflects the deviation of the output distributions, while H(<bold>T</bold><sub>&#x003B8;</sub>(<italic>x</italic>)) is termed the confidence term, which indicates the confidence of the output distribution. Moreover, the confidence term is independent of the parameter &#x003C6;. Hence, for the maximization loop of the minimax game, maximizing <inline-formula><mml:math id="M46"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> requires maximizing the smoothness term only, whereas for the minimization loop, minimizing <inline-formula><mml:math id="M47"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> requires minimizing both the smoothness term and the confidence term. Note that minimizing the confidence term boosts the prediction confidence of the neural network. Thus, our <italic>adversarial loss</italic> has the effect of entropy minimization proposed in Grandvalet and Bengio (<xref ref-type="bibr" rid="B12">2004</xref>) and Sajjadi et al. (<xref ref-type="bibr" rid="B2">2016</xref>).</p>
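<p>The decomposition in Equation 10 can be verified numerically. The sketch below uses two hypothetical three-class output distributions, one standing in for <bold>T</bold><sub>&#x003B8;</sub>(<italic>x</italic>) and one for <bold>T</bold><sub>&#x003B8;</sub>(<bold>G</bold><sub>&#x003C6;</sub>(<italic>x</italic>)&#x0002B;<italic>x</italic>); the probabilities are made up for illustration and natural logarithms are assumed.</p>
<preformat><![CDATA[
```python
import math

def entropy(p):
    # Confidence term H(p); low entropy means a confident prediction.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # CE(p, q) with p as the teacher (clean) distribution.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    # Smoothness term KL(p || q).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical classifier outputs (assumptions, not results from the paper).
p_clean = [0.7, 0.2, 0.1]   # stand-in for T(x)
q_adv = [0.5, 0.3, 0.2]     # stand-in for T(G(x) + x)

ce = cross_entropy(p_clean, q_adv)
smoothness = kl(p_clean, q_adv)
confidence = entropy(p_clean)
```
]]></preformat>
<p>The cross entropy equals the sum of the smoothness and confidence terms up to floating-point error. Only the smoothness term involves the perturbed output, which is why the generator&#x00027;s update affects that term alone.</p>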
</sec>
<sec>
<title>3.2. Alternating Update Process Based on Trajectory Preserving</title>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> depicts the framework of GAT, in which two neural networks are optimized, i.e., the target classifier <inline-formula><mml:math id="M48"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow></mml:math></inline-formula> and the perturbation generator <inline-formula><mml:math id="M49"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula>. <inline-formula><mml:math id="M50"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> takes a natural example <italic>x</italic> from the full dataset, which comprises both labeled and unlabeled examples, and generates a perturbation <bold>G</bold><sub>&#x003C6;</sub>(<italic>x</italic>). Then, <bold>G</bold><sub>&#x003C6;</sub>(<italic>x</italic>) is added to <italic>x</italic> to form an adversarial example. Both the adversarial example and its corresponding natural example are fed into <inline-formula><mml:math id="M51"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow></mml:math></inline-formula> to construct the <italic>adversarial loss</italic> <inline-formula><mml:math id="M52"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Meanwhile, a labeled example <italic>x</italic><sup><italic>l</italic></sup> sampled from the labeled dataset is fed into <inline-formula><mml:math id="M53"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow></mml:math></inline-formula> to formulate the <italic>supervised loss</italic> <inline-formula><mml:math id="M54"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The overall framework of Generative AT (GAT).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-859610-g0001.tif"/>
</fig>
<p>The objective of our framework is to find stable &#x003B8; and &#x003C6; such that <inline-formula><mml:math id="M55"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> maximizes the GAT loss for a fixed &#x003B8;, while <inline-formula><mml:math id="M56"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow></mml:math></inline-formula> minimizes the GAT loss for a fixed &#x003C6;. Because the perturbation is subject to a non-linear constraint and the loss function is non-convex, this optimization problem is very challenging. Inspired by the training pattern of GAN (Goodfellow et al., <xref ref-type="bibr" rid="B10">2014a</xref>) and some common tricks in reinforcement learning (Mnih et al., <xref ref-type="bibr" rid="B22">2015</xref>), we propose to optimize the GAT loss by an alternating update procedure and to stabilize this procedure by trajectory preserving.</p>
<p>First, we decompose the minimax optimization problem into an inner loop and an outer loop. The inner loop aims to derive an optimal &#x003C6; that maximizes the loss, while the outer loop aims to obtain an optimal &#x003B8; that minimizes the loss. Because the <italic>supervised loss</italic> does not depend on the parameter &#x003C6;, the optimal &#x003C6; of <inline-formula><mml:math id="M63"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula> under a fixed &#x003B8; can be written as Equation 11. Meanwhile, the optimal &#x003B8; of <inline-formula><mml:math id="M64"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow></mml:math></inline-formula> under a fixed &#x003C6; can be represented as Equation 12.</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M65"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x003C6;</mml:mi><mml:mo>=</mml:mo><mml:mtext>arg&#x000A0;</mml:mtext><mml:munder><mml:mrow><mml:mi>max&#x000A0;</mml:mi></mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:munder><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mo>~</mml:mo><mml:msup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>l</mml:mi></mml:msup><mml:mo>&#x0222A;</mml:mo><mml:msup><mml:mi mathvariant='script'>D</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mtext>CE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>G</mml:mtext></mml:mstyle><mml:mi>&#x003C6;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E12"><label>(12)</label><mml:math id="M66"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B8;</mml:mi><mml:mo>=</mml:mo><mml:mtext>arg</mml:mtext><mml:munder><mml:mrow><mml:mi>min</mml:mi></mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:munder><mml:mtext>&#x000A0;&#x000A0;</mml:mtext><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>l</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>y</mml:mi><mml:mi>l</mml:mi></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>~</mml:mo><mml:msup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>l</mml:mi></mml:msup></mml:mrow></mml:msub><mml:mtext>CE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>y</mml:mi><mml:mi>l</mml:mi></mml:msup><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>l</mml:mi></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>&#x003BB;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mi mathvariant='double-struck'>E</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mo>~</mml:mo><mml:msup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>l</mml:mi></mml:msup><mml:mo>&#x0222A;</mml:mo><mml:msup><mml:mi mathvariant='script'>D</mml:mi><mml:mrow><mml:mi>u</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mtext>CE</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>G</mml:mtext></mml:mstyle><mml:mi>&#x003C6;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Second, since the perturbation generator and the target classifier are assumed to be neural networks, the parameters &#x003B8; and &#x003C6; in Equations 11 and 12 can be calculated by stochastic-gradient-based methods (Liu et al., <xref ref-type="bibr" rid="B17">2021</xref>; Jin et al., <xref ref-type="bibr" rid="B13">2022</xref>). A traditional solution to this minimax problem is to alternately update &#x003C6; by gradient ascent over the full dataset and &#x003B8; by gradient descent over the labeled dataset. However, since the number of labeled training examples is small, neither &#x003C6; nor &#x003B8; converges easily in practice. We develop a trajectory preserving strategy to tackle this problem. In each alternating epoch, we update &#x003C6; using gradient ascent and record its update trajectory. Then, based on this trajectory, we retrieve the intermediate parameter &#x003C6;&#x02032; by executing a pseudo-update procedure for &#x003C6;. Finally, we update &#x003B8; by gradient descent under the given &#x003C6;&#x02032;.</p>
<table-wrap position="float" id="T4"> 
<label>Algorithm 1</label>
<caption><p>Trajectory preserving training process.</p></caption>
<graphic xlink:href="fnbot-16-859610-i0001.tif"/>
</table-wrap>
<p>The implementation details of the proposed trajectory preserving training procedure are given in <xref ref-type="table" rid="T4">Algorithm 1</xref>, where <italic>E</italic> is the number of training epochs and <italic>T</italic> is the maximum number of iterations in each epoch. Equations 13 and 14 represent the update and the pseudo-update of &#x003C6; by gradient ascent. Equation 15 describes the update of &#x003B8; by gradient descent. &#x003B1;<sub><italic>g</italic></sub> and &#x003B1;<sub><italic>t</italic></sub> are the learning rates of the perturbation generator and the target classifier, respectively.</p>
<disp-formula id="E13"><label>(13)</label><mml:math id="M67"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B1;</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:msub><mml:mo>&#x02207;</mml:mo><mml:mrow><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mfrac><mml:mn>1</mml:mn><mml:mi>M</mml:mi></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>M</mml:mi></mml:munderover><mml:mrow><mml:mtext>CE</mml:mtext></mml:mrow></mml:mstyle><mml:mrow><mml:mo>(</mml:mo> <mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:mrow><mml:mrow> <mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mtext>G</mml:mtext></mml:mstyle><mml:mrow><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E14"><label>(14)</label><mml:math id="M68"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B1;</mml:mi><mml:mi>g</mml:mi></mml:msub><mml:msub><mml:mo>&#x02207;</mml:mo><mml:mrow><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mfrac><mml:mn>1</mml:mn><mml:mi>M</mml:mi></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>M</mml:mi></mml:munderover><mml:mrow><mml:mtext>CE</mml:mtext></mml:mrow></mml:mstyle><mml:mrow><mml:mo>(</mml:mo> <mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:mrow><mml:mrow> <mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mtext>G</mml:mtext></mml:mstyle><mml:mrow><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E15"><label>(15)</label><mml:math id="M69"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B8;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>&#x003B1;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:msub><mml:mo>&#x02207;</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mtext>CE</mml:mtext></mml:mrow></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>j</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo></mml:mrow> </mml:mrow><mml:mfrac><mml:mi>&#x003BB;</mml:mi><mml:mi>M</mml:mi></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>M</mml:mi></mml:munderover><mml:mrow><mml:mtext>CE</mml:mtext></mml:mrow></mml:mstyle><mml:mrow><mml:mo>(</mml:mo> <mml:mrow><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mrow><mml:mrow><mml:mrow> <mml:mrow><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>T</mml:mtext></mml:mstyle><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mtext>G</mml:mtext></mml:mstyle><mml:msup><mml:mi>&#x003C6;</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>t</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>)</mml:mo></mml:mrow></mml:mrow> <mml:mo>}</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
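<p>To make the update order concrete, the following minimal sketch applies the schedule of Equations 13&#x02013;15 to a toy scalar problem. It is an illustration under stated assumptions rather than the GAT implementation: the quadratic stand-in losses, the names <monospace>grad_phi</monospace>, <monospace>grad_theta</monospace>, and <monospace>train</monospace>, and the step sizes are hypothetical, mini-batches are omitted, and the ordering (record the trajectory of &#x003C6; with &#x003B8; frozen, then replay it with a pseudo-update before each &#x003B8; step) is one possible reading of the procedure described above.</p>
<preformat><![CDATA[
```python
# Toy stand-ins (assumptions for illustration, not the losses of the paper):
#   supervised loss   L_S(theta)      = (theta - 1)^2
#   adversarial loss  L_R(theta, phi) = theta * phi - 0.5 * phi^2
LAMBDA = 1.0


def grad_phi(theta, phi):
    """Gradient of L_R with respect to phi (ascent direction for the generator)."""
    return theta - phi


def grad_theta(theta, phi):
    """Gradient of L_S + LAMBDA * L_R with respect to theta, with phi held fixed."""
    return 2.0 * (theta - 1.0) + LAMBDA * phi


def train(theta, phi, epochs=50, steps=5, alpha_g=0.1, alpha_t=0.1):
    for _ in range(epochs):
        # Eq. 13: ascend phi for `steps` iterations while theta stays frozen,
        # recording every visited phi^(t) as the trajectory.
        trajectory = []
        for _ in range(steps):
            trajectory.append(phi)
            phi = phi + alpha_g * grad_phi(theta, phi)
        # Replay the recorded trajectory to update theta.
        for phi_t in trajectory:
            # Eq. 14: pseudo-update from the recorded phi^(t); phi is not modified.
            phi_prime = phi_t + alpha_g * grad_phi(theta, phi_t)
            # Eq. 15: descend theta against the perturbation induced by phi'.
            theta = theta - alpha_t * grad_theta(theta, phi_prime)
    return theta, phi


theta_final, phi_final = train(theta=0.0, phi=0.0)
print(theta_final, phi_final)
```
]]></preformat>
<p>For these stand-in losses the saddle point is &#x003B8; &#x0003D; &#x003C6; &#x0003D; 2/3, and the loop approaches it from the initial point (0, 0).</p>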
</sec>
</sec>
<sec id="s4">
<title>4. Experiments</title>
<p>To validate the performance of our method on supervised and semi-supervised tasks, we carried out experiments on synthetic datasets and practical benchmarks and compared it with various strong competitors.</p>
<sec>
<title>4.1. Supervised Learning on a Synthetic Dataset</title>
<p>This section tests the supervised learning performance of our method on binary classification problems using two well-known synthetic datasets, i.e., the &#x0201C;Moons&#x0201D; dataset (termed <italic>M</italic>-dataset) and the &#x0201C;Circles&#x0201D; dataset (termed <italic>C</italic>-dataset). The data points in the two datasets are sampled uniformly from two trajectories in <italic>R</italic><sup>2</sup> and embedded linearly into a 100-dimensional vector space. Each dataset contains 16 training points and 1,000 test points. <bold>Figures 4</bold>, <bold>5</bold> visualize the <italic>M</italic>-dataset and <italic>C</italic>-dataset, where the red circles and blue triangles stand for the training examples with labels 1 and 0, respectively. The target classifier used in this experiment is a neural network with one hidden layer of 100 units, where ReLU and softmax activation functions are applied to the hidden units and output units, respectively. We compare our method with some popular AT methods, such as SAT (Goodfellow et al., <xref ref-type="bibr" rid="B11">2014b</xref>), RAT (Zheng et al., <xref ref-type="bibr" rid="B36">2016</xref>), and VAT (Miyato et al., <xref ref-type="bibr" rid="B21">2018</xref>). These AT methods and the proposed GAT are run with &#x003BB; &#x0003D; 1 and &#x003F5; &#x0003D; 0.2. The perturbation generator in our method has three hidden layers with 128, 64, and 128 units, respectively.</p>
<p>Since the number of training examples is extremely small compared to the input dimension, the target classifier is very vulnerable to overfitting. <xref ref-type="fig" rid="F2">Figures 2A,B</xref> depict the evolution of the accuracy rates of the target classifier with the GAT regularization and without it (termed Plain NN). The training accuracy of both Plain NN and GAT reaches 100% on the two datasets. Nevertheless, the test accuracy rate of GAT is noticeably higher than that of Plain NN. Although the accuracy rate of our method fluctuates at the initial stage of training, its test accuracy rate stabilizes after a few iterations, thanks to the trajectory preserving training strategy. <xref ref-type="fig" rid="F3">Figure 3</xref> visualizes the output distributions of the trained target classifier on the <italic>M</italic>-dataset and <italic>C</italic>-dataset with our method and Plain NN. Compared to Plain NN, GAT produces flatter regions in the landscape of the output distribution. This indicates that our method promotes the smoothness of the model, in the sense that flat surfaces of the landscape imply small deviations of the output.</p>
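<p>As a rough illustration of how such a dataset could be built, the sketch below samples a &#x0201C;Moons&#x0201D;-style set in two dimensions and embeds it linearly into 100 dimensions. The curve parameterization, the balanced split of 8 points per class, the Gaussian embedding matrix <monospace>A</monospace>, and the helper names <monospace>sample_moons</monospace> and <monospace>linear_embed</monospace> are assumptions made for this example and may differ from the construction used in the experiments.</p>
<preformat><![CDATA[
```python
import math
import random


def sample_moons(n_per_class, rng):
    """Sample points uniformly in angle along two interleaving half-circles in R^2."""
    points, labels = [], []
    for _ in range(n_per_class):
        t = rng.uniform(0.0, math.pi)
        points.append((math.cos(t), math.sin(t)))              # upper moon, label 1
        labels.append(1)
        t = rng.uniform(0.0, math.pi)
        points.append((1.0 - math.cos(t), 0.5 - math.sin(t)))  # lower moon, label 0
        labels.append(0)
    return points, labels


def linear_embed(points, matrix):
    """Map each 2-D point (x, y) to the vector A (x, y)^T, where A has shape (dim, 2)."""
    return [[row[0] * x + row[1] * y for row in matrix] for x, y in points]


rng = random.Random(0)
# Hypothetical embedding matrix A (100 x 2) with standard Gaussian entries.
A = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(100)]

train_2d, train_y = sample_moons(8, rng)    # 16 training points (8 per class, assumed)
test_2d, test_y = sample_moons(500, rng)    # 1,000 test points
train_x = linear_embed(train_2d, A)
test_x = linear_embed(test_2d, A)
print(len(train_x), len(train_x[0]), len(test_x))
```
]]></preformat>
<p>Because the embedding is linear, the 100-dimensional points still lie on a two-dimensional subspace, while the classifier sees far more input dimensions than training points.</p>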
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>The transition curves of accuracy rates by Plain NN and the proposed GAT on <italic>M</italic>-dataset and <italic>C</italic>-dataset. <bold>(A)</bold> Plots the results for <italic>M</italic>-dataset, <bold>(B)</bold> plots the results for <italic>C</italic>-dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-859610-g0002.tif"/>
</fig>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The visualization of model distributions of GAT and Plain NN on the synthetic datasets. <bold>(A,B)</bold> Show the distribution surface on <italic>M</italic>-dataset, <bold>(C,D)</bold> show the distribution surface on <italic>C</italic>-dataset, where flat surface regions imply small output deviations.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-859610-g0003.tif"/>
</fig>
<p>Moreover, we plot the contours of the target classifier&#x00027;s predictions for label 1 on the two synthetic datasets under various regularization methods. In <xref ref-type="fig" rid="F4">Figures 4</xref>, <xref ref-type="fig" rid="F5">5</xref>, the black line in each plot is the contour of value 0.5, which is usually used as the decision boundary for binary classification tasks. These figures show that the <italic>L</italic><sub>2</sub> regularization method fails to find a correct decision boundary on either the <italic>M</italic>-dataset or the <italic>C</italic>-dataset and therefore produces many false predictions. RAT obtains a convincing decision boundary for the <italic>M</italic>-dataset but an unreasonable one for the <italic>C</italic>-dataset. Among these methods, only SAT, VAT, and our method yield suitable decision boundaries for both the <italic>M</italic>-dataset and <italic>C</italic>-dataset, because these methods smooth the classifier anisotropically. Compared to RAT and VAT, the contours of our method for different confidence values are more compact. This indicates that our method provides more confident predictions for new instances, thanks to the <italic>cross entropy</italic> measure used for the adversarial loss. Our method also achieves the highest test accuracy rate among its competitors on both the <italic>M</italic>-dataset and <italic>C</italic>-dataset.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>The contour of output confidences for label 1 on <italic>M</italic>-dataset with various regularization methods. The red circles and blue triangles represent the data points with labels 1 and 0, respectively. The decision boundaries with different confidences are plotted with different colored contours. Note that the black line represents the contour of probability value 0.5, which usually serves as the decision boundary for the binary classification task. The accuracy rate of each method on the test examples is displayed above the panel.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-859610-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>The contour of output confidence for label 1 on <italic>C</italic>-dataset with various regularization methods. See the caption of <xref ref-type="fig" rid="F4">Figure 4</xref> for a detailed description of the plot elements.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-859610-g0005.tif"/>
</fig>
</sec>
<sec>
<title>4.2. Supervised Learning on the Benchmark Dataset</title>
<p>In this section, we evaluate the performance of our method on the MNIST dataset in a supervised learning scenario. The original 60,000 training examples are split into 50,000 training examples and 10,000 test examples. The target classifier is made up of four hidden dense layers with 1200, 600, 300, and 150 units, respectively. The input dimension of the target classifier is 784 and the output dimension is 10. For each method, we train the neural network with the hyper-parameter setting that exhibits the best performance on the test dataset and record its test error. The perturbation generator in our method consists of four hidden layers with 1200, 600, 300, and 600 units, respectively. The control parameters of the methods in our implementations are set to &#x003BB; &#x0003D; 1 and &#x003F5; &#x0003D; 0.2. We compare our method with some typical AT methods on the MNIST dataset for the supervised learning task. To verify the capability of the trajectory preserving strategy, we also conducted an ablation experiment with GAT-woTP, a method that uses the proposed GAT framework but Without the Trajectory Preserving strategy during training. The test error rates of these methods are reported in <xref ref-type="table" rid="T1">Table 1</xref>. The experimental results demonstrate that our method surpasses the previous state-of-the-art AT methods by a large margin. Moreover, our method also outperforms advanced generation-based algorithms such as Ladder network and CatGAN. Note also that the error rate obtained by our method is much lower than that of GAT-woTP. This is because the trajectory preserving strategy helps to stabilize the training process. Without this strategy, GAT usually struggles to reach a favorable convergence point during training.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Test error rates of various regularization methods for the supervised learning task on the MNIST dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>Test error rate (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SVM (gaussian kernel)</td>
<td valign="top" align="center">1.40</td>
</tr>
<tr>
<td valign="top" align="left">Dropout</td>
<td valign="top" align="center">1.05</td>
</tr>
<tr>
<td valign="top" align="left">Maxout networks</td>
<td valign="top" align="center">0.94</td>
</tr>
<tr>
<td valign="top" align="left">DBM</td>
<td valign="top" align="center">0.79</td>
</tr>
<tr>
<td valign="top" align="left">Ladder network<sup>&#x02020;</sup></td>
<td valign="top" align="center">0.57</td>
</tr>
<tr>
<td valign="top" align="left">Conv-CatGAN<sup>&#x02020;</sup></td>
<td valign="top" align="center">0.48</td>
</tr>
<tr>
<td valign="top" align="left">Plain NN (Baseline)</td>
<td valign="top" align="center">1.15</td>
</tr>
<tr>
<td valign="top" align="left">RAT</td>
<td valign="top" align="center">0.85</td>
</tr>
<tr>
<td valign="top" align="left">SAT (<italic>L</italic><sub>&#x0221E;</sub>)</td>
<td valign="top" align="center">0.78</td>
</tr>
<tr>
<td valign="top" align="left">VAT</td>
<td valign="top" align="center">0.66</td>
</tr>
<tr>
<td valign="top" align="left">GAT-woTP</td>
<td valign="top" align="center">0.65</td>
</tr>
<tr>
<td valign="top" align="left">GAT (Our method)</td>
<td valign="top" align="center">0.45</td>
</tr>
</tbody>
</table><table-wrap-foot> 
<p><italic>The upper panel lists the experimental results reported in prior work; the error rates in the bottom panel are obtained from our implementations. <sup>&#x02020;</sup>Denotes the generation-based methods</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>4.3. Semi-supervised Learning on Benchmark Dataset</title>
<p>This section validates the effectiveness of our method for semi-supervised learning tasks on three popular benchmarks: MNIST, SVHN, and CIFAR-10. Following the experimental setup of Miyato et al. (<xref ref-type="bibr" rid="B21">2018</xref>), we hold out a test dataset of fixed size 1,000 from the training examples and train the classifier under four sizes of the labeled dataset, i.e., <italic>N</italic><sub><italic>l</italic></sub> &#x02208; {100, 600, 1000, 3000}, where <italic>N</italic><sub><italic>l</italic></sub> is the size of the labeled dataset. The remaining training examples serve as unlabeled examples. Then, we record the test errors under different values of <italic>N</italic><sub><italic>l</italic></sub>. For our method, we use a mini-batch of size 64 to calculate the <italic>supervised loss</italic> (the first term of Equation 12) and a mini-batch of size 256 to calculate the <italic>adversarial loss</italic> (Equation 11). The control parameters of the methods in our implementations are set to &#x003BB; &#x0003D; 1 and &#x003F5; &#x0003D; 0.2. To test the performance of the trajectory preserving strategy for semi-supervised learning, we conduct several ablation experiments with GAT-woTP, which is described in Section 4.2. Because SAT can only be applied to supervised learning tasks, its results are not reported in these experiments.</p>
<p>For the MNIST dataset, the structures of the target classifier and perturbation generator are identical to those employed in Section 4.2. <xref ref-type="table" rid="T2">Table 2</xref> lists the test error rates of the compared semi-supervised learning methods for different values of <italic>N</italic><sub><italic>l</italic></sub> on MNIST. The experimental results show that our method achieves the lowest error rates among all the methods for different numbers of labeled examples. Moreover, our method significantly outperforms the state-of-the-art AT methods when the number of labeled examples is small. For the experiments on SVHN and CIFAR-10, two types of convolutional neural networks (CNNs), named &#x0201C;Small&#x0201D; (Salimans et al., <xref ref-type="bibr" rid="B24">2016</xref>) and &#x0201C;Large&#x0201D; (Laine and Aila, <xref ref-type="bibr" rid="B16">2018</xref>), are employed as the target classifiers. More details about the settings and structures of the two CNNs can be found in Miyato et al. (<xref ref-type="bibr" rid="B21">2018</xref>). The structure of the perturbation generator in this experiment is the same as the one used in the experiment on the MNIST dataset. The performance of the compared methods on SVHN and CIFAR-10 is reported in <xref ref-type="table" rid="T3">Table 3</xref>. The table shows that GAT obtains the best generalization capability on the SVHN dataset and achieves performance comparable to state-of-the-art generation-based methods such as TNAR-VAE on the CIFAR-10 dataset. In addition, GAT reaches lower error rates than GAT-woTP on all three benchmarks, which confirms that the trajectory preserving strategy stabilizes the training of our proposal.</p>
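<p>The data partition described above can be sketched as follows for a training set of 60,000 examples, the size of the MNIST training set. The helper <monospace>split_indices</monospace>, the fixed seed, and the uniformly random selection of labeled examples are assumptions made for this illustration; the actual experiments may select labeled examples differently.</p>
<preformat><![CDATA[
```python
import random


def split_indices(n_total, n_test, n_labeled, seed=0):
    """Partition range(n_total) into disjoint test, labeled, and unlabeled index lists."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    test = indices[:n_test]                        # held-out test examples
    labeled = indices[n_test:n_test + n_labeled]   # N_l labeled examples
    unlabeled = indices[n_test + n_labeled:]       # all remaining examples
    return test, labeled, unlabeled


sizes = {}
for n_labeled in (100, 600, 1000, 3000):
    test, labeled, unlabeled = split_indices(60000, 1000, n_labeled)
    sizes[n_labeled] = (len(test), len(labeled), len(unlabeled))
    print(n_labeled, sizes[n_labeled])
```
]]></preformat>
<p>Each setting keeps the same 1,000 test indices for a given seed, so only the boundary between labeled and unlabeled examples changes with <italic>N</italic><sub><italic>l</italic></sub>.</p>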
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Test error rates of semi-supervised learning methods on the MNIST dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>Test error rate (%)</bold></th>
</tr>
<tr style="border-bottom: thin solid #000000;">
<th/>
<th valign="top" align="center"><bold><italic>N</italic><sub><italic>l</italic></sub> &#x0003D; 100</bold></th>
<th valign="top" align="center"><bold><italic>N</italic><sub><italic>l</italic></sub> &#x0003D; 600</bold></th>
<th valign="top" align="center"><bold><italic>N</italic><sub><italic>l</italic></sub> &#x0003D; 1,000</bold></th>
<th valign="top" align="center"><bold><italic>N</italic><sub><italic>l</italic></sub> &#x0003D; 3,000</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SVM</td>
<td valign="top" align="center">23.44</td>
<td valign="top" align="center">8.85</td>
<td valign="top" align="center">7.77</td>
<td valign="top" align="center">4.21</td>
</tr>
<tr>
<td valign="top" align="left">EmbedNN</td>
<td valign="top" align="center">16.9</td>
<td valign="top" align="center">5.97</td>
<td valign="top" align="center">5.73</td>
<td valign="top" align="center">3.59</td>
</tr>
<tr>
<td valign="top" align="left">PEA</td>
<td valign="top" align="center">10.79</td>
<td valign="top" align="center">2.44</td>
<td valign="top" align="center">2.23</td>
<td valign="top" align="center">1.91</td>
</tr>
<tr>
<td valign="top" align="left">Conv-CatGAN<sup>&#x02020;</sup></td>
<td valign="top" align="center">1.93(&#x000B1;0.01)</td>
<td valign="top" align="center">1.86(&#x000B1;0.11)</td>
<td valign="top" align="center">1.73(&#x000B1;0.18)</td>
<td valign="top" align="center">1.67(&#x000B1;0.12)</td>
</tr>
<tr>
<td valign="top" align="left">Ladder networks<sup>&#x02020;</sup></td>
<td valign="top" align="center">1.06(&#x000B1;0.37)</td>
<td valign="top" align="center">0.93(&#x000B1;0.07)</td>
<td valign="top" align="center">0.84(&#x000B1;0.08)</td>
<td valign="top" align="center">0.79(&#x000B1;0.09)</td>
</tr>
<tr>
<td valign="top" align="left">Auxiliary DGM<sup>&#x02020;</sup></td>
<td valign="top" align="center">0.96(&#x000B1;0.02)</td>
<td valign="top" align="center">0.90(&#x000B1;0.05)</td>
<td valign="top" align="center">0.86(&#x000B1;0.13)</td>
<td valign="top" align="center">0.78(&#x000B1;0.05)</td>
</tr>
<tr>
<td valign="top" align="left">RAT</td>
<td valign="top" align="center">6.62(&#x000B1;1.02)</td>
<td valign="top" align="center">3.75(&#x000B1;0.14)</td>
<td valign="top" align="center">1.61(&#x000B1;0.09)</td>
<td valign="top" align="center">1.51(&#x000B1;0.08)</td>
</tr>
<tr>
<td valign="top" align="left">VAT</td>
<td valign="top" align="center">2.38(&#x000B1;0.11)</td>
<td valign="top" align="center">1.38(&#x000B1;0.08)</td>
<td valign="top" align="center">1.35(&#x000B1;0.12)</td>
<td valign="top" align="center">1.28(&#x000B1;0.07)</td>
</tr>
<tr>
<td valign="top" align="left">GAT-woTP</td>
<td valign="top" align="center">1.97(&#x000B1;0.87)</td>
<td valign="top" align="center">1.66(&#x000B1;0.85)</td>
<td valign="top" align="center">1.58(&#x000B1;0.96)</td>
<td valign="top" align="center">1.32(&#x000B1;0.65)</td>
</tr>
<tr>
<td valign="top" align="left">GAT (Our method)</td>
<td valign="top" align="center">0.90(&#x000B1;0.11)</td>
<td valign="top" align="center">0.85(&#x000B1;0.09)</td>
<td valign="top" align="center">0.83(&#x000B1;0.17)</td>
<td valign="top" align="center">0.75(&#x000B1;0.08)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>N<sub>l</sub> denotes the number of labeled examples in the training dataset.</italic></p>
<p><italic>The results in the upper panel are taken from prior work; the error rates in the bottom panel are obtained from our implementations. <sup>&#x02020;</sup>Denotes the generation-based methods</italic>.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Test error rates (%) of semi-supervised learning methods on the SVHN and CIFAR-10 datasets.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>SVHN</bold></th>
<th valign="top" align="center" style="border-bottom: thin solid #000000;"><bold>CIFAR-10</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold><italic>N</italic><sub><italic>l</italic></sub> &#x0003D; 1,000</bold></th>
<th valign="top" align="center"><bold><italic>N</italic><sub><italic>l</italic></sub> &#x0003D; 4,000</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003A0;-model</td>
<td valign="top" align="center">5.43(&#x000B1;0.25)</td>
<td valign="top" align="center">16.55(&#x000B1;0.29)</td>
</tr>
<tr>
<td valign="top" align="left">Mean teacher</td>
<td valign="top" align="center">5.21(&#x000B1;0.21)</td>
<td valign="top" align="center">17.74(&#x000B1;0.30)</td>
</tr>
<tr>
<td valign="top" align="left">ALI</td>
<td valign="top" align="center">7.41(&#x000B1;0.65)</td>
<td valign="top" align="center">17.99(&#x000B1;1.62)</td>
</tr>
<tr>
<td valign="top" align="left">Bad GAN<sup>&#x02020;</sup></td>
<td valign="top" align="center">4.25(&#x000B1;0.03)</td>
<td valign="top" align="center">14.41(&#x000B1;0.30)</td>
</tr>
<tr>
<td valign="top" align="left">Triple GAN<sup>&#x02020;</sup></td>
<td valign="top" align="center">5.77(&#x000B1;0.17)</td>
<td valign="top" align="center">16.99(&#x000B1;0.36)</td>
</tr>
<tr>
<td valign="top" align="left">Improved GAN<sup>&#x02020;</sup></td>
<td valign="top" align="center">4.39(&#x000B1;1.20)</td>
<td valign="top" align="center">16.20(&#x000B1;1.60)</td>
</tr>
<tr>
<td valign="top" align="left">TNAR-LGAN (Small)<sup>&#x02020;</sup></td>
<td valign="top" align="center">4.25(&#x000B1;0.09)</td>
<td valign="top" align="center">12.97(&#x000B1;0.31)</td>
</tr>
<tr>
<td valign="top" align="left">TNAR-LGAN (Large)<sup>&#x02020;</sup></td>
<td valign="top" align="center">4.03(&#x000B1;0.13)</td>
<td valign="top" align="center">12.76(&#x000B1;0.04)</td>
</tr>
<tr>
<td valign="top" align="left">RAT (Small)</td>
<td valign="top" align="center">8.42(&#x000B1;0.22)</td>
<td valign="top" align="center">18.58(&#x000B1;0.26)</td>
</tr>
<tr>
<td valign="top" align="left">RAT (Large)</td>
<td valign="top" align="center">8.36(&#x000B1;0.22)</td>
<td valign="top" align="center">18.23(&#x000B1;0.16)</td>
</tr>
<tr>
<td valign="top" align="left">VAT (Small)</td>
<td valign="top" align="center">6.83(&#x000B1;0.24)</td>
<td valign="top" align="center">14.87(&#x000B1;0.13)</td>
</tr>
<tr>
<td valign="top" align="left">VAT (Large)</td>
<td valign="top" align="center">5.77(&#x000B1;0.32)</td>
<td valign="top" align="center">14.18(&#x000B1;0.38)</td>
</tr>
<tr>
<td valign="top" align="left">GAT-woTP (Small)</td>
<td valign="top" align="center">6.53(&#x000B1;0.95)</td>
<td valign="top" align="center">14.36(&#x000B1;1.03)</td>
</tr>
<tr>
<td valign="top" align="left">GAT-woTP (Large)</td>
<td valign="top" align="center">5.26(&#x000B1;0.92)</td>
<td valign="top" align="center">14.02(&#x000B1;0.88)</td>
</tr>
<tr>
<td valign="top" align="left">GAT (Our method, Small)</td>
<td valign="top" align="center">4.27(&#x000B1;0.14)</td>
<td valign="top" align="center">12.96(&#x000B1;0.15)</td>
</tr>
<tr>
<td valign="top" align="left">GAT (Our method, Large)</td>
<td valign="top" align="center">4.01(&#x000B1;0.11)</td>
<td valign="top" align="center">12.81(&#x000B1;0.13)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>N<sub>l</sub> represents the number of labeled examples in the training dataset. The results in the upper panel are taken from prior work; the results in the bottom panel are obtained from our implementations. <sup>&#x02020;</sup>Denotes the generation-based methods</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusion</title>
<p>In this article, a novel GAT framework has been proposed to improve the generalization performance of neural networks in both supervised and semi-supervised learning tasks. In the proposed framework, the target classifier is regularized by letting the perturbation generator watch and move against the target classifier in a minimax game. We use the <italic>cross entropy</italic> to measure the output deviation in the regularization term, so that the prediction of the target classifier is reinforced. Furthermore, an effective alternating update method is developed to train the target classifier and the perturbation generator stably. Experiments on synthetic and real datasets demonstrate the effectiveness of the proposed method.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data Availability Statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>XW contributed to the conception of the study, performed the data analyses, and wrote the manuscript. JL contributed significantly to analysis and manuscript preparation. QL, WZ, ZL, and WW performed the experiments.</p>
</sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This work was supported by the National Natural Science Foundation of China (Nos. 62072127 and 62002076), Project 6142111180404 supported by CNKLSTISS, the Science and Technology Program of Guangzhou, China (Nos. 202002030131 and 201904010493), the Guangdong Basic and Applied Basic Research Fund Joint Fund Youth Fund (No. 2019A1515110213), the Open Fund Project of Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (No. MJUKF-IPIC202101), and the Natural Science Foundation of Guangdong Province (No. 2020A1515010423).</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec> </body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bishop</surname> <given-names>C. M.</given-names></name> <name><surname>Nasser</surname> <given-names>M. N.</given-names></name></person-group> (<year>2006</year>). <source>Pattern Recognition and Machine Learning, Vol. 4.</source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cui</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Jia</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Learnable boundary guided adversarial training,</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>15721</fpage>&#x02013;<lpage>15730</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dai</surname> <given-names>Z.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name> <name><surname>Yang</surname> <given-names>F.</given-names></name> <name><surname>Cohen</surname> <given-names>W. W.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R. R.</given-names></name></person-group> (<year>2017</year>). <article-title>Good semi-supervised learning that requires a bad gan,</article-title> in <source>Advances in Neural Information Processing Systems</source>, eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Long Beach, CA: Curran Associates), <fpage>6510</fpage>&#x02013;<lpage>6520</lpage>.</citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>S.</given-names></name> <name><surname>Cai</surname> <given-names>Q.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Wu</surname> <given-names>X.</given-names></name></person-group> (<year>2021a</year>). <article-title>User behavior analysis based on stacked autoencoder and clustering in complex power grid environment</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> 1&#x02013;15. <pub-id pub-id-type="doi">10.1109/TITS.2021.3076607</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>F.</given-names></name> <name><surname>Dong</surname> <given-names>X.</given-names></name> <name><surname>Gao</surname> <given-names>G.</given-names></name> <name><surname>Wu</surname> <given-names>X.</given-names></name></person-group> (<year>2021b</year>). <article-title>Short-term load forecasting by using improved gep and abnormal load recognition</article-title>. <source>ACM Trans. Internet Technol. (TOIT)</source> <volume>21</volume>, <fpage>1</fpage>&#x02013;<lpage>28</lpage>. <pub-id pub-id-type="doi">10.1145/3447513</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>Y.</given-names></name> <name><surname>Liao</surname> <given-names>F.</given-names></name> <name><surname>Pang</surname> <given-names>T.</given-names></name> <name><surname>Su</surname> <given-names>H.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Boosting adversarial attacks with momentum,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>9185</fpage>&#x02013;<lpage>9193</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fang</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>P.</given-names></name> <name><surname>Han</surname> <given-names>T.</given-names></name></person-group> (<year>2022</year>). <article-title>Hint: harnessing the wisdom of crowds for handling multi-phase tasks</article-title>. <source>Neural Comput. Appl.</source> 1&#x02013;23. <pub-id pub-id-type="doi">10.1007/s00521-021-06825-7</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Feng</surname> <given-names>S.</given-names></name> <name><surname>Yan</surname> <given-names>X.</given-names></name> <name><surname>Sun</surname> <given-names>H.</given-names></name> <name><surname>Feng</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>H. X.</given-names></name></person-group> (<year>2021</year>). <article-title>Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment</article-title>. <source>Nat. Commun.</source> <volume>12</volume>, <fpage>1</fpage>&#x02013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1038/s41467-021-21007-8</pub-id><pub-id pub-id-type="pmid">33531506</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I. J.</given-names></name> <name><surname>Pouget-Abadie</surname> <given-names>J.</given-names></name> <name><surname>Mirza</surname> <given-names>M.</given-names></name> <name><surname>Bing</surname> <given-names>X.</given-names></name> <name><surname>Warde-Farley</surname> <given-names>D.</given-names></name> <name><surname>Ozair</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2014a</year>). <article-title>Generative adversarial nets,</article-title> in <source>International Conference on Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>).</citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I. J.</given-names></name> <name><surname>Shlens</surname> <given-names>J.</given-names></name> <name><surname>Szegedy</surname> <given-names>C.</given-names></name></person-group> (<year>2014b</year>). <article-title>Explaining and harnessing adversarial examples</article-title>. <source>arXiv preprint</source> arXiv:1412.6572.</citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Grandvalet</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2004</year>). <article-title>Semi-supervised learning by entropy minimization,</article-title> in <source>International Conference on Neural Information Processing Systems</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jin</surname> <given-names>L.</given-names></name> <name><surname>Wei</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>Gradient-based differential neural-solution to time-dependent nonlinear optimization</article-title>. <source>IEEE Trans. Autom. Control</source> 1. <pub-id pub-id-type="doi">10.1109/TAC.2022.3144135</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khan</surname> <given-names>M. M.</given-names></name> <name><surname>Mehnaz</surname> <given-names>S.</given-names></name> <name><surname>Shaha</surname> <given-names>A.</given-names></name> <name><surname>Nayem</surname> <given-names>M.</given-names></name> <name><surname>Bourouis</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Iot-based smart health monitoring system for covid-19 patients</article-title>. <source>Comput. Math. Methods Med.</source> <volume>2021</volume>, <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1155/2021/8591036</pub-id><pub-id pub-id-type="pmid">34824600</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Mohamed</surname> <given-names>S.</given-names></name> <name><surname>Rezende</surname> <given-names>D. J.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Semi-supervised learning with deep generative models,</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 2</source> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>3581</fpage>&#x02013;<lpage>3589</lpage>.<pub-id pub-id-type="pmid">29989965</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Laine</surname> <given-names>S. M.</given-names></name> <name><surname>Aila</surname> <given-names>T. O.</given-names></name></person-group> (<year>2018</year>). <source>Temporal Ensembling for Semi-Supervised Learning.</source> U.S. Patent App. 15/721,433.</citation>
</ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>M.</given-names></name> <name><surname>Chen</surname> <given-names>L.</given-names></name> <name><surname>Du</surname> <given-names>X.</given-names></name> <name><surname>Jin</surname> <given-names>L.</given-names></name> <name><surname>Shang</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>Activated gradients for deep neural networks</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> 1&#x02013;13. <pub-id pub-id-type="doi">10.1109/TNNLS.2021.3106044</pub-id><pub-id pub-id-type="pmid">34469312</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Luo</surname> <given-names>Y.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>M.</given-names></name> <name><surname>Ren</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>Smooth neighbors on teacher graphs for semi-supervised learning,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>8896</fpage>&#x02013;<lpage>8905</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Maal&#x000F8;e</surname> <given-names>L.</given-names></name> <name><surname>S&#x000F8;nderby</surname> <given-names>C. K.</given-names></name> <name><surname>S&#x000F8;nderby</surname> <given-names>S. K.</given-names></name> <name><surname>Winther</surname> <given-names>O.</given-names></name></person-group> (<year>2016</year>). <article-title>Auxiliary deep generative models</article-title>. <source>arXiv preprint</source> arXiv:1602.05473.</citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Madry</surname> <given-names>A.</given-names></name> <name><surname>Makelov</surname> <given-names>A.</given-names></name> <name><surname>Schmidt</surname> <given-names>L.</given-names></name> <name><surname>Tsipras</surname> <given-names>D.</given-names></name> <name><surname>Vladu</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Towards deep learning models resistant to adversarial attacks</article-title>. <source>arXiv preprint</source> arXiv:1706.06083.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miyato</surname> <given-names>T.</given-names></name> <name><surname>Maeda</surname> <given-names>S.-i.</given-names></name> <name><surname>Koyama</surname> <given-names>M.</given-names></name> <name><surname>Ishii</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Virtual adversarial training: a regularization method for supervised and semi-supervised learning</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>41</volume>, <fpage>1979</fpage>&#x02013;<lpage>1993</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2018.2858821</pub-id><pub-id pub-id-type="pmid">30040630</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mnih</surname> <given-names>V.</given-names></name> <name><surname>Kavukcuoglu</surname> <given-names>K.</given-names></name> <name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Rusu</surname> <given-names>A. A.</given-names></name> <name><surname>Veness</surname> <given-names>J.</given-names></name> <name><surname>Bellemare</surname> <given-names>M. G.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Human-level control through deep reinforcement learning</article-title>. <source>Nature</source> <volume>518</volume>, <fpage>529</fpage>. <pub-id pub-id-type="doi">10.1038/nature14236</pub-id><pub-id pub-id-type="pmid">25719670</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pustejovsky</surname> <given-names>J.</given-names></name> <name><surname>Krishnaswamy</surname> <given-names>N.</given-names></name></person-group> (<year>2021</year>). <article-title>Embodied human computer interaction</article-title>. <source>KI-K&#x000FC;nstliche Intelligenz</source> <volume>35</volume>, <fpage>307</fpage>&#x02013;<lpage>327</lpage>.</citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sajjadi</surname> <given-names>M.</given-names></name> <name><surname>Javanmardi</surname> <given-names>M.</given-names></name> <name><surname>Tasdizen</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>Regularization with stochastic transformations and perturbations for deep semi-supervised learning</article-title>.</citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Salimans</surname> <given-names>T.</given-names></name> <name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Zaremba</surname> <given-names>W.</given-names></name> <name><surname>Cheung</surname> <given-names>V.</given-names></name> <name><surname>Radford</surname> <given-names>A.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name></person-group> (<year>2016</year>). <article-title>Improved techniques for training gans,</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Red Hook, NY</publisher-loc>: <publisher-name>Curran Associates</publisher-name>), <fpage>2234</fpage>&#x02013;<lpage>2242</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Strauss</surname> <given-names>T.</given-names></name> <name><surname>Hanselmann</surname> <given-names>M.</given-names></name> <name><surname>Junginger</surname> <given-names>A.</given-names></name> <name><surname>Ulmer</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>Ensemble methods as a defense to adversarial perturbations against deep neural networks</article-title>. <source>arXiv preprint</source> arXiv:1709.03423.</citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Szegedy</surname> <given-names>C.</given-names></name> <name><surname>Zaremba</surname> <given-names>W.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Bruna</surname> <given-names>J.</given-names></name> <name><surname>Erhan</surname> <given-names>D.</given-names></name> <name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <etal/></person-group>. (<year>2013</year>). <article-title>Intriguing properties of neural networks</article-title>. <source>arXiv preprint</source> arXiv:1312.6199.</citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tram&#x000E8;r</surname> <given-names>F.</given-names></name> <name><surname>Kurakin</surname> <given-names>A.</given-names></name> <name><surname>Papernot</surname> <given-names>N.</given-names></name> <name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Boneh</surname> <given-names>D.</given-names></name> <name><surname>McDaniel</surname> <given-names>P.</given-names></name></person-group> (<year>2017</year>). <article-title>Ensemble adversarial training: Attacks and defenses</article-title>. <source>arXiv preprint</source> arXiv:1705.07204.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wahba</surname> <given-names>G.</given-names></name></person-group> (<year>1990</year>). <article-title>Spline models for observational data</article-title>. <source>Technometrics</source> <volume>34</volume>, <fpage>113</fpage>&#x02013;<lpage>114</lpage>.</citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Kuang</surname> <given-names>X.</given-names></name> <name><surname>Tan</surname> <given-names>Y.-a.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>The security of machine learning in an adversarial setting: a survey</article-title>. <source>J. Parallel Distrib. Comput.</source> <volume>130</volume>, <fpage>12</fpage>&#x02013;<lpage>23</lpage>. <pub-id pub-id-type="doi">10.1016/j.jpdc.2019.03.003</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>D.</given-names></name> <name><surname>He</surname> <given-names>Y.</given-names></name> <name><surname>Luo</surname> <given-names>X.</given-names></name> <name><surname>Zhou</surname> <given-names>M.</given-names></name></person-group> (<year>2021a</year>). <article-title>A latent factor analysis-based approach to online sparse streaming feature selection</article-title>. <source>IEEE Trans. Syst. Man Cybern. Syst.</source> 1&#x02013;15. <pub-id pub-id-type="doi">10.1109/TSMC.2021.3096065</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>D.</given-names></name> <name><surname>Luo</surname> <given-names>X.</given-names></name> <name><surname>Shang</surname> <given-names>M.</given-names></name> <name><surname>He</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Wu</surname> <given-names>X.</given-names></name></person-group> (<year>2020</year>). <article-title>A data-characteristic-aware latent factor model for web services qos prediction</article-title>. <source>IEEE Trans. Knowl. Data Eng.</source> 1. <pub-id pub-id-type="doi">10.1109/TKDE.2020.3014302</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>D.</given-names></name> <name><surname>Shang</surname> <given-names>M.</given-names></name> <name><surname>Luo</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name></person-group> (<year>2021b</year>). <article-title>An l1-and-l2-norm-oriented latent factor model for recommender systems</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> 1&#x02013;14. <pub-id pub-id-type="doi">10.1109/TNNLS.2021.3071392</pub-id><pub-id pub-id-type="pmid">33886475</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yuan</surname> <given-names>X.</given-names></name> <name><surname>He</surname> <given-names>P.</given-names></name> <name><surname>Zhu</surname> <given-names>Q.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <article-title>Adversarial examples: Attacks and defenses for deep learning</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> <volume>30</volume>, <fpage>2805</fpage>&#x02013;<lpage>2824</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2018.2886017</pub-id><pub-id pub-id-type="pmid">30640631</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>C.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>D.</given-names></name> <name><surname>Chang</surname> <given-names>J.</given-names></name> <name><surname>Gao</surname> <given-names>R.</given-names></name></person-group> (<year>2022</year>). <article-title>Deep recommendation with adversarial training</article-title>. <source>IEEE Trans. Emerg. Top. Comput.</source> 1. <pub-id pub-id-type="doi">10.1109/TETC.2022.3141422</pub-id><pub-id pub-id-type="pmid">34112161</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>W.</given-names></name> <name><surname>Gao</surname> <given-names>B.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Yao</surname> <given-names>P.</given-names></name> <name><surname>Yu</surname> <given-names>S.</given-names></name> <name><surname>Chang</surname> <given-names>M.-F.</given-names></name> <name><surname>Yoo</surname> <given-names>H.-J.</given-names></name> <name><surname>Qian</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>Neuro-inspired computing chips</article-title>. <source>Nat. Electron.</source> <volume>3</volume>, <fpage>371</fpage>&#x02013;<lpage>382</lpage>. <pub-id pub-id-type="doi">10.1038/s41928-020-0435-7</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>S.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Leung</surname> <given-names>T.</given-names></name> <name><surname>Goodfellow</surname> <given-names>I.</given-names></name></person-group> (<year>2016</year>). <article-title>Improving the robustness of deep neural networks via stability training,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>4480</fpage>&#x02013;<lpage>4488</lpage>.</citation>
</ref>
</ref-list> 
</back>
</article> 