<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2022.845955</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Continual Sequence Modeling With Predictive Coding</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Annabi</surname> <given-names>Louis</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1425831/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Pitti</surname> <given-names>Alexandre</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1254/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Quoy</surname> <given-names>Mathias</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>UMR8051 Equipes Traitement de l&#x00027;Information et Systemes (ETIS), CY University, ENSEA, CNRS</institution>, <addr-line>Cergy-Pontoise</addr-line>, <country>France</country></aff>
<aff id="aff2"><sup>2</sup><institution>IPAL CNRS Singapore</institution>, <addr-line>Singapore</addr-line>, <country>Singapore</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Jun Tani, Okinawa Institute of Science and Technology Graduate University, Japan</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Andrea Cossu, University of Pisa, Italy; Yulia Sandamirskaya, Intel, Germany</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Louis Annabi <email>louis.annabi&#x00040;gmail.com</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Original Research Article, a section of the journal Frontiers in Neurorobotics</p></fn></author-notes>
<pub-date pub-type="epub">
<day>23</day>
<month>05</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>845955</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>12</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>19</day>
<month>04</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Annabi, Pitti and Quoy.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Annabi, Pitti and Quoy</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Recurrent neural networks (RNNs) have proven very successful at modeling sequential data such as language or motions. However, these successes rely on the use of the backpropagation through time (BPTT) algorithm, batch training, and the hypothesis that all the training data are available at the same time. In contrast, the field of developmental robotics aims at uncovering lifelong learning mechanisms that could allow embodied machines to learn and stabilize knowledge in continuously evolving environments. In this article, we investigate different RNN designs and learning methods that we evaluate in a continual learning setting. The generative modeling task consists of learning to generate 20 continuous trajectories that are presented sequentially to the learning algorithms. Each method is evaluated according to the average prediction error over the 20 trajectories obtained after complete training. This study focuses on learning algorithms with low memory requirements, that do not need to store past information to update their parameters. Our experiments identify two approaches especially well suited to this task: conceptors and predictive coding. We suggest combining these two mechanisms into a new proposed model, which we label PC-Conceptors, that outperforms the other methods presented in this study.</p></abstract>
<kwd-group>
<kwd>predictive coding</kwd>
<kwd>continual learning</kwd>
<kwd>Reservoir Computing (RC)</kwd>
<kwd>recurrent neural networks (RNN)</kwd>
<kwd>conceptors</kwd>
</kwd-group>
<counts>
<fig-count count="11"/>
<table-count count="2"/>
<equation-count count="22"/>
<ref-count count="31"/>
<page-count count="11"/>
<word-count count="7550"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Continual learning is a branch of machine learning aiming at equipping learning agents with the ability to learn incrementally without forgetting previously acquired knowledge. The continual learning setting typically involves several separate tasks, within each of which the data are assumed to be independent and identically distributed. The learning algorithm is confronted with each source of data (i.e., each task) sequentially. After a set amount of time on a task, training switches to a new task. This process is repeated until the learning algorithm has been confronted with all tasks.</p>
<p>Learning methods based on iterative updates of model parameters, such as the backpropagation algorithm, can be performed sequentially as new data becomes available. However, these methods might suffer from the problem known as catastrophic forgetting (McCloskey and Cohen, <xref ref-type="bibr" rid="B17">1989</xref>) if the distribution of the data they process evolves over time. When adapting to the new task, they automatically overwrite the model parameters that were optimized according to the previous tasks. This is an important issue since it prevents artificial neural networks from being trained incrementally.</p>
<p>In this study, we focus on the problem of learning a repertoire of trajectories. As such, the training examples in each task are sequences that the learning algorithm has to generate from a discrete input (i.e., the index of the sequence). We study different Recurrent Neural Network (RNN) designs and learning algorithms for this continual learning task. We limit our comparison to models with low memory requirements and, thus, impose that at each time step <italic>t</italic>, the neural network computations and learning can only access the currently available quantities. In our case, these quantities are the current hidden variables of the models, and the target output <inline-formula><mml:math id="M1"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> provided by the data set. Consequently, learning methods based on BPTT do not meet this criterion, as they need to store in memory the past activations of the RNN hidden states to compute gradients. The advantage of models fitting this criterion is that they could in principle be implemented on dedicated hardware reproducing the neural network architecture, with no need for an external memory storing past inputs and activations.</p>
<p>To avoid confusion about the use of the word &#x0201C;online,&#x0201D; we rather talk about <italic>continual</italic> learning to refer to the task temporality, and about <italic>online</italic> learning to refer to the sequence (the training example) temporality. The models studied in this article are thus trained both in a continual learning setting, since the different target trajectories are provided sequentially to the agent, and with online learning mechanisms, since the learning algorithms do not rely on a memory of past activations. The article is structured as follows: in Section 2, we review methods that have been proposed to mitigate the problem of catastrophic forgetting in artificial neural networks, as well as learning algorithms that can be performed online. In Section 3, we describe the experimental setting and the different algorithms, and we present the obtained results in Section 4. Finally, in Section 5, we discuss our results in order to identify the online learning mechanisms for RNNs most suited for the continual learning of a repertoire of sequences.</p></sec>
<sec id="s2">
<title>2. Related Work</title>
<p>There exists a large spectrum of methods to mitigate catastrophic forgetting in continual learning settings. Regularization methods typically aim at limiting forgetting by constraining learning with, e.g., sparsity constraints, early stopping, or identified synaptic weights that should not be overwritten. For instance, in Elastic Weight Consolidation (EWC) (Kirkpatrick et al., <xref ref-type="bibr" rid="B12">2017</xref>), the update rule contains a regularization term that pulls the synaptic weights toward the optimal weights found for previous tasks, with a strength depending on the estimated importance of each synaptic weight.</p>
<p>Another approach is to rely on architecture modifications when new tasks are presented, for instance by freezing some of the previously learned weights (Mallya et al., <xref ref-type="bibr" rid="B16">2018</xref>), or by adding new neurons and synaptic connections to the model (Li and Hoiem, <xref ref-type="bibr" rid="B13">2017</xref>). Finally, rehearsal (Rebuffi et al., <xref ref-type="bibr" rid="B24">2017</xref>) and generative replay (Shin et al., <xref ref-type="bibr" rid="B27">2017</xref>) methods rely on saving examples or modeling past tasks for future use. By inserting training examples from the previous tasks, either saved or replayed, into the current task, these methods allow retraining on those data points, thus limiting catastrophic forgetting.</p>
<p>In this study, we compare learning algorithms with low memory requirements in a continual learning setting. As such, we disregard approaches such as rehearsal and generative replay and only consider some simple regularization or architectural techniques to improve the performance of sequence memory models in a continual learning setting.</p>
<p>Many alternatives to BPTT have been investigated in the past decades, often with the goal of avoiding the problems known as exploding and vanishing gradients that can arise when using this learning algorithm (Pascanu et al., <xref ref-type="bibr" rid="B20">2013</xref>). Here, we study two alternative approaches, namely, learning with evolution strategies, and Reservoir Computing (RC) (Verstraeten et al., <xref ref-type="bibr" rid="B30">2007</xref>; Lukosevicius and Jaeger, <xref ref-type="bibr" rid="B14">2009</xref>).</p>
<p>Using evolution strategies allows for learning RNN parameters without having to rely on past activations. The success of a certain parameter configuration can be measured online, for instance by comparing the network&#x00027;s output at each time step <italic>t</italic> with the target output. Then, this score is used as the fitness measure to be minimized by evolution. Following this approach, Schmidhuber et al. (<xref ref-type="bibr" rid="B26">2005</xref>) and Schmidhuber et al. (<xref ref-type="bibr" rid="B25">2007</xref>) co-evolve different groups of neurons in a Long Short-Term Memory (LSTM) network. A similar approach is used by Pitti et al. (<xref ref-type="bibr" rid="B21">2017</xref>), where the fitness measure is used to directly optimize the inputs of an RNN.</p>
<p>Completely avoiding the problem of learning recurrent weights, a family of approaches has emerged in parallel from the field of computational neuroscience in the form of Liquid State Machines (Maass et al., <xref ref-type="bibr" rid="B15">2002</xref>), and from the field of machine learning in the form of Echo State Networks (ESN) (Jaeger, <xref ref-type="bibr" rid="B9">2001</xref>). These models, later brought together under the label of Reservoir Computing (Verstraeten et al., <xref ref-type="bibr" rid="B30">2007</xref>; Lukosevicius and Jaeger, <xref ref-type="bibr" rid="B14">2009</xref>), discard the difficulty of learning recurrent weights by instead developing techniques to find relevant initializations of these parameters.</p>
<p>Typically, the recurrent connections are set so that the RNN exhibits rich non-linear (and sometimes self-sustained) dynamics that are decoded by a learned readout layer. If the dynamics of the RNN activations are complex enough (e.g., they do not converge too rapidly toward a point attractor or limit cycle attractor), various output sequences can be decoded from them. Training RC models then comes down to learning the weights of the readout layer, an easier optimization problem that can be tackled with several algorithms. This output layer can, e.g., be trained using stochastic gradient descent, without the need for backpropagation. The FORCE algorithm (Sussillo and Abbott, <xref ref-type="bibr" rid="B29">2009</xref>) improves this learning by running an iterative estimate of the correlation matrix of the hidden state activations.</p>
<p>Another interesting learning mechanism is presented in Jaeger (<xref ref-type="bibr" rid="B10">2014a</xref>,<xref ref-type="bibr" rid="B11">b</xref>) under the name of Conceptors. This method exploits the fact that the hidden state dynamics triggered by an input pattern are typically confined to a subspace of lower dimension. By identifying this subspace (called a Conceptor) for each possible input pattern, it is possible to decorrelate the training of the target trajectories by focusing learning on the readout connections that come from the corresponding hidden state subspace. This method allows training a sequence memory in which the learning of a new pattern has limited interference with already learned ones.</p>
<p>Finally, the Predictive Coding (PC) theory (Rao and Ballard, <xref ref-type="bibr" rid="B22">1999</xref>; Clark, <xref ref-type="bibr" rid="B3">2013</xref>) also provides learning rules that do not rely on past activations. According to PC, prediction error neurons are intertwined with the neural generative model and encode at each layer the discrepancy between the top-down prediction and the current representation. It has been shown that this construction allows propagating the output prediction error information back into the generative model and even approximates the backpropagation algorithm (Whittington and Bogacz, <xref ref-type="bibr" rid="B31">2017</xref>; Millidge et al., <xref ref-type="bibr" rid="B18">2020</xref>).</p>
<p>Taking inspiration from the PC theory, we propose several models that integrate prediction error neurons into a simple RNN design. These prediction error neurons transport the error information from the output layer to the hidden layer, which provides a local target that can be used to learn the recurrent and input weights. We label the resulting models PC-RNN (for Predictive Coding Recurrent Neural Network). In Appendix A, we show how these models can be derived from the application of gradient descent on a quantity called variational free-energy, expressed according to different generative models. The resulting models slightly deviate from other approaches such as the Parallel Temporal Coding Network (P-TNCN) described in Ororbia et al. (<xref ref-type="bibr" rid="B19">2020</xref>), and the original PC model presented in Rao and Ballard (<xref ref-type="bibr" rid="B23">1997</xref>), which suggest learning feedback weights responsible for the bottom-up computations instead of copying the forward weights, as performed in the proposed models.</p>
<p>There have been other evaluations of continual learning methods applied to RNNs (Sodhani et al., <xref ref-type="bibr" rid="B28">2020</xref>; Cossu et al., <xref ref-type="bibr" rid="B6">2021b</xref>), some even focusing on ESNs (Cossu et al., <xref ref-type="bibr" rid="B5">2021a</xref>). While these studies compare many continual learning techniques, they do not consider the online learning constraint, and almost exclusively focus on sequence classification tasks. In contrast, this study investigates continual learning methods that can be used online, applied to the incremental learning of a repertoire of trajectories.</p></sec>
<sec sec-type="materials and methods" id="s3">
<title>3. Materials and Methods</title>
<p>In this section, we detail our experimental setting as well as the different models that we use for the comparative study.</p>
<sec>
<title>3.1. Experimental Setting</title>
<p>Each RNN model is trained sequentially on <italic>p</italic> sequence generation tasks. The <italic>p</italic> sequences to be learned are sampled from a data set of motion capture trajectories of dimension 62. Each point <inline-formula><mml:math id="M2"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> describes a body configuration, as represented in <xref ref-type="fig" rid="F1">Figure 1</xref>. These trajectories were obtained from the <ext-link ext-link-type="uri" xlink:href="http://mocap.cs.cmu.edu">CMU Motion Capture Database</ext-link>. We make a distinction between the validation set used to optimize the hyperparameters of each model, and the test set, used to measure the performance of each model. In our experiments, the validation set is composed of <italic>p</italic> &#x0003D; 15 trajectories of a subject (&#x00023;86 in the database) practicing various sports. The test set is composed of <italic>p</italic> &#x0003D; 20 trajectories of a subject (&#x00023;62 in the database) performing construction work (screwing, hammering, etc.).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Three body configurations taken from a trajectory capturing a jump motion.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0001.tif"/>
</fig>
<p>We also measure the performance of each model on a different test set of <italic>p</italic> &#x0003D; 20 simple 2D trajectories corresponding to handwritten letters taken from the UCI Machine Learning Repository (Dua and Graff, <xref ref-type="bibr" rid="B7">2019</xref>). All trajectories are resampled to last for 60 time steps. These data sets were chosen since they represent potential use cases of the models presented in this work. For instance, the proposed continual learning algorithms could be used to incrementally train a robot manipulator to perform certain motor trajectories.</p>
<p>We assume that the model knows when a transition between two tasks occurs, and provide the RNN with the current task index <italic>k</italic> as a one-hot vector input of dimension <italic>p</italic>. Alternatively, this distributional shift could, e.g., be automatically detected through a significant increase of the prediction error.</p>
<p>The end goal of this experiment is to identify online learning mechanisms for RNNs that extend properly to the continual learning case. The RNN architectures typically comprise three types of weight parameters to be learned: the output weights, the recurrent weights, and the input weights. As such, we split our analysis into three comparisons focusing on the learning methods for each type of parameter.</p>
<p>For each learning mechanism, we perform an optimization of hyperparameters using Bayesian optimization with Gaussian processes and Matern 5/2 Kernel, similarly to the RNN encoding capacity comparative analysis performed in Collins et al. (<xref ref-type="bibr" rid="B4">2016</xref>).</p>
<p>This method tries to approximate the function <bold><italic>P</italic></bold>&#x02192;<italic>f</italic>(<bold><italic>P</italic></bold>) that associates a score with a certain hyperparameter configuration <bold><italic>P</italic></bold>. This approximation is estimated based on points (<bold><italic>P</italic></bold><sub>0</sub>, <italic>f</italic>(<bold><italic>P</italic></bold><sub>0</sub>)), (<bold><italic>P</italic></bold><sub>1</sub>, <italic>f</italic>(<bold><italic>P</italic></bold><sub>1</sub>)), &#x022EF; sampled sequentially by the optimizer. The function used by the optimizer to guide its sampling process is called the acquisition function. Here, we used an expected improvement acquisition function, meaning that at each iteration, the optimizer samples the point <bold><italic>P</italic></bold> which is most likely to improve the current estimated maximum of the function <italic>f</italic>. Compared to exhaustive hyperparameter optimization methods such as random search or grid search, this method is expected to converge faster and to better configurations. To perform this hyperparameter optimization, we used the <monospace>gp_minimize</monospace> function from the scikit-optimize library in Python.</p>
<p>The hyperparameters of the models are optimized in order to minimize the final average prediction error on the <italic>p</italic> target sequences of the validation set. For each model, the hyperparameters to optimize are the learning rates associated with the input, recurrent, and output weights, as well as some other coefficients specific to certain learning algorithms. The score function associates each hyperparameter configuration with a real-valued score computed as the negative logarithm of the average prediction error at the end of training.</p>
<p>With the hyperparameter configurations we obtain, we perform for each model 10 training runs with different random seeds in the continual learning setting to measure their performance. The final average prediction error on the <italic>p</italic> sequences can then be used to evaluate and compare the different learning mechanisms.</p></sec>
<sec>
<title>3.2. Benchmark Models</title>
<p>The models for this benchmark were chosen in order to identify the relevant mechanisms for training RNNs in a continual learning setting. As stated above, we limit this analysis to learning algorithms that can be performed <italic>online</italic>, i.e., without relying on past activations. For each set of weights, we compare the different models listed in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Summary of the models used in our benchmark.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Weights</bold></th>
<th valign="top" align="left"><bold>Model</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Output weights</td>
<td valign="top" align="left">ESN (Widrow-Hoff)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Conceptors</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">EWC</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">ESN &#x0002B; GR</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Recurrent weights</td>
<td valign="top" align="left">PC-RNN-V</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">P-TNCN</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">PC-RNN-Hebb</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Input weights</td>
<td valign="top" align="left">PC-RNN-HC-A</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">PC-RNN-HC-M</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">PC-RNN-HC-A-RS</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">PC-RNN-HC-M-RS</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>3.2.1. Output Weights</title>
<p>For the learning of the output weights of RNNs, denoted <bold><italic>W</italic></bold><sub><italic>o</italic></sub>, we compared four learning methods, applied to the simple RNN architecture represented in <xref ref-type="fig" rid="F2">Figure 2</xref>. All methods share the same architecture, and do not provide any learning mechanism for the recurrent weights. At each time step, the hidden state and output prediction are obtained with the following equations:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mo class="qopname">tanh</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E2"><label>(2)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mo class="qopname">tanh</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003C4; is a time constant controlling the velocity of the hidden state dynamics.</p>
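As a minimal sketch of Equations 1 and 2 (not the authors' implementation; the network size, time constant, and reservoir-style weight scaling below are illustrative assumptions), the forward pass can be written in NumPy as:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, tau = 100, 62, 5.0  # illustrative: 100 hidden units, 62 outputs, time constant 5

# Fixed random recurrent weights (reservoir-style initialization); the readout
# is initialized to zero and is the part being learned
W_r = rng.normal(0.0, 1.0 / np.sqrt(n_h), (n_h, n_h))
W_o = np.zeros((n_x, n_h))

def step(h, W_r, tau):
    """Leaky hidden-state update (Equation 1)."""
    return (1.0 - 1.0 / tau) * h + (1.0 / tau) * W_r @ np.tanh(h)

def readout(h, W_o):
    """Linear readout of the hidden state (Equation 2)."""
    return W_o @ np.tanh(h)

h = rng.normal(0.0, 0.5, n_h)
for t in range(60):           # one 60-step trajectory, as in the data sets
    h = step(h, W_r, tau)
    x = readout(h, W_o)
```

Larger values of tau slow the hidden-state dynamics down, which is the role the time constant plays in Equation 1.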
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Simple RNN model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0002.tif"/>
</fig>
<p>The four methods differ with regard to the learning mechanism applied to the output weights. First, output weights can be learned using standard stochastic gradient descent. In the RNN models we consider, the prediction <bold><italic>x</italic></bold><sub><italic>t</italic></sub> is not re-injected into the recurrent computations. As such, the output weight gradients can be computed using only the target signal <inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>, the prediction <bold><italic>x</italic></bold><sub><italic>t</italic></sub>, and the hidden state <bold><italic>h</italic></bold><sub><italic>t</italic></sub>. These computations do not involve the backpropagation of a gradient through time and thus qualify as an online learning method. This first learning rule, also known as the Widrow-Hoff learning rule, is expressed as:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>o</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mo class="qopname">tanh</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x022BA;</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003BB; is the learning rate of the output weights, and <bold><italic>&#x003F5;</italic></bold><sub><italic>x, t</italic></sub> is the prediction error on the output layer, i.e., the difference <inline-formula><mml:math id="M7"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
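A NumPy sketch of this update (the dimensions and learning rate are illustrative, and random vectors stand in for the real hidden state and target, which would come from the RNN dynamics and the data set):

```python
import numpy as np

rng = np.random.default_rng(1)
n_h, n_x, lam = 100, 62, 0.01         # illustrative sizes and learning rate

W_o = np.zeros((n_x, n_h))
h_t = rng.normal(0.0, 0.5, n_h)       # stands in for the current hidden state
x_target = rng.normal(0.0, 1.0, n_x)  # stands in for the target output

# Widrow-Hoff update (Equation 3): only currently available quantities are used
x_pred = W_o @ np.tanh(h_t)
eps_x = x_target - x_pred             # output prediction error
W_o = W_o + lam * np.outer(eps_x, np.tanh(h_t))
```

After this rank-1 update, the prediction error on the same hidden state shrinks by the factor 1 &#x02212; &#x003BB;&#x02016;tanh(<bold><italic>h</italic></bold><sub><italic>t</italic></sub>)&#x02016;&#x000B2;, illustrating that the rule is a stochastic gradient step on the squared output error.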
<p>The second learning mechanism that we study is stochastic gradient descent aided by Conceptors (Jaeger, <xref ref-type="bibr" rid="B10">2014a</xref>,<xref ref-type="bibr" rid="B11">b</xref>). Mathematically, this method can be implemented using only online computations. The Conceptor <bold><italic>C</italic></bold> associated with some input can be defined as the matrix corresponding to a soft projection on the subspace where the hidden state dynamics lie when stimulated with this input. The softness of this projection is controlled by a positive parameter &#x003B1; called the aperture. This matrix <bold><italic>C</italic></bold> can be computed using the hidden state correlation matrix <bold><italic>R</italic></bold> estimated online based on the hidden state dynamics:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold-italic"><mml:mi>C</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle mathvariant="bold-italic"><mml:mi>R</mml:mi></mml:mstyle><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>R</mml:mi></mml:mstyle><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mi mathvariant="double-struck">I</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E5"><label>(5)</label><mml:math id="M9"><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>R</mml:mi></mml:mstyle><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo stretchy='false'>)</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>R</mml:mi></mml:mstyle><mml:mi>t</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mi>t</mml:mi></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x000B7;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mi>t</mml:mi><mml:mo>&#x022BA;</mml:mo></mml:msubsup><mml:mo stretchy='false'>)</mml:mo></mml:math></disp-formula>
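<p>As a concrete illustration, Equations 4 and 5 can be sketched in a few lines of NumPy. This is a minimal sketch under assumed dimensions; the function names <monospace>update_correlation</monospace> and <monospace>conceptor</monospace> are ours, not part of the original model:</p>

```python
import numpy as np

def update_correlation(R, h, t):
    """Online estimate of the hidden state correlation matrix (Equation 5)."""
    return (1 - 1 / (t + 1)) * R + (1 / (t + 1)) * np.outer(h, h)

def conceptor(R, alpha):
    """Soft projection C = R . (R + alpha^-2 I)^-1 (Equation 4)."""
    return R @ np.linalg.inv(R + alpha ** -2 * np.eye(R.shape[0]))

# Accumulate R online from simulated hidden state dynamics.
rng = np.random.default_rng(0)
R = np.zeros((10, 10))
for t, h in enumerate(rng.standard_normal((100, 10))):
    R = update_correlation(R, h, t)
C = conceptor(R, alpha=10.0)
```

<p>The eigenvalues of <bold><italic>C</italic></bold> lie in [0, 1), so it acts as a soft projection whose hardness grows with the aperture &#x003B1;.</p>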
<p>In a continual learning setting, for each new task, we can compute the Conceptor corresponding to the complement of the subspace where the previously seen hidden states lie, as &#x1D540;&#x02212;<bold><italic>C</italic></bold>. This Conceptor is used to project the new hidden states onto a subspace orthogonal to the one in which the previously seen hidden states lie. Learning is then performed only on the synaptic weights involving the components of this subspace:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M10"><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>o</mml:mi></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>o</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x003BB;</mml:mtext><mml:msub><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x000B7;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi mathvariant='double-struck'>I</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>C</mml:mi></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x000B7;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>tanh</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x022BA;</mml:mo></mml:msup></mml:math></disp-formula>
<p>As shown in Equation 4, low values of &#x003B1; induce a Conceptor matrix close to 0, leading to a projection matrix (&#x1D540;&#x02212;<italic>C</italic>) close to the identity. Conversely, high values of &#x003B1; induce a Conceptor matrix close to the identity, leading to a hard projection that hinders learning.</p>
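<p>A minimal NumPy sketch of the gated update of Equation 6 makes this aperture behavior concrete; the function name and dimensions are our own illustrative choices:</p>

```python
import numpy as np

def conceptor_gated_update(W_o, eps_x, h, C, lr=0.01):
    """Equation 6: restrict learning to the subspace (I - C) orthogonal
    to the directions already occupied by previous tasks."""
    proj = (np.eye(C.shape[0]) - C) @ np.tanh(h)
    return W_o + lr * np.outer(eps_x, proj)

rng = np.random.default_rng(1)
W_o = rng.standard_normal((3, 10))   # output weights
h = rng.standard_normal(10)          # hidden state
eps_x = rng.standard_normal(3)       # output prediction error
```

<p>With <bold><italic>C</italic></bold> equal to the identity the update vanishes (hard projection), while with <bold><italic>C</italic></bold> equal to zero it reduces to the plain gradient step of Equation 3.</p>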
<p>The third learning mechanism that we study is the Elastic Weight Consolidation (EWC) (Kirkpatrick et al., <xref ref-type="bibr" rid="B12">2017</xref>) algorithm applied to the output weights of the RNN. On each task <italic>k</italic>, we can compute the Fisher information matrix <bold><italic>F</italic></bold><sub><italic>k</italic></sub>, where each coefficient <italic>F</italic><sub><italic>k, i</italic></sub> measures the importance of the synaptic weight <italic>W</italic><sub><italic>o, i</italic></sub>:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M11"><mml:msub><mml:mi>F</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>T</mml:mi></mml:munderover><mml:mo stretchy='false'>(</mml:mo></mml:mstyle><mml:msub><mml:mo>&#x02207;</mml:mo><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo stretchy="false">&#x02016;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo>*</mml:mo></mml:msubsup><mml:msubsup><mml:mo stretchy="false">&#x02016;</mml:mo><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M12"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the target at time <italic>t</italic> for the task <italic>k</italic>. Then, on a new task <italic>k</italic>&#x02032;, EWC minimizes the following loss function for each synaptic weight <italic>W</italic><sub><italic>o, i</italic></sub>:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:msup><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munder></mml:mstyle><mml:mfrac><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M14"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:math></inline-formula> denotes the loss for task <italic>k</italic>&#x02032; without EWC regularization, &#x003B2; is a hyperparameter controlling the importance of the new task with regard to previous tasks, and <inline-formula><mml:math id="M15"><mml:msubsup><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the <italic>i</italic>-th component of the optimal synaptic weights <inline-formula><mml:math id="M16"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> learned on task <italic>k</italic>. We optimize this loss function using gradient descent on the synaptic weights <bold><italic>W</italic></bold><sub><italic>o</italic></sub>, and obtain the following learning rule:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M17"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>o</mml:mi></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>o</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:mtext>&#x003BB;</mml:mtext><mml:msub><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mi>tanh</mml:mi><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x022BA;</mml:mo></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>&#x02212;</mml:mo><mml:mtext>&#x003BB;</mml:mtext><mml:mi>&#x003B2;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mrow></mml:munder><mml:mrow><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>F</mml:mi></mml:mstyle><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x02299;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>o</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:mstyle 
displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:msup><mml:mi>k</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup></mml:mrow></mml:munder><mml:mrow><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>F</mml:mi></mml:mstyle><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>k</mml:mi></mml:mstyle></mml:msub><mml:mo>&#x02299;</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>k</mml:mi><mml:mo>*</mml:mo></mml:msubsup></mml:mrow></mml:mstyle></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>We can observe that the second line pulls <bold><italic>W</italic><sub><italic>o</italic></sub></bold> toward the optimal output weights found for previous tasks, weighted by coefficients measuring the importance of each synaptic weight. In terms of memory requirements, we only need to store the sum of the Fisher matrices, as well as the sum of the previous optimal synaptic weights weighted by the Fisher matrices.</p>
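<p>A minimal sketch of the resulting update (Equation 9), assuming our own function name and dimensions, shows that only these two running sums need to be stored:</p>

```python
import numpy as np

def ewc_update(W_o, eps_x, h, F_sum, FW_sum, lr=0.01, beta=1.0):
    """EWC rule of Equation 9. F_sum is the summed Fisher matrices of past
    tasks; FW_sum is the Fisher-weighted sum of their optimal weights."""
    grad_step = lr * np.outer(eps_x, np.tanh(h))
    penalty = lr * beta * (F_sum * W_o - FW_sum)  # elementwise, as in Eq. 9
    return W_o + grad_step - penalty

rng = np.random.default_rng(2)
W_o = rng.standard_normal((3, 10))
h = rng.standard_normal(10)
eps_x = rng.standard_normal(3)
```

<p>When the current weights already coincide with the previous optimum, the penalty term vanishes and the rule reduces to the plain update of Equation 3.</p>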
<p>Finally, we also experiment with Generative Replay (GR) as a continual learning technique mitigating catastrophic forgetting. Since each individual task consists precisely of learning to generate the task data (the trajectory), the learned generative model can directly be used to provide samples of the previous tasks. We apply this technique to the simple ESN model described above. At each new task <italic>k</italic>&#x02032;, we create a copy of the model trained on the tasks <italic>k</italic>&#x0003C;<italic>k</italic>&#x02032;. This copy is used to generate samples {<bold><italic>x</italic></bold><sub>1</sub>, &#x02026;, <bold><italic>x</italic></bold><sub><italic>T</italic></sub>} that should be close to the previous tasks&#x00027; trajectories. During training on the task <italic>k</italic>&#x02032;, the ESN is also trained in parallel to predict these replayed trajectories, which mitigates catastrophic forgetting.</p></sec>
<sec>
<title>3.2.2. Recurrent Weights</title>
<p>For the learning of the recurrent weights, we compare three learning methods inspired by PC. All three models share the same architecture, represented in <xref ref-type="fig" rid="F3">Figure 3</xref>. On top of the top-down computations predicting the output <bold><italic>x</italic></bold><sub><italic>t</italic></sub>, these models include bottom-up computations updating the value of the hidden state, and providing a prediction error signal on the hidden layer:</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M19"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>x</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E11"><label>(11)</label><mml:math id="M20"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E12"><label>(12)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1;<sub><italic>x</italic></sub> is an update rate that weights the importance of bottom-up information for the estimation of <bold><italic>h</italic></bold><sub><italic>t</italic></sub>.</p>
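<p>These bottom-up computations (Equations 10&#x02013;12) can be sketched as follows; the function name, update rate, and dimensions are illustrative assumptions:</p>

```python
import numpy as np

def bottom_up_step(h, x, x_target, W_b, alpha_x=0.1):
    """Bottom-up pass of the PC-RNN models."""
    eps_x = x_target - x                  # Equation 10: output error
    h_star = h + alpha_x * W_b @ eps_x    # Equation 11: corrected state
    eps_h = h_star - h                    # Equation 12: hidden error
    return h_star, eps_x, eps_h

rng = np.random.default_rng(3)
h = rng.standard_normal(10)          # hidden state
x = rng.standard_normal(3)           # predicted output
W_b = rng.standard_normal((10, 3))   # feedback weights
```

<p>With a perfect prediction, both error signals vanish and the hidden state is left unchanged.</p>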
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>PC-RNN-V model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0003.tif"/>
</fig>
<p>The three models we compare share the same update rule for the recurrent weights; they differ only in their definition of the feedback weights, which in turn affects the recurrent weight update. The learning rule for the recurrent weights is based on the hidden state at time <italic>t</italic> and the prediction error on the hidden state layer at time <italic>t</italic>&#x0002B;1, according to the following equation:</p>
<disp-formula id="E13"><label>(13)</label><mml:math id="M22"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mo class="qopname">tanh</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x022BA;</mml:mo></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003BB;<sub><italic>r</italic></sub> is the learning rate of the recurrent weights.</p>
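<p>A minimal sketch of this update rule (Equation 13), with illustrative names and dimensions:</p>

```python
import numpy as np

def recurrent_update(W_r, eps_h_next, h_star, lr_r=0.01):
    """Equation 13: Hebbian-like product of the hidden prediction error at
    time t+1 and the corrected hidden state at time t."""
    return W_r + lr_r * np.outer(eps_h_next, np.tanh(h_star))

rng = np.random.default_rng(4)
W_r = rng.standard_normal((10, 10))
h_star = rng.standard_normal(10)
```
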
<p>The difference between the three models lies in the computation of <inline-formula><mml:math id="M23"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. In the first model, which we label PC-RNN-V (for Vanilla), this bottom-up computation is done using the transpose of the top-down weights used for prediction. This results in a direct minimization of VFE, as shown in Appendix A. In the other two models, these feedback and bottom-up weights are instead learned. In the original PC model described in Rao and Ballard (<xref ref-type="bibr" rid="B23">1997</xref>), it was proposed to learn these feedback weights using the same rule as Equation 3 (up to a transpose to match the shape of the feedback weights):</p>
<disp-formula id="E14"><label>(14)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:mo class="qopname">tanh</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>h</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x022BA;</mml:mo></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This learning rule ensures that, with random initializations but enough training time, the feedback weights converge to the transpose of the forward weights. Since this learning rule is a copy of the Hebbian rule used in Equation 3, we call the RNN model using this method PC-RNN-Hebb. The last model, inspired by the P-TNCN (Ororbia et al., <xref ref-type="bibr" rid="B19">2020</xref>), implements a different learning rule for the feedback weights, described by the following equation:</p>
<disp-formula id="E15"><label>(15)</label><mml:math id="M25"><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>b</mml:mi></mml:msub><mml:mo>&#x02190;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>b</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mtext>&#x003BB;</mml:mtext><mml:mi>b</mml:mi></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02212;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x000B7;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msubsup><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo>&#x022BA;</mml:mo></mml:msubsup></mml:math></disp-formula>
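<p>The two feedback learning rules (Equations 14 and 15) can be sketched side by side; function names and dimensions are our own:</p>

```python
import numpy as np

def hebbian_feedback_update(W_b, h, eps_x, lr=0.01):
    """Equation 14 (PC-RNN-Hebb): drives W_b toward the transpose of the
    forward output weights."""
    return W_b + lr * np.outer(np.tanh(h), eps_x)

def ptncn_feedback_update(W_b, eps_h, eps_h_prev, eps_x, lr_b=0.01):
    """Equation 15 (P-TNCN-inspired): uses the temporal difference of the
    hidden prediction errors."""
    return W_b - lr_b * np.outer(eps_h - eps_h_prev, eps_x)

rng = np.random.default_rng(5)
W_b = rng.standard_normal((10, 3))
h = rng.standard_normal(10)
eps_x = rng.standard_normal(3)
eps_h = rng.standard_normal(10)
```

<p>Note that the P-TNCN-inspired rule produces no update when the hidden prediction error is constant over time.</p>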
<p>The model presented in Ororbia et al. (<xref ref-type="bibr" rid="B19">2020</xref>) also implements an additional term in the learning rule for the recurrent and output weights, on top of the rules explained here. This additional term led to worse results in our experiments. For this reason, we do not detail this rule further and disable it in the experiments shown below.</p></sec>
<sec>
<title>3.2.3. Input Weights</title>
<p>Finally, we compare four methods to learn the RNN input weights. All methods share the same representation, displayed in <xref ref-type="fig" rid="F4">Figure 4</xref>. This architecture was derived following the principle of free-energy minimization (Friston and Kilner, <xref ref-type="bibr" rid="B8">2006</xref>), using a generative model that features a latent variable called hidden causes and labeled <bold><italic>c</italic></bold>. Similarly to hidden states, hidden causes are latent variables that can be dynamically inferred by the PC-RNN network. However, contrary to the hidden state variable, hidden causes are not dynamic: in the absence of prediction error, their value is stationary, <bold><italic>c</italic></bold><sub><italic>t</italic></sub> &#x0003D; <bold><italic>c</italic></bold><sub>0</sub>. The derivations of these models are summarized in Appendix A. The resulting architecture takes as input an initial value for the hidden causes and predicts an output sequence while dynamically updating the hidden states and hidden causes. During training, this input is the one-hot encoded index of the current task: <bold><italic>c</italic></bold><sub>0</sub> &#x0003D; <italic>k</italic>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>PC-RNN-HC model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0004.tif"/>
</fig>
<p>The four models differ along two dimensions: whether they use evolution strategies to estimate the input weights, and how the influence of the input on the hidden state dynamics is implemented. This influence can be either additive or multiplicative. The additive scheme is based on the following equation:</p>
<disp-formula id="E16"><label>(16)</label><mml:math id="M26"><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>&#x003C4;</mml:mi></mml:mfrac><mml:mo stretchy='false'>)</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>*</mml:mo></mml:msubsup><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>&#x003C4;</mml:mi></mml:mfrac><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mi>tanh</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>*</mml:mo></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>c</mml:mi></mml:mstyle><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:math></disp-formula>
<p>The multiplicative scheme is based on the following equation:</p>
<disp-formula id="E17"><label>(17)</label><mml:math id="M27"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>&#x003C4;</mml:mi></mml:mfrac><mml:mo stretchy='false'>)</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>*</mml:mo></mml:msubsup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>&#x003C4;</mml:mi></mml:mfrac><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>f</mml:mi><mml:mo>&#x022BA;</mml:mo></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>p</mml:mi></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mi>tanh</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>*</mml:mo></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02299;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' 
mathsize='normal'><mml:mi>c</mml:mi></mml:mstyle><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where we have introduced new synaptic weights <bold><italic>W</italic></bold><sub><italic>p</italic></sub> and <bold><italic>W</italic></bold><sub><italic>f</italic></sub>, which replace the recurrent weights of the additive version. This reparameterization reduces the total number of parameters of the multiplicative RNN, as in Annabi et al. (<xref ref-type="bibr" rid="B1">2021a</xref>,<xref ref-type="bibr" rid="B2">b</xref>).</p>
<p>We label these two models PC-RNN-HC-A and PC-RNN-HC-M, respectively, the HC suffix standing for Hidden Causes and the A and M suffixes standing for Additive and Multiplicative. The differences between the additive and multiplicative models also impact the bottom-up update rule for <bold><italic>c</italic></bold><sub><italic>t</italic></sub>. However, in our experiments, we always disable this mechanism by setting its update rate to zero.</p>
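<p>The two schemes (Equations 16 and 17) can be sketched as follows; the function names and weight dimensions are illustrative assumptions:</p>

```python
import numpy as np

def additive_step(h_prev, c_prev, W_r, W_i, tau=5.0):
    """Equation 16 (PC-RNN-HC-A): the hidden causes enter additively."""
    drive = W_r @ np.tanh(h_prev) + W_i @ c_prev
    return (1 - 1 / tau) * h_prev + (1 / tau) * drive

def multiplicative_step(h_prev, c_prev, W_p, W_f, W_i, tau=5.0):
    """Equation 17 (PC-RNN-HC-M): the hidden causes gate the recurrent
    drive elementwise through the W_p / W_f reparameterization."""
    drive = W_f.T @ ((W_p @ np.tanh(h_prev)) * (W_i @ c_prev))
    return (1 - 1 / tau) * h_prev + (1 / tau) * drive

rng = np.random.default_rng(6)
h = rng.standard_normal(10)
c = np.eye(4)[0]                       # one-hot task index
W_r = rng.standard_normal((10, 10))
W_i_a = rng.standard_normal((10, 4))   # additive input weights
W_p = rng.standard_normal((20, 10))
W_f = rng.standard_normal((20, 10))
W_i_m = rng.standard_normal((20, 4))   # multiplicative input weights
```
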
<p>In the first two methods, the learning rules for the input weights follow PC theory and aim to minimize the prediction error on the hidden layer. The learning rule used for the PC-RNN-HC-A model is the following:</p>
<disp-formula id="E18"><label>(18)</label><mml:math id="M29"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>c</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x022BA;</mml:mo></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>For the PC-RNN-HC-M model, we obtain the following rule:</p>
<disp-formula id="E19"><label>(19)</label><mml:math id="M30"><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mtext>&#x003BB;</mml:mtext><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>p</mml:mi></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mi>tanh</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>h</mml:mi></mml:mstyle><mml:mi>t</mml:mi><mml:mo>*</mml:mo></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02299;</mml:mo><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>W</mml:mi></mml:mstyle><mml:mi>f</mml:mi></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mstyle mathvariant='bold-italic' mathsize='normal'><mml:mi>c</mml:mi></mml:mstyle><mml:mi>t</mml:mi><mml:mo>&#x022BA;</mml:mo></mml:msubsup></mml:math></disp-formula>
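<p>Both input weight rules (Equations 18 and 19) can be sketched as follows, with illustrative names and dimensions:</p>

```python
import numpy as np

def input_update_additive(W_i, eps_h_next, c, lr_i=0.01):
    """Equation 18 (PC-RNN-HC-A)."""
    return W_i + lr_i * np.outer(eps_h_next, c)

def input_update_multiplicative(W_i, eps_h_next, h_star, W_p, W_f, c,
                                lr_i=0.01):
    """Equation 19 (PC-RNN-HC-M): the hidden error is mapped through W_f
    and gated by the recurrent drive W_p . tanh(h*)."""
    gated = (W_p @ np.tanh(h_star)) * (W_f @ eps_h_next)
    return W_i + lr_i * np.outer(gated, c)

rng = np.random.default_rng(7)
c = np.eye(4)[1]                       # one-hot task index
eps_h = rng.standard_normal(10)
h_star = rng.standard_normal(10)
W_p = rng.standard_normal((20, 10))
W_f = rng.standard_normal((20, 10))
W_i_a = rng.standard_normal((10, 4))
W_i_m = rng.standard_normal((20, 4))
```
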
<p>The third and fourth methods that we study are based on the PC-RNN-HC-A and PC-RNN-HC-M models, respectively, but instead use random search to optimize the weights <bold><italic>W</italic></bold><sub><italic>i</italic></sub>. Our implementation of this random search is inspired by the learning algorithm proposed in Pitti et al. (<xref ref-type="bibr" rid="B21">2017</xref>):</p>
<disp-formula id="E20"><label>(20)</label><mml:math id="M31"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B4;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E21"><label>(21)</label><mml:math id="M32"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:mtext class="textrm" mathvariant="normal">simulate</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B4;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E22"><label>(22)</label><mml:math id="M33"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003F5;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where the function <italic>sign</italic> maps negative values to &#x02212;1 and positive values to 1. At each training iteration <italic>i</italic>, the algorithm samples a noise matrix <bold><italic>&#x003B4;</italic></bold><sub><italic>i</italic></sub> that is added to the input weights of the RNN. After generating sequences with the perturbed weights, the difference between the old and new average norms of the prediction error, ||<bold><italic>&#x003F5;</italic></bold><sub><italic>x, i</italic>&#x02212;1</sub>||<sub>2</sub>&#x02212;||<bold><italic>&#x003F5;</italic></bold><sub><italic>x, i</italic></sub>||<sub>2</sub>, measures the success of the addition of <bold><italic>&#x003B4;</italic></bold><sub><italic>i</italic></sub> and weights the update of <bold><italic>W</italic></bold><sub><italic>i</italic></sub>. Since this algorithm relies only on an average of the prediction error over the predicted sequences, which can be computed iteratively, it qualifies as an online learning algorithm.</p>
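<p>This random-search update can be sketched in a few lines (an illustrative sketch with hypothetical function names; the actual implementation is in the linked repository):</p>

```python
import numpy as np

def random_search_step(W_i, simulate, prev_error, sigma=0.01):
    """One random-search update of the input weights (Equations 20-22).

    `simulate` is assumed to return the average prediction error norm
    obtained when generating sequences with the given input weights.
    """
    delta = np.random.normal(0.0, sigma, size=W_i.shape)  # Equation (20)
    new_error = simulate(W_i + delta)                      # Equation (21)
    # Keep the perturbation if it reduced the error, revert it otherwise
    W_i = W_i + delta * np.sign(prev_error - new_error)    # Equation (22)
    return W_i, new_error
```

<p>Only the running average of the prediction error needs to be stored between iterations, which is what makes the rule online.</p>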
<p>In summary, we have identified four learning algorithms for the output weights, three for the recurrent weights, and four for the input weights. To connect the proposed methods to the classification of continual learning methods presented above, the Conceptors method can be categorized as a regularization method, while associating each new task with a new input to the RNN, in the form of hidden causes, can be seen as an architectural method.</p></sec></sec></sec>
<sec sec-type="results" id="s4">
<title>4. Results</title>
<sec>
<title>4.1. Hyperparameter Optimization</title>
<p>The source code for the experiments presented in this section is available on GitHub<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>. It contains our implementation of the different models as well as the hyperparameter optimization method. In Appendix B, we provide the optimal hyperparameters found for each model.</p>
<p>We start by showing an example of hyperparameter optimization in <xref ref-type="fig" rid="F5">Figure 5</xref>, performed on the EWC model with <italic>d</italic><sub><italic>h</italic></sub> &#x0003D; 300. The optimized hyperparameters are the learning rate of the output weights, &#x003BB;, and the coefficient &#x003B2;. After trying 200 hyperparameter configurations, the optimizer can estimate the score of any configuration within the given range of values. The figure displays the estimated score as a function of &#x003BB;, using the optimal value of &#x003B2;, and as a function of &#x003B2;, using the optimal value of &#x003BB;. We can see that the estimated score decreases monotonically with &#x003B2;, while it increases steadily with &#x003BB; before dropping once the learning rate reaches values that no longer allow the gradient descent to converge.</p>
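<p>The search protocol can be approximated by the following sketch (a simple random-search stand-in for the surrogate-based optimizer actually used; names and bounds are hypothetical):</p>

```python
import math
import random

def optimize_hyperparameters(evaluate, ranges, n_trials=200):
    """Pick the hyperparameter configuration with the best validation score.

    `evaluate` returns a score (lower is better) for a configuration;
    `ranges` maps hyperparameter names to (low, high) bounds, sampled
    log-uniformly, as is usual for learning rates and regularization
    coefficients.
    """
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        # Sample each hyperparameter log-uniformly within its bounds
        config = {name: 10 ** random.uniform(math.log10(lo), math.log10(hi))
                  for name, (lo, hi) in ranges.items()}
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score
```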
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Score estimation of the hyperparameter optimizer with regard to the learning rate &#x003BB; and the coefficient &#x003B2;, for the EWC model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0005.tif"/>
</fig>
<p>In this case, the hyperparameter optimization has found that the EWC regularization does not improve the final score, and suggests using the lowest possible value for the coefficient &#x003B2;. When &#x003B2; increases, the regularization mitigates catastrophic forgetting but prevents proper learning of new tasks.</p>
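<p>As a reminder, EWC (Kirkpatrick et al., 2017) penalizes deviations of the weights from the values they had at the end of previous tasks, in proportion to their estimated Fisher information. A minimal sketch of the resulting gradient on the output weights (hypothetical names, not the exact implementation used here):</p>

```python
import numpy as np

def ewc_gradient(W_o, grad_task, fisher, W_o_star, beta):
    """Gradient of the EWC-regularized loss with regard to the output weights.

    grad_task : gradient of the prediction error on the current task
    fisher    : approximate Fisher information of each weight, estimated
                on the previous tasks
    W_o_star  : output weights stored at the end of the previous tasks
    beta      : regularization coefficient (the hyperparameter beta above)
    """
    # The penalty beta/2 * sum(F * (W - W*)^2) adds beta * F * (W - W*)
    # to the gradient, anchoring important weights to their old values.
    return grad_task + beta * fisher * (W_o - W_o_star)
```

<p>When &#x003B2; tends to 0 the penalty vanishes and the update reduces to plain stochastic gradient descent, which is the regime selected by the optimizer here.</p>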
<p>For all the results presented below, we perform optimization of the hyperparameters following the same protocol.</p></sec>
<sec>
<title>4.2. Output Weights</title>
<p>In <xref ref-type="fig" rid="F6">Figure 6</xref>, we represent the average prediction error over 10 seeds for the continual learning of 20 sequential patterns obtained on the test set, with the hyperparameters found using the protocol described before. The vertical dashed lines in these figures delimit each of the training tasks. The colored lines represent the individual prediction error for each of the 20 sequence patterns (averaged over the 10 seeds). Finally, the black line represents the average prediction error over all the sequence patterns (averaged over the 10 seeds).</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Continual learning results with the ESN model (left) and the Conceptors model (right). We represent the average prediction error over 10 seeds, for the continual learning of 20 sequential patterns, obtained on the first test set. The colored lines correspond to the prediction error on each individual task, and the black line corresponds to the prediction error averaged on all tasks. The 20 tasks are delimited by the dashed gray lines.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0006.tif"/>
</fig>
<p>During each task (for each colored line), we can observe that one of the individual prediction errors decreases rapidly, while the other prediction errors change only slightly. Once the training task corresponding to a certain sequence pattern <italic>k</italic> is over, the prediction error associated with this pattern tends to increase. The best learning mechanism is the one that most limits this undesirable forgetting of previously learned sequence patterns. We can observe in <xref ref-type="fig" rid="F6">Figure 6</xref> that the Conceptors learning mechanism limits forgetting compared with the standard stochastic gradient descent rule used in our ESN model.</p>
<p>At first glance, it may seem surprising that, for each individual task, the corresponding prediction error reaches a lower value for the Conceptors model than for the ESN model. In terms of learning rules, the ESN model could potentially learn each pattern with better accuracy by increasing the learning rate. However, the hyperparameter optimizer has estimated that an increased learning rate would be detrimental to the complete continual learning task: it might improve the learning of every individual task, but it would also lead to more forgetting throughout the complete task. It is only because the Conceptors learning mechanism naturally limits forgetting that the hyperparameter optimizer &#x0201C;allows&#x0201D; a higher learning rate and, thus, better learning on each individual task.</p>
<p>We can also observe that the prediction error level reached during each individual task with the Conceptors model seems to increase throughout the complete task. We suppose that this is a consequence of further learning being prevented on the synaptic connections associated with the Conceptors of previous tasks. When a large number of individual tasks are over, learning is limited to synapses corresponding to a subspace of the hidden state space not claimed by any of the previous Conceptors. Decreasing the aperture &#x003B1; would allow better learning of the late tasks, but at the cost of increased forgetting of the early tasks.</p>
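<p>For reference, a Conceptor can be computed from the correlation matrix of the hidden state trajectory, with the aperture controlling the trade-off described above (a sketch following Jaeger, 2014b; not the exact implementation used here):</p>

```python
import numpy as np

def conceptor(states, alpha):
    """Conceptor C = R (R + alpha^-2 I)^-1 from hidden states of shape (T, d_h).

    R is the correlation matrix of the states. The aperture alpha sets how
    much of the state space the conceptor claims: larger apertures push the
    singular values of C toward 1 (more protection of the task, less room
    left for later tasks), smaller apertures push them toward 0.
    """
    T, d_h = states.shape
    R = states.T @ states / T                              # state correlation matrix
    return R @ np.linalg.inv(R + alpha ** (-2) * np.eye(d_h))
```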
<p><xref ref-type="fig" rid="F7">Figure 7</xref> compiles these previous figures to compare the average prediction error using the four learning mechanisms for output weights. At the end of the training, we can see that the Conceptors model and generative replay achieve a significantly lower prediction error than the ESN using the standard stochastic gradient descent rule and the EWC regularization for the learning of the output weights.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Comparison between the four learning methods for the output weights on the first test set. The 20 tasks are delimited by the dashed gray lines.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0007.tif"/>
</fig>
<p>As explained in the previous section, the hyperparameters found for EWC correspond to a configuration where the regularization is almost removed; the EWC model thus has the same performance as the ESN model.</p>
<p>The generative replay strategy outperforms all other approaches, but at the cost of a longer training time. Indeed, at each task <italic>k</italic>, the model is trained on (<italic>k</italic>&#x02212;1) replayed trajectories in addition to the current trajectory. For all models, we have limited the number of training iterations on each task, which gives generative replay an unfair advantage in our experiments. For this reason, we do not include this technique in the remaining comparisons.</p>
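<p>The source of this extra cost is the replay loop: before learning task <italic>k</italic>, a frozen copy of the model generates one trajectory per past task, and these trajectories are interleaved with the current one (a simplified sketch with hypothetical method names):</p>

```python
import copy

def train_with_generative_replay(model, tasks, n_iterations):
    """Sequentially train `model` on `tasks`, replaying past tasks.

    `model` is assumed to expose `generate(task_index)` returning a
    trajectory and `fit(trajectory, task_index)` performing one update.
    """
    for k, trajectory in enumerate(tasks):
        # Freeze a copy of the model to replay the (k - 1) previous tasks
        old_model = copy.deepcopy(model)
        replayed = [old_model.generate(j) for j in range(k)]
        for _ in range(n_iterations):
            for j, traj in enumerate(replayed + [trajectory]):
                model.fit(traj, j)  # k trajectories per iteration at task k
    return model
```

<p>The number of gradient updates thus grows linearly with the task index, which is why equalizing the iteration budget across models favors this strategy.</p>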
<p>The results obtained with these models on the three data sets (validation set and two test sets) are provided in <xref ref-type="table" rid="T2">Table 2</xref>, together with the results for the learning of recurrent and input weights, discussed in the next sections.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Average prediction error after training on all <italic>p</italic> tasks.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center"><bold>Validation</bold></th>
<th valign="top" align="center"><bold>Test 1</bold></th>
<th valign="top" align="center"><bold>Test 2</bold></th>
</tr>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>(MOCAP</bold>,</th>
<th valign="top" align="center"><bold>(MOCAP</bold>,</th>
<th valign="top" align="center"><bold>(Handwriting</bold>,</th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold><italic>p</italic> &#x0003D; 15)</bold></th>
<th valign="top" align="center"><bold><italic>p</italic> &#x0003D; 20)</bold></th>
<th valign="top" align="center"><bold><italic>p</italic> &#x0003D; 20)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">ESN</td>
<td valign="top" align="center">0.90 &#x000B1; 0.07</td>
<td valign="top" align="center">1.37 &#x000B1; 0.14</td>
<td valign="top" align="center">0.71 &#x000B1; 0.04</td>
</tr>
<tr>
<td valign="top" align="left">EWC</td>
<td valign="top" align="center">0.90 &#x000B1; 0.09</td>
<td valign="top" align="center">1.35 &#x000B1; 0.15</td>
<td valign="top" align="center">0.69 &#x000B1; 0.05</td>
</tr>
<tr>
<td valign="top" align="left">Conceptors</td>
<td valign="top" align="center">0.31 &#x000B1; 0.02</td>
<td valign="top" align="center">0.52 &#x000B1; 0.04</td>
<td valign="top" align="center">0.27 &#x000B1; 0.02</td>
</tr>
<tr>
<td valign="top" align="left">ESN &#x0002B; GR</td>
<td valign="top" align="center"><bold>0.29 &#x000B1; 0.01</bold></td>
<td valign="top" align="center"><bold>0.39 &#x000B1; 0.01</bold></td>
<td valign="top" align="center"><bold>0.22 &#x000B1; 0.01</bold></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">PC-RNN-V</td>
<td valign="top" align="center">0.87 &#x000B1; 0.09</td>
<td valign="top" align="center">1.41 &#x000B1; 0.14</td>
<td valign="top" align="center">0.79 &#x000B1; 0.10</td>
</tr>
<tr>
<td valign="top" align="left">P-TNCN</td>
<td valign="top" align="center">0.90 &#x000B1; 0.08</td>
<td valign="top" align="center">1.42 &#x000B1; 0.18</td>
<td valign="top" align="center">0.71 &#x000B1; 0.05</td>
</tr>
<tr>
<td valign="top" align="left">PC-RNN-Hebb</td>
<td valign="top" align="center">0.90 &#x000B1; 0.07</td>
<td valign="top" align="center">1.41 &#x000B1; 0.10</td>
<td valign="top" align="center">0.73 &#x000B1; 0.05</td>
</tr>
<tr>
<td valign="top" align="left">PC-RNN-HC-A</td>
<td valign="top" align="center"><bold>0.74 &#x000B1; 0.09</bold></td>
<td valign="top" align="center"><bold>1.28 &#x000B1; 0.22</bold></td>
<td valign="top" align="center"><bold>0.59 &#x000B1; 0.04</bold></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">PC-RNN-HC-M</td>
<td valign="top" align="center">0.81 &#x000B1; 0.04</td>
<td valign="top" align="center">1.32 &#x000B1; 0.09</td>
<td valign="top" align="center">0.77 &#x000B1; 0.05</td>
</tr>
<tr>
<td valign="top" align="left">PC-RNN-HC-A-RS</td>
<td valign="top" align="center">0.90 &#x000B1; 0.08</td>
<td valign="top" align="center">1.39 &#x000B1; 0.15</td>
<td valign="top" align="center">0.77 &#x000B1; 0.05</td>
</tr>
<tr>
<td valign="top" align="left">PC-RNN-HC-M-RS</td>
<td valign="top" align="center">0.93 &#x000B1; 0.06</td>
<td valign="top" align="center">1.38 &#x000B1; 0.10</td>
<td valign="top" align="center">0.72 &#x000B1; 0.05</td>
</tr>
<tr>
<td valign="top" align="left">PC-Conceptors</td>
<td valign="top" align="center"><bold>0.28 &#x000B1; 0.01</bold></td>
<td valign="top" align="center"><bold>0.36 &#x000B1; 0.02</bold></td>
<td valign="top" align="center"><bold>0.18 &#x000B1; 0.01</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TN1"><p><italic>Bold value indicates the best performance in each group of models</italic>.</p></fn>
</table-wrap-foot>
</table-wrap></sec>
<sec>
<title>4.3. Recurrent Weights</title>
<p>In this second experiment, we compare the PC-RNN-V with two variants that use learning rules for the feedback weights instead of the transposed feedforward weights. In the end, these three methods provide different update rules for the recurrent weights of the RNN. The results of this second comparative analysis are provided in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<p>We can see that none of the three models brings any significant improvement over the ESN, which is exactly the same model without any learning on the recurrent weights. In terms of hyperparameters, only the PC-RNN-V has an optimal learning rate for the recurrent weights that does not correspond to the lowest value authorized during hyperparameter optimization. This means that for both the P-TNCN and PC-RNN-Hebb models, the hyperparameter optimizer has estimated that training the recurrent weights only worsens the final prediction error. For the PC-RNN-V model, a slight improvement was found on the validation set using the learning rule for recurrent weights, but this improvement does not transfer to the two test sets.</p>
<p>We can conclude from these results that learning the recurrent weights in a continual learning setting is difficult and can easily lead to more catastrophic forgetting.</p></sec>
<sec>
<title>4.4. Input Weights</title>
<p><xref ref-type="fig" rid="F8">Figure 8</xref> displays the results obtained with the four learning mechanisms for input weights, with the ESN as a baseline measuring the improvement brought by learning in the input layer. The results on the validation set and the other test sets are displayed in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Comparison between the three learning methods for the input weights. The PC-RNN-V model, where no learning is performed on the input weights, is also displayed as a baseline. The 20 tasks are delimited by the dashed gray lines.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0008.tif"/>
</fig>
<p>These results suggest that the learning methods using random search (RS suffix) perform poorly compared with the corresponding learning rules relying on the propagation of errors using PC. The two models using random search perform similarly to the baseline ESN model. This observation is surprising, since the <bold><italic>W</italic></bold><sub><italic>i</italic></sub> weights in the PC-RNN-HC-A/M architectures are directly factored according to each individual task. Indeed, during task <italic>k</italic>, learning can be limited to the <italic>k</italic>-th column of the <bold><italic>W</italic></bold><sub><italic>i</italic></sub> weights, since these are the only weights that influence the RNN trajectory. Consequently, training this layer should not cause any additional forgetting, and should only bring improvements over the baseline ESN model. Since the two models using random search did not bring any improvement, we suppose that this is due to the limited number of iterations allowed for training on each individual task. We observed that, in general, training with random search, as in the INFERNO model (Pitti et al., <xref ref-type="bibr" rid="B21">2017</xref>), requires many more iterations than gradient-based methods.</p>
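<p>The task-wise factorization described above can be made explicit: with one-hot hidden causes, only one column of <bold><italic>W</italic></bold><sub><italic>i</italic></sub> is active during each task, so the update can be masked accordingly (an illustrative sketch, not the exact implementation):</p>

```python
import numpy as np

def masked_input_update(W_i, grad, task_index, lr):
    """Restrict the input weight update to the current task's column.

    With one-hot hidden causes c_t, the gradient epsilon_h . c_t^T is
    already zero outside column `task_index`; masking makes the task-wise
    factorization of W_i explicit, so that training the input layer
    cannot overwrite the columns used by previous tasks.
    """
    update = np.zeros_like(W_i)
    update[:, task_index] = grad[:, task_index]
    return W_i + lr * update
```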
<p>The PC-RNN-HC-A/M models trained with the PC-based learning rules still showed a significant improvement over the ESN baseline, with the PC-RNN-HC-A model performing slightly better than the PC-RNN-HC-M model. From this experiment, we conclude that the learning rule for input weights proposed by the PC-RNN-HC-A model is the best suited to a continual learning setting.</p></sec>
<sec>
<title>4.5. Combining Conceptors and Hidden Causes</title>
<p>Finally, we investigate whether these different learning mechanisms combine well with each other. We implement the Conceptors learning rule on the output weights of a PC-RNN-HC-A model, a new model that we label PC-Conceptors, as represented in <xref ref-type="fig" rid="F9">Figure 9</xref>. <xref ref-type="fig" rid="F10">Figure 10</xref> displays the prediction error on each individual task, as well as the average prediction error, throughout learning with this model. Interestingly, virtually no forgetting seems to occur during learning, as each individual prediction error plateaus after decreasing during its corresponding task.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>PC-Conceptors model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0009.tif"/>
</fig>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>Continual learning results using the PC-Conceptors. We represent the average prediction error over 10 seeds, for the continual learning of 20 sequential patterns, using the PC-RNN-HC-A model with Conceptors. The colored lines correspond to the prediction error on each individual task, and the black line corresponds to the prediction error averaged on all tasks. The 20 tasks are delimited by the dashed gray lines.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0010.tif"/>
</fig>
<p>Additionally, the hyperparameter optimizer in this case recommended using the lowest possible value for the recurrent weights learning rate. This suggests that learning the recurrent weights negatively interferes with the Conceptors mechanism. The Conceptors model might be sensitive to recurrent weight learning, since such learning could turn the previously learned Conceptors into obsolete descriptors of the corresponding hidden state trajectories.</p>
<p>We compare these results with the ESN, Conceptors and PC-RNN-HC-A models in <xref ref-type="fig" rid="F11">Figure 11</xref>, which confirms that this combination of learning methods seems to provide the RNN model best suited for online continual learning.</p>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>Comparison between the ESN, the Conceptors model, the PC-RNN-HC-A model, and the PC-Conceptors model on the first test set. The 20 tasks are delimited by the dashed gray lines.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-845955-g0011.tif"/>
</fig></sec></sec>
<sec sec-type="discussion" id="s5">
<title>5. Discussion</title>
<p>Overall, this study suggests that regularization methods such as Conceptors, and architectural methods, as proposed in the PC-RNN-HC architectures, can help design RNN models with online learning rules suitable for continual learning.</p>
<p>Additionally, we have found that combining Conceptors-based learning for the output weights with PC-based learning for the input weights further improves the model precision. In future work, it would be interesting to investigate whether the combination of these two mechanisms could be improved. In particular, the learning of the input weights is driven only by the minimization of the prediction error on the recurrent layer. It could be improved by integrating an orthogonality criterion into the learning rule: if the input weights are optimized so as to decorrelate the different hidden state trajectories, the learning of the output weights could be facilitated.</p>
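<p>One possible formulation of such a criterion, given here only as a speculative sketch that was not evaluated in this study, would penalize the inner products between the hidden states visited during different tasks:</p>

```python
import numpy as np

def decorrelation_penalty(states_a, states_b):
    """Mean squared inner product between the hidden states of two tasks.

    states_a, states_b: arrays of shape (T, d_h). The penalty is zero
    when the two trajectories occupy orthogonal subspaces of the hidden
    state space, in which case the output weights serving one task do
    not interfere with the other.
    """
    gram = states_a @ states_b.T  # pairwise inner products between states
    return float(np.mean(gram ** 2))
```

<p>Minimizing this penalty with regard to the input weights would push the trajectories of different tasks toward orthogonal subspaces, which is the property that the Conceptors mechanism exploits.</p>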
<p>The models we have proposed also suffer from another limitation that should be addressed in future work. The models were trained using the current task index as input, information that might not be available in realistic lifelong learning settings. Instead, the model should be able to detect a distributional shift when it occurs and adapt its learning rules accordingly.</p>
<sec sec-type="data-availability" id="s6">
<title>Data Availability Statement</title>
<p>The datasets presented in this study can be found in online repositories. The name of the repository and accession number can be found below: GitHub, <ext-link ext-link-type="uri" xlink:href="https://github.com/sino7/continual_online_learning_rnn_benchmark">https://github.com/sino7/continual_online_learning_rnn_benchmark</ext-link>.</p></sec>
<sec id="s7">
<title>Author Contributions</title>
<p>The models and experiments were designed by LA, AP, and MQ. The models and experiments were implemented by LA. The article was written by LA with instructions and feedback from AP and MQ. All authors contributed to the article and approved the submitted version.</p></sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This study was funded by the Cergy-Paris University Foundation (Facebook grant) and partially by Labex MME-DII, France (ANR11-LBX-0023-01).</p></sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p></sec> </body>
<back>
<sec sec-type="supplementary-material" id="s10">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fnbot.2022.845955/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fnbot.2022.845955/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Annabi</surname> <given-names>L.</given-names></name> <name><surname>Pitti</surname> <given-names>A.</given-names></name> <name><surname>Quoy</surname> <given-names>M.</given-names></name></person-group> (<year>2021a</year>). <article-title>Bidirectional interaction between visual and motor generative models using predictive coding and active inference</article-title>. <source>Neural Netw</source>. <volume>143</volume>, <fpage>638</fpage>&#x02013;<lpage>656</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2021.07.016</pub-id><pub-id pub-id-type="pmid">34343777</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Annabi</surname> <given-names>L.</given-names></name> <name><surname>Pitti</surname> <given-names>A.</given-names></name> <name><surname>Quoy</surname> <given-names>M.</given-names></name></person-group> (<year>2021b</year>). <article-title>&#x0201C;A predictive coding account for chaotic itinerancy,&#x0201D;</article-title> in <source>Artificial Neural Networks and Machine Learning-ICANN 2021</source>, eds I. Farka&#x00161;, P. Masulli, S. Otte, and S. Wermter (<publisher-loc>Cham</publisher-loc>: <publisher-loc>Springer International Publishing</publisher-loc>), <fpage>581</fpage>&#x02013;<lpage>592</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Clark</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Whatever next? Predictive brains, situated agents, and the future of cognitive science</article-title>. <source>Behav. Brain Sci</source>. <volume>36</volume>, <fpage>181</fpage>&#x02013;<lpage>204</lpage>. <pub-id pub-id-type="doi">10.1017/S0140525X12000477</pub-id><pub-id pub-id-type="pmid">23663408</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Collins</surname> <given-names>J.</given-names></name> <name><surname>Sohl-Dickstein</surname> <given-names>J.</given-names></name> <name><surname>Sussillo</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>Capacity and trainability in recurrent neural networks</article-title>. <source>stat</source> <volume>1050</volume>:<fpage>29</fpage>.<pub-id pub-id-type="pmid">30407876</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cossu</surname> <given-names>A.</given-names></name> <name><surname>Bacciu</surname> <given-names>D.</given-names></name> <name><surname>Carta</surname> <given-names>A.</given-names></name> <name><surname>Gallicchio</surname> <given-names>C.</given-names></name> <name><surname>Lomonaco</surname> <given-names>V.</given-names></name></person-group> (<year>2021a</year>). <article-title>Continual learning with echo state networks</article-title>. <source>arXiv preprint</source> arXiv:2105.07674. <pub-id pub-id-type="doi">10.14428/esann/2021.ES2021-80</pub-id><pub-id pub-id-type="pmid">31976910</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cossu</surname> <given-names>A.</given-names></name> <name><surname>Carta</surname> <given-names>A.</given-names></name> <name><surname>Lomonaco</surname> <given-names>V.</given-names></name> <name><surname>Bacciu</surname> <given-names>D.</given-names></name></person-group> (<year>2021b</year>). <article-title>Continual learning for recurrent neural networks: an empirical evaluation</article-title>. <source>Neural Netw</source>. <volume>143</volume>, <fpage>607</fpage>&#x02013;<lpage>627</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2021.07.021</pub-id><pub-id pub-id-type="pmid">34343775</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Dua</surname> <given-names>D.</given-names></name> <name><surname>Graff</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <source>Uci machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://archive.ics.uci.edu/ml">http://archive.ics.uci.edu/ml</ext-link>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Friston</surname> <given-names>K.</given-names></name> <name><surname>Kilner</surname> <given-names>J.</given-names></name></person-group> (<year>2006</year>). <article-title>A free energy principle for the brain</article-title>. <source>J. Physiol. Paris</source> <volume>100</volume>:<fpage>70</fpage>&#x02013;<lpage>87</lpage>. <pub-id pub-id-type="doi">10.1016/j.jphysparis.2006.10.001</pub-id><pub-id pub-id-type="pmid">17097864</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jaeger</surname> <given-names>H.</given-names></name></person-group> (<year>2001</year>). <source>The &#x0201C;echo state&#x0201D; approach to analysing and training recurrent neural networks</source>. GMD-Report 148, German National Research Institute for Computer Science.</citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jaeger</surname> <given-names>H.</given-names></name></person-group> (<year>2014a</year>). <article-title>Conceptors: an easy introduction</article-title>. <source>CoRR abs/1406.2671</source>.</citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jaeger</surname> <given-names>H.</given-names></name></person-group> (<year>2014b</year>). <article-title>Controlling recurrent neural networks by conceptors</article-title>. <source>CoRR abs/1403.3369</source>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kirkpatrick</surname> <given-names>J.</given-names></name> <name><surname>Pascanu</surname> <given-names>R.</given-names></name> <name><surname>Rabinowitz</surname> <given-names>N.</given-names></name> <name><surname>Veness</surname> <given-names>J.</given-names></name> <name><surname>Desjardins</surname> <given-names>G.</given-names></name> <name><surname>Rusu</surname> <given-names>A. A.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Overcoming catastrophic forgetting in neural networks</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>114</volume>, <fpage>3521</fpage>&#x02013;<lpage>3526</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1611835114</pub-id><pub-id pub-id-type="pmid">28292907</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Hoiem</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>Learning without forgetting</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>40</volume>, <fpage>2935</fpage>&#x02013;<lpage>2947</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2017.2773081</pub-id><pub-id pub-id-type="pmid">29990101</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lukosevicius</surname> <given-names>M.</given-names></name> <name><surname>Jaeger</surname> <given-names>H.</given-names></name></person-group> (<year>2009</year>). <article-title>Reservoir computing approaches to recurrent neural network training</article-title>. <source>Comput. Sci. Rev</source>. <volume>3</volume>, <fpage>127</fpage>&#x02013;<lpage>149</lpage>. <pub-id pub-id-type="doi">10.1016/j.cosrev.2009.03.005</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Maass</surname> <given-names>W.</given-names></name> <name><surname>Natschl&#x000E4;ger</surname> <given-names>T.</given-names></name> <name><surname>Markram</surname> <given-names>H.</given-names></name></person-group> (<year>2002</year>). <article-title>Real-time computing without stable states: a new framework for neural computation based on perturbations</article-title>. <source>Neural Comput</source>. <volume>14</volume>, <fpage>2531</fpage>&#x02013;<lpage>2560</lpage>. <pub-id pub-id-type="doi">10.1162/089976602760407955</pub-id><pub-id pub-id-type="pmid">12433288</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mallya</surname> <given-names>A.</given-names></name> <name><surname>Davis</surname> <given-names>D.</given-names></name> <name><surname>Lazebnik</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Piggyback: adapting a single network to multiple tasks by learning to mask weights,&#x0201D;</article-title> in <source>Proceedings of the European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Munich</publisher-loc>), <fpage>67</fpage>&#x02013;<lpage>82</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McCloskey</surname> <given-names>M.</given-names></name> <name><surname>Cohen</surname> <given-names>N. J.</given-names></name></person-group> (<year>1989</year>). <article-title>Catastrophic interference in connectionist networks: the sequential learning problem</article-title>. <source>Psychol. Learn. Motivat</source>. <volume>24</volume>, <fpage>109</fpage>&#x02013;<lpage>165</lpage>. <pub-id pub-id-type="doi">10.1016/S0079-7421(08)60536-8</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Millidge</surname> <given-names>B.</given-names></name> <name><surname>Tschantz</surname> <given-names>A.</given-names></name> <name><surname>Buckley</surname> <given-names>C. L.</given-names></name></person-group> (<year>2020</year>). <article-title>Predictive coding approximates backprop along arbitrary computation graphs</article-title>. <source>CoRR abs/2006.04182</source>.<pub-id pub-id-type="pmid">35534010</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ororbia</surname> <given-names>A.</given-names></name> <name><surname>Mali</surname> <given-names>A.</given-names></name> <name><surname>Giles</surname> <given-names>C. L.</given-names></name> <name><surname>Kifer</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <article-title>Continual learning of recurrent neural networks by locally aligning distributed representations</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>31</volume>, <fpage>4267</fpage>&#x02013;<lpage>4278</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2019.2953622</pub-id><pub-id pub-id-type="pmid">31976910</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pascanu</surname> <given-names>R.</given-names></name> <name><surname>Mikolov</surname> <given-names>T.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;On the difficulty of training recurrent neural networks,&#x0201D;</article-title> in <source>Proceedings of the 30th International Conference on International Conference on Machine Learning</source> - <italic>Volume 28, ICML&#x00027;13</italic> (JMLR.org), III&#x02013;1310&#x02013;III&#x02013;1318.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pitti</surname> <given-names>A.</given-names></name> <name><surname>Gaussier</surname> <given-names>P.</given-names></name> <name><surname>Quoy</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Iterative free-energy optimization for recurrent neural networks (inferno)</article-title>. <source>PLoS ONE</source> <volume>12</volume>, <fpage>e0173684</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0173684</pub-id><pub-id pub-id-type="pmid">28282439</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rao</surname> <given-names>R.</given-names></name> <name><surname>Ballard</surname> <given-names>D.</given-names></name></person-group> (<year>1999</year>). <article-title>Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects</article-title>. <source>Nat. Neurosci</source>. <volume>2</volume>, <fpage>79</fpage>&#x02013;<lpage>87</lpage>. <pub-id pub-id-type="doi">10.1038/4580</pub-id><pub-id pub-id-type="pmid">10195184</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rao</surname> <given-names>R. P. N.</given-names></name> <name><surname>Ballard</surname> <given-names>D. H.</given-names></name></person-group> (<year>1997</year>). <article-title>Dynamic model of visual recognition predicts neural response properties in the visual cortex</article-title>. <source>Neural Comput</source>. <volume>9</volume>, <fpage>721</fpage>&#x02013;<lpage>763</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1997.9.4.721</pub-id><pub-id pub-id-type="pmid">9161021</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rebuffi</surname> <given-names>S.-A.</given-names></name> <name><surname>Kolesnikov</surname> <given-names>A.</given-names></name> <name><surname>Sperl</surname> <given-names>G.</given-names></name> <name><surname>Lampert</surname> <given-names>C. H.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;iCaRL: incremental classifier and representation learning,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2001</fpage>&#x02013;<lpage>2010</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmidhuber</surname> <given-names>J.</given-names></name> <name><surname>Wierstra</surname> <given-names>D.</given-names></name> <name><surname>Gagliolo</surname> <given-names>M.</given-names></name> <name><surname>Gomez</surname> <given-names>F.</given-names></name></person-group> (<year>2007</year>). <article-title>Training recurrent networks by evolino</article-title>. <source>Neural Comput</source>. <volume>19</volume>, <fpage>757</fpage>&#x02013;<lpage>779</lpage>. <pub-id pub-id-type="doi">10.1162/neco.2007.19.3.757</pub-id><pub-id pub-id-type="pmid">17298232</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schmidhuber</surname> <given-names>J.</given-names></name> <name><surname>Wierstra</surname> <given-names>D.</given-names></name> <name><surname>Gomez</surname> <given-names>F.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x0201C;Evolino: hybrid neuroevolution / optimal linear search for sequence learning,&#x0201D;</article-title> in <source>Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI&#x00027;05</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>), <fpage>853</fpage>&#x02013;<lpage>858</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shin</surname> <given-names>H.</given-names></name> <name><surname>Lee</surname> <given-names>J. K.</given-names></name> <name><surname>Kim</surname> <given-names>J.</given-names></name> <name><surname>Kim</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>Continual learning with deep generative replay</article-title>. <source>arXiv preprint</source> arXiv:1705.08690. <pub-id pub-id-type="doi">10.48550/arXiv.1705.08690</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sodhani</surname> <given-names>S.</given-names></name> <name><surname>Chandar</surname> <given-names>S.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>Toward training recurrent neural networks for lifelong learning</article-title>. <source>Neural Comput</source>. <volume>32</volume>, <fpage>1</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1162/neco_a_01246</pub-id><pub-id pub-id-type="pmid">31703175</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sussillo</surname> <given-names>D.</given-names></name> <name><surname>Abbott</surname> <given-names>L.</given-names></name></person-group> (<year>2009</year>). <article-title>Generating coherent patterns of activity from chaotic neural networks</article-title>. <source>Neuron</source> <volume>63</volume>, <fpage>544</fpage>&#x02013;<lpage>557</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2009.07.018</pub-id><pub-id pub-id-type="pmid">19709635</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Verstraeten</surname> <given-names>D.</given-names></name> <name><surname>Schrauwen</surname> <given-names>B.</given-names></name> <name><surname>D&#x00027;Haene</surname> <given-names>M.</given-names></name> <name><surname>Stroobandt</surname> <given-names>D.</given-names></name></person-group> (<year>2007</year>). <article-title>An experimental unification of reservoir computing methods</article-title>. <source>Neural Netw</source>. <volume>20</volume>, <fpage>391</fpage>&#x02013;<lpage>403</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2007.04.003</pub-id><pub-id pub-id-type="pmid">17517492</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Whittington</surname> <given-names>J. C. R.</given-names></name> <name><surname>Bogacz</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>An approximation of the error backpropagation algorithm in a predictive coding network with local hebbian synaptic plasticity</article-title>. <source>Neural Comput</source>. <volume>29</volume>, <fpage>1229</fpage>&#x02013;<lpage>1262</lpage>. <pub-id pub-id-type="doi">10.1162/NECO_a_00949</pub-id><pub-id pub-id-type="pmid">28333583</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/sino7/continual_online_learning_rnn_benchmark">https://github.com/sino7/continual_online_learning_rnn_benchmark</ext-link></p></fn>
</fn-group>
</back>
</article>