<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">762051</article-id>
<article-id pub-id-type="doi">10.3389/frobt.2022.762051</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Exploratory State Representation Learning</article-title>
<alt-title alt-title-type="left-running-head">Merckling et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">XSRL (eXploratory State Representation Learning)</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Merckling</surname>
<given-names>Astrid</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1308674/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Perrin-Gilbert</surname>
<given-names>Nicolas</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/612525/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Coninx</surname>
<given-names>Alex</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/544216/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Doncieux</surname>
<given-names>St&#xe9;phane</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/92837/overview"/>
</contrib>
</contrib-group>
<aff>
<institution>Sorbonne Universit&#x00E9;, CNRS, Institut des Syst&#x00E8;mes Intelligents et de Robotique, ISIR</institution>, <addr-line>Paris</addr-line>, <country>France</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/28940/overview">Thomas Nowotny</ext-link>, University of Sussex, United&#x20;Kingdom</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/9004/overview">Dimitri Ognibene</ext-link>, University of Milano-Bicocca, Italy</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/816055/overview">Felix Benjamin Kern</ext-link>, International Research Center for Neurointelligence (IRCN), Japan</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Astrid Merckling, <email>astrid.merckling@isir.upmc.fr</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Computational Intelligence in Robotics, a section of the journal Frontiers in Robotics and&#x20;AI</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>14</day>
<month>02</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>9</volume>
<elocation-id>762051</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>08</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>01</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Merckling, Perrin-Gilbert, Coninx and Doncieux.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Merckling, Perrin-Gilbert, Coninx and Doncieux</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>Not having access to compact and meaningful representations is known to significantly increase the complexity of reinforcement learning (RL). For this reason, it can be useful to perform state representation learning (SRL) before tackling RL tasks. However, obtaining a good state representation can only be done if a large diversity of transitions is observed, which can require a difficult exploration, especially if the environment is initially reward-free. To solve the problems of exploration and SRL in parallel, we propose a new approach called XSRL (eXploratory State Representation Learning). On one hand, it jointly learns compact state representations and a state transition estimator which is used to remove unexploitable information from the representations. On the other hand, it continuously trains an inverse model, and adds to the prediction error of this model a <italic>k</italic>-step learning progress bonus to form the maximization objective of a discovery policy. This results in a policy that seeks complex transitions from which the trained models can effectively learn. Our experimental results show that the approach leads to efficient exploration in challenging environments with image observations, and to state representations that significantly accelerate learning in RL&#x20;tasks.</p>
</abstract>
<kwd-group>
<kwd>state representation learning</kwd>
<kwd>pretraining</kwd>
<kwd>exploration</kwd>
<kwd>unsupervised learning</kwd>
<kwd>deep reinforcement learning</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Recent improvements in computational power and deep learning techniques have been combined with reinforcement learning (RL) to create deep RL (DRL) algorithms capable of solving complex control tasks with continuous state and action spaces (<xref ref-type="bibr" rid="B33">Li, 2018</xref>). These improvements have popularized end-to-end DRL techniques, which involve letting deep learning systems automatically learn representations and make predictions simultaneously (i.e. without performing a feature extraction as a preliminary phase). However, despite its simplicity of design, this end-to-end strategy has limitations [see <xref ref-type="bibr" rid="B15">Glasmachers (2017)</xref>] such as potential instability and slow convergence. In some cases it seems advantageous to separate representations and policies in different modules and train representations independently from the sparse and delayed rewards of RL&#x20;tasks.</p>
<p>State-of-the-art end-to-end DRL algorithms face a significant computational challenge, especially in the context of continuous control tasks with visual observations (<xref ref-type="bibr" rid="B28">Kostrikov et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B30">Laskin et&#x20;al., 2020</xref>). Instead of addressing this challenge directly, this paper focuses on the state representation learning (SRL) alternative. SRL focuses on solving the state representation learning problem independently of a control task, in order to make the inputs more machine-readable for DRL algorithms (<xref ref-type="bibr" rid="B29">Lange and Riedmiller, 2010</xref>; <xref ref-type="bibr" rid="B25">Jonschkowski and Brock, 2013</xref>; <xref ref-type="bibr" rid="B4">B&#xf6;hmer et&#x20;al., 2015</xref>). It relies on task-agnostic and reward-free interactions to capture relevant information about the agent and its environment and to represent it in a compact form (<xref ref-type="bibr" rid="B32">Lesort et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B39">Morik et&#x20;al., 2019</xref>).</p>
<p>The main starting point of our work is the following remark: for state representations to be useful as inputs to new RL tasks, the SRL training must have observed a large diversity of transitions. In the SRL literature, this has typically been addressed with demonstrations (<xref ref-type="bibr" rid="B47">Sermanet et&#x20;al., 2018</xref>; <xref ref-type="bibr" rid="B37">Merckling et&#x20;al., 2020</xref>) or random exploration (<xref ref-type="bibr" rid="B24">Jonschkowski and Brock, 2015</xref>; <xref ref-type="bibr" rid="B60">Yarats et&#x20;al., 2019</xref>). However, it is often impossible to randomly explore all environment transitions, and generating demonstrations requires time and a priori knowledge about potential tasks. Therefore, this work proposes to extend the exploration strategies used in RL to the context of SRL. We place ourselves in a pure exploration context, where no extrinsic reward is provided by the environment. A common approach with RL in this reward-free setting is to compute intrinsic rewards that estimate a degree of uncertainty about trained models (<xref ref-type="bibr" rid="B6">Bubeck et&#x20;al., 2009</xref>; <xref ref-type="bibr" rid="B49">Shyam et&#x20;al., 2019</xref>; <xref ref-type="bibr" rid="B46">Sekar et&#x20;al., 2020</xref>). With this approach in mind, we propose a new exploration strategy to learn state representation models, called XSRL (eXploratory State Representation Learning).</p>
<p>XSRL consists of two training procedures. In the first, XSRL learns state representations whose transitions are Markovian, while reducing dimensionality by filtering out information that cannot be exploited for next observation prediction. In the second, XSRL learns discovery policies that perform actions considered uncertain by an inverse model. Finally, to cope with the two sources of non-stationarity (changing state representations and changing inverse model predictions), we train two discovery policies in parallel and, based on their performance, reset one of them after a given number of training steps (as explained in <xref ref-type="sec" rid="s3-2-1">Section 3.2.1</xref>), where one training step corresponds to one gradient descent update of a loss on a batch of transitions. Training is performed online with a set of agents split into two halves, each of which follows one of the two policies (see <xref ref-type="sec" rid="s3-3">Section&#x20;3.3</xref>).</p>
<p>The main contributions of XSRL can be summarized as follows. First, we introduce a novel SRL architecture based on recursive state estimation predictions. Second, XSRL provides an exploration strategy by optimizing discovery policies driven towards uncertain transitions (<xref ref-type="sec" rid="s3-2">Section 3.2</xref>). Third, we demonstrate the validity of XSRL representations as well as its discovery policies through quantitative and qualitative evaluations on three different environments (<xref ref-type="sec" rid="s5-1">Section 5.1</xref>). Finally, we show improvements over other representation strategies through a comparative quantitative evaluation on unseen control tasks with the popular RL algorithm SAC (<xref ref-type="bibr" rid="B17">Haarnoja et&#x20;al., 2018</xref>) (<xref ref-type="sec" rid="s5-2">Section&#x20;5.2</xref>).</p>
</sec>
<sec id="s2">
<title>2 Related Work</title>
<p>Several other SRL algorithms with a near-future prediction objective have been proposed recently (<xref ref-type="bibr" rid="B2">Assael et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B4">B&#xf6;hmer et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B56">Wahlstr&#xf6;m et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B57">Watter et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B55">van Hoof et&#x20;al., 2016</xref>; <xref ref-type="bibr" rid="B22">Jaderberg et&#x20;al., 2017</xref>; <xref ref-type="bibr" rid="B48">Shelhamer et&#x20;al., 2017</xref>; <xref ref-type="bibr" rid="B12">de Bruin et&#x20;al., 2018</xref>). However, they separately learn state representations from which current observations can be reconstructed, and train a forward model on the learned states. The main limitation of these approaches is the inefficiency of the reconstruction objective, which leads to representations that contain unnecessary information about the observations. Instead, XSRL jointly learns a state transition estimator with the next observation prediction objective. On the one hand, this forces the learned state representations to retrieve information and memorize it through the recursive loop in order to restore the observability of the environment (in this work, the partial observability is due to image observations) and to verify the Markovian property. On the other hand, this forces the learned state representations to filter out unnecessary information, in particular information about distractors (i.e. elements that are not controllable or do not affect an agent).</p>
<p>The XSRL exploration strategy is inspired by the line of work that maximizes intrinsic rewards corresponding to prediction errors of a trained forward model, which is a form of dynamics-based curiosity (<xref ref-type="bibr" rid="B21">Hester and Stone, 2012</xref>; <xref ref-type="bibr" rid="B42">Pathak et&#x20;al., 2017</xref>; <xref ref-type="bibr" rid="B7">Burda et&#x20;al., 2018</xref>). These strategies often combine intrinsic rewards with extrinsic rewards to solve the complex exploration/exploitation tradeoff. Instead, the first phase of XSRL ignores extrinsic reward to focus on SRL and prediction model learning. Extrinsic reward only comes into play in a second phase (the RL tasks). In addition, for intrinsic motivation XSRL relies on prediction errors of an inverse model instead of those of a forward model. Prediction errors of an inverse model have the advantage of depending only on elements of the environment controllable by an agent (assuming there are no surjective transitions). This allows the remaining elements to be discarded and thus significantly reduces the size of the acquired state representation.</p>
<p>Finally, a variant of the <italic>k</italic>-step learning progress bonus is used to focus on transitions for which the forward model predictions are changing. Learning progress estimation was initially proposed in the field of developmental robotics (<xref ref-type="bibr" rid="B40">Oudeyer et&#x20;al., 2007</xref>). <xref ref-type="bibr" rid="B35">Lopes et&#x20;al. (2012)</xref> initiated the estimation of learning progress bonuses to solve the exploitation/exploration tradeoff in the model-based RL domain with finite MDPs. <xref ref-type="bibr" rid="B1">Achiam and Sastry (2017)</xref> scaled this approach to continuous MDPs with compact observations of several dozen dimensions. We apply the approach of <xref ref-type="bibr" rid="B1">Achiam and Sastry (2017)</xref> to image observations and in the SRL context.</p>
</sec>
<sec id="s3">
<title>3 Proposed Method: XSRL</title>
<sec id="s3-1">
<title>3.1 State Transition Estimator</title>
<p>The goal of SRL is to transform high-dimensional observations into machine-readable compact representations which retrieve information about an agent and the environment (<xref ref-type="bibr" rid="B32">Lesort et&#x20;al., 2018</xref>). With XSRL, we make the assumption that a good state representation must contain the information needed to predict the next observation from the previous time step, or at least the change in observation that can be explained by the agent&#x2019;s action.</p>
<p>Our state transition estimator <italic>&#x3c6;</italic> consists of two neural network parts (<italic>&#x3b1;</italic>, <italic>&#x3b2;</italic>) and a common network head <italic>&#x3b3;</italic>. While <italic>&#x3b1;</italic> is a convolutional neural network (CNN) that processes image observations, <italic>&#x3b2;</italic> is a multilayer perceptron (MLP) that processes the concatenated action and state vectors. Finally, the common network head <italic>&#x3b3;</italic> is an MLP that processes the concatenated output vectors of the first two networks to estimate next state vectors (<bold>s</bold>
<sub>
<italic>t</italic>&#x2b;1</sub>).</p>
<p>The graph in <xref ref-type="fig" rid="F1">Figure&#x20;1A</xref> shows how, from the current observation <bold>o</bold>
<sub>
<italic>t</italic>
</sub>, action <bold>a</bold>
<sub>
<italic>t</italic>
</sub> and state <bold>s</bold>
<sub>
<italic>t</italic>
</sub>, information is compactly merged into a next state <bold>s</bold>
<sub>
<italic>t</italic>&#x2b;1</sub> through the intermediate functions (<italic>&#x3b1;</italic>, <italic>&#x3b2;</italic>, <italic>&#x3b3;</italic>). Because of the recursive loop on the state representation, <italic>&#x3c6;</italic> bootstraps from an initial state drawn from a Gaussian distribution of mean zero and standard deviation 0.02. Putting all the functions together, we get the following definition of next state representation predictions:<disp-formula id="e1">
<mml:math id="m1">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c6;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mspace width="0.28em"/>
<mml:mi>&#x3b2;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(1)</label>
</disp-formula>where <italic>&#x3c6;</italic> abbreviates the state transition estimator network (<italic>&#x3b1;</italic>, <italic>&#x3b2;</italic>, <italic>&#x3b3;</italic>), whose parameters are gathered into the parameter set <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3c6;</italic>
</sub> &#x3d; {<bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3b1;</italic>
</sub>, <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3b2;</italic>
</sub>, <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3b3;</italic>
</sub>}. The implementation details of the whole neural network are displayed in <xref ref-type="table" rid="T3">Table&#x20;3</xref>.</p>
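<p>The data flow of Eq. 1 and the recursive loop on the state can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dense layers stand in for the CNN <italic>&#x3b1;</italic> and the MLPs <italic>&#x3b2;</italic> and <italic>&#x3b3;</italic>, and all names and dimensions (<monospace>linear</monospace>, <monospace>OBS_DIM</monospace>, <monospace>HID</monospace>, and so on) are assumptions chosen for illustration. Only the composition of the three functions and the initial state drawn from a zero-mean Gaussian with standard deviation 0.02 follow the text.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math
import random


def linear(n_in, n_out, rng):
    """Return a toy dense layer (tanh activation) with fixed random weights."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

    def layer(x):
        assert len(x) == n_in
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

    return layer


rng = random.Random(0)
OBS_DIM, ACT_DIM, STATE_DIM, HID = 12, 2, 4, 8  # illustrative sizes only

alpha = linear(OBS_DIM, HID, rng)             # stands in for the CNN over o_t
beta = linear(STATE_DIM + ACT_DIM, HID, rng)  # MLP over the concatenation [s_t, a_t]
gamma = linear(2 * HID, STATE_DIM, rng)       # common head over [alpha(o_t), beta([s_t, a_t])]


def phi(o_t, s_t, a_t):
    """Eq. 1: s_{t+1} = gamma([alpha(o_t), beta([s_t, a_t])])."""
    return gamma(alpha(o_t) + beta(s_t + a_t))  # "+" concatenates Python lists


# Recursive rollout: the initial state is drawn from N(0, 0.02^2), as in the text.
s = [rng.gauss(0.0, 0.02) for _ in range(STATE_DIM)]
for t in range(3):
    o = [rng.random() for _ in range(OBS_DIM)]        # placeholder flattened observation
    a = [rng.uniform(-1, 1) for _ in range(ACT_DIM)]  # placeholder action
    s = phi(o, s, a)                                  # s_{t+1} becomes the next input state
```
]]></preformat>
<p>In this sketch the state is the only quantity carried from one step to the next, which is what allows information from past time steps to be memorized.</p>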
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>
<bold>(A)</bold> XSRL learning process of state representations by jointly training a state transition estimator <italic>&#x3c6;</italic> formed by (<italic>&#x3b1;</italic>, <italic>&#x3b2;</italic>, <italic>&#x3b3;</italic>) and a next observation predictor <italic>&#x3c9;</italic>; the action <inline-formula id="inf1">
<mml:math id="m2">
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> is sampled from <italic>&#x3c0;</italic> &#x2208; {<italic>&#x3c0;</italic>
<sub>1</sub>, <italic>&#x3c0;</italic>
<sub>2</sub>} with equal probability (a set of agents are considered in parallel and each is assigned a randomly chosen policy, as explained in <xref ref-type="sec" rid="s3-3">Section 3.3</xref>). <bold>(B)</bold> XSRL learning process of a discovery policy by minimizing <inline-formula id="inf2">
<mml:math id="m3">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, which is related to intrinsic rewards. Intrinsic rewards are formed of two main terms: (i) <inline-formula id="inf3">
<mml:math id="m4">
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>: prediction errors of an inverse model <inline-formula id="inf4">
<mml:math id="m5">
<mml:mi mathvariant="script">I</mml:mi>
</mml:math>
</inline-formula> (also used in <inline-formula id="inf5">
<mml:math id="m6">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>); (ii) <italic>r</italic>
<sup>LPB</sup>: <italic>k</italic>-step learning progress bonuses for <italic>&#x3c6;</italic> where the parameters of <italic>&#x3c6;</italic>&#x2032; formed by (<italic>&#x3b1;</italic>&#x2032;, <italic>&#x3b2;</italic>&#x2032;, <italic>&#x3b3;</italic>&#x2032;) are delayed by <italic>k</italic> training steps and kept fixed.</p>
</caption>
<graphic xlink:href="frobt-09-762051-g001.tif"/>
</fig>
<p>
<italic>&#x3c6;</italic> is trained jointly with a next observation predictor <italic>&#x3c9;</italic>. <italic>&#x3c9;</italic> is a CNN with transposed convolution layers<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref> trained to deterministically predict from the outputs of <italic>&#x3c6;</italic> (i.e. <bold>s</bold>
<sub>
<italic>t</italic>&#x2b;1</sub>) the next observations: <inline-formula id="inf6">
<mml:math id="m7">
<mml:mi>&#x3c9;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>. This yields the following prediction error:<disp-formula id="e2">
<mml:math id="m8">
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
<label>(2)</label>
</disp-formula>
</p>
<p>All the parameters of <italic>&#x3c9;</italic> are gathered in a single parameter set <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3c9;</italic>
</sub>. The corresponding training process is described with the complete XSRL training process in <xref ref-type="sec" rid="s3-3">Section&#x20;3.3</xref>.</p>
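<p>As a small illustration of Eq. 2, the squared L2 prediction error can be computed as follows, assuming observations are flattened into lists of floats. The function name <monospace>squared_l2_error</monospace> is hypothetical and is not taken from the authors' code.</p>
<preformat preformat-type="code"><![CDATA[
```python
def squared_l2_error(o_pred, o_next):
    """Eq. 2: squared L2 norm of the difference between predicted and observed o_{t+1}."""
    if len(o_pred) != len(o_next):
        raise ValueError("prediction and target must have the same number of pixels")
    return sum((p - q) ** 2 for p, q in zip(o_pred, o_next))


# Toy 4-pixel example: a perfect prediction has zero error.
error_perfect = squared_l2_error([0.0, 0.5, 1.0, 0.25], [0.0, 0.5, 1.0, 0.25])
error_off = squared_l2_error([1.0, 2.0], [0.0, 0.0])
```
]]></preformat>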
<p>Thanks to this joint training of <italic>&#x3c6;</italic> and <italic>&#x3c9;</italic>, XSRL builds compact state representations which contain the information needed to predict the action consequences in the next observation. We assume that a ground truth state space exists that follows Markovian transitions. It is unknown and only image observations are available, making the environment partially observable, which may be due to perceptual aliasing or to the dynamics of the system that cannot be fully captured by an image. We therefore force <italic>&#x3c6;</italic> to memorize in the state representations (through the recursive loop) the information of past time steps in order to build a state space with Markovian transitions. Indeed, to predict the (predictable part of the) next observation with <italic>&#x3c9;</italic>, the next state representation <inline-formula id="inf7">
<mml:math id="m9">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c6;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> must contain the information of past and current time steps. As this information cannot be retrieved from <bold>o</bold>
<sub>
<italic>t</italic>
</sub> and <bold>a</bold>
<sub>
<italic>t</italic>
</sub> alone, some of it must be memorized in <bold>s</bold>
<sub>
<italic>t</italic>
</sub> through the recursive state loop. In this way, the state representations learned by XSRL are trained to form Markovian transitions that translate mathematically as follows:<disp-formula id="e3">
<mml:math id="m10">
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(3)</label>
</disp-formula>for all states <inline-formula id="inf8">
<mml:math id="m11">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mo>&#x2282;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and actions <inline-formula id="inf9">
<mml:math id="m12">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>&#x2282;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>.</p>
<p>As perceptual aliasing may occur, <italic>&#x3c6;</italic> needs to encode information about previous steps to predict the right observations after ambiguous ones. For example, in the case of a mobile robotics setup (such as the TurtleBot Maze environment described below), the representation built by XSRL is expected to capture the topology of the environment because a form of odometry is necessary to predict next observations [see <xref ref-type="bibr" rid="B3">B&#xf6;hmer et&#x20;al. (2013)</xref>].</p>
</sec>
<sec id="s3-2">
<title>3.2 Discovery in the Face of Uncertainty</title>
<sec id="s3-2-1">
<title>3.2.1&#x20;Over-Commitment</title>
<p>A problem that arises in pure exploration with dynamics-based curiosity is the non-stationarity of intrinsic rewards. Specifically, as in other dynamics-based curiosity explorations from image observations, two sources of non-stationarity emerge (<xref ref-type="bibr" rid="B7">Burda et&#x20;al., 2018</xref>): (i) the models change and adapt to novel observations, which modifies intrinsic rewards; (ii) the state representations change, which requires further adaptation of the models. Such a non-stationary training signal can lead to slow exploration as policies have to &#x201c;unlearn&#x201d; in areas where the novelty wears off. <xref ref-type="bibr" rid="B49">Shyam et&#x20;al. (2019)</xref> have called this problem &#x201c;over-commitment.&#x201d; They proposed to circumvent it by training a new policy from scratch. We follow a similar idea in XSRL by training two discovery policies in parallel, called <italic>&#x3c0;</italic>
<sub>1</sub> and <italic>&#x3c0;</italic>
<sub>2</sub>, and every <italic>T</italic>
<sub>reset</sub> iterations we reset the policy with the lowest cumulative intrinsic reward.</p>
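<p>The reset rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: <monospace>ToyPolicy</monospace> and <monospace>maybe_reset</monospace> are hypothetical names, a policy is reduced to a single scalar parameter, and we assume that only the reset policy has its cumulative intrinsic reward cleared, which the text does not specify.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random


class ToyPolicy:
    """Hypothetical stand-in for a discovery policy; it only tracks what the reset rule needs."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Re-initialise the parameters from scratch and clear the accumulated score.
        self.params = self.rng.gauss(0.0, 1.0)
        self.cumulative_reward = 0.0


def maybe_reset(policies, iteration, t_reset):
    """Every t_reset iterations, reset the policy with the lowest cumulative intrinsic reward."""
    if iteration == 0 or iteration % t_reset != 0:
        return None
    worst = min(policies, key=lambda p: p.cumulative_reward)
    worst.reset()
    return worst


pi_1, pi_2 = ToyPolicy(1), ToyPolicy(2)
pi_1.cumulative_reward, pi_2.cumulative_reward = 3.0, 1.5  # made-up scores
reset_policy = maybe_reset([pi_1, pi_2], iteration=100, t_reset=100)
```
]]></preformat>
<p>With these made-up scores, the second policy is the one that is re-initialised, while the first keeps its parameters and its accumulated reward.</p>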
</sec>
<sec id="s3-2-2">
<title>3.2.2 Intrinsic Rewards</title>
<p>The intrinsic rewards to be maximized by XSRL discovery policies are a combination of the following terms: (i) prediction errors of an inverse model which should be maximized on transitions with high uncertainty with respect to the elements controllable by an agent; (ii) <italic>k</italic>-step learning progress bonuses that should be maximized on transitions for which the predictions of the forward model <italic>&#x3c6;</italic> are changing; (iii) a policy entropy estimation to improve convergence stability. <xref ref-type="fig" rid="F1">Figure&#x20;1B</xref> shows the graph corresponding to the calculation of the two main terms (i) and&#x20;(ii).</p>
<sec id="s3-2-2-1">
<title>3.2.2.1 Inverse Model</title>
<p>Previous dynamics-based curiosity methods typically used a forward model to indirectly estimate action uncertainty (<xref ref-type="bibr" rid="B7">Burda et&#x20;al., 2018</xref>). The common issue with this approach is that it can drive exploration policies towards transitions with intrinsic (aleatoric) uncertainty (<xref ref-type="bibr" rid="B45">Schmidhuber, 1991</xref>). One way to solve this problem is to train an ensemble of models (<xref ref-type="bibr" rid="B10">Chua et&#x20;al., 2018</xref>). Initialized differently, the models tend to disagree in neighborhoods of transitions that have not been explored (so where there is a lack of data, i.e. epistemic uncertainty), but the models agree on transitions that have been observed, even if they contain irreducible aleatoric uncertainty. Thus, seeking transitions for which the models disagree drives the exploration towards epistemic uncertainty, which is the desired behavior. In this paper, we follow another approach, inspired by <xref ref-type="bibr" rid="B42">Pathak et&#x20;al. (2017)</xref>, that combines a forward and an inverse model. In <xref ref-type="bibr" rid="B42">Pathak et&#x20;al. (2017)</xref>, the inverse model is used to construct a feature space that erases environmental features that are not influenced by the agent&#x2019;s actions. Curiosity based on a forward model in this feature space avoids the issue of aleatoric uncertainty. We proceed in a slightly different way that removes the need for a new feature space, by encouraging discovery policies to seek transitions for which the composition of a forward model (<italic>&#x3c6;</italic>) and inverse model does not retrieve the intended action. This tends to be true when data is lacking (epistemic uncertainty), and false for data on which the models are well-trained. 
Aleatoric uncertainty is ignored by the forward model, but this does not prevent the inverse model from retrieving the correct action; aleatoric uncertainty alone therefore does not attract exploration.</p>
<p>Our inverse model takes as input a pair of consecutive states estimated by the forward model (<bold>s</bold>
<sub>
<italic>t</italic>
</sub>, <bold>s</bold>
<sub>
<italic>t</italic>&#x2b;1</sub>) to predict the action <inline-formula id="inf10">
<mml:math id="m13">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="script">I</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> executed by an agent to obtain the next state <bold>s</bold>
<sub>
<italic>t</italic>&#x2b;1</sub>. The prediction errors to be maximized by the discovery policies and minimized by the inverse model are calculated as follows:<disp-formula id="e4">
<mml:math id="m14">
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mfenced open="&#x2016;" close="&#x2016;">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>.</mml:mo>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
<p>The action <inline-formula id="inf11">
<mml:math id="m15">
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> is sampled from <italic>&#x3c0;</italic> &#x2208; {<italic>&#x3c0;</italic>
<sub>1</sub>, <italic>&#x3c0;</italic>
<sub>2</sub>} with equal probability. The training process of the inverse model is detailed later in <xref ref-type="sec" rid="s3-3">Section&#x20;3.3</xref>.</p>
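<p>The reward of <xref ref-type="disp-formula" rid="e4">Eq. 4</xref> is a squared Euclidean distance between two action vectors. A minimal Python sketch, assuming actions are given as plain lists of floats (the function name is hypothetical), is:</p>
<preformat preformat-type="code">
```python
def inverse_reward(predicted_action, executed_action):
    """Squared Euclidean distance between predicted and executed actions (Eq. 4)."""
    if len(predicted_action) != len(executed_action):
        raise ValueError("action vectors must have the same dimension")
    return sum((p - a) ** 2 for p, a in zip(predicted_action, executed_action))
```
</preformat>
<p>The same quantity is maximized by the discovery policies and minimized by the inverse model, so only the sign used in the respective losses differs.</p>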
</sec>
<sec id="s3-2-2-2">
<title>3.2.2.2 Learning Progress Bonus</title>
<p>To ensure that actions considered uncertain by the composition of a forward and inverse model lead to diverse unknown transitions, we use a <italic>k</italic>-step learning progress bonus on <italic>&#x3c6;</italic>. It makes the agent curious mainly about things that change the predictions of <italic>&#x3c6;</italic>. Following <xref ref-type="bibr" rid="B1">Achiam and Sastry (2017)</xref>, we compute this learning progress bonus from <italic>&#x3c6;</italic> and its clone denoted <italic>&#x3c6;</italic>&#x2032; formed by (<italic>&#x3b1;</italic>&#x2032;, <italic>&#x3b2;</italic>&#x2032;, <italic>&#x3b3;</italic>&#x2032;), whose parameters are delayed by <italic>k</italic> training steps and kept frozen. The squared Euclidean distance between the outputs of these two networks is an estimate of the changes in <italic>&#x3c6;</italic> after <italic>k</italic> training&#x20;steps.</p>
<p>The <italic>k</italic>-step learning progress bonus to be maximized by the two discovery policies is defined as follows:<disp-formula id="e5">
<mml:math id="m16">
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>LPB</mml:mtext>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mfenced open="&#x2016;" close="&#x2016;">
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
<label>(5)</label>
</disp-formula>where the action <inline-formula id="inf12">
<mml:math id="m17">
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> is sampled from <italic>&#x3c0;</italic> &#x2208; {<italic>&#x3c0;</italic>
<sub>1</sub>, <italic>&#x3c0;</italic>
<sub>2</sub>} with equal probability.</p>
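<p>To make <xref ref-type="disp-formula" rid="e5">Eq. 5</xref> concrete, the following Python sketch compares the outputs of a current and a delayed estimator on the same input triplet. The toy estimator built by <monospace>make_toy_phi</monospace> is a hypothetical stand-in for <italic>&#x3c6;</italic> and is not the network used in XSRL.</p>
<preformat preformat-type="code">
```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def learning_progress_bonus(phi, phi_delayed, observation, state, action):
    """Eq. 5: distance between current and k-step-delayed predictions."""
    inputs = (observation, state, action)
    return squared_distance(phi(*inputs), phi_delayed(*inputs))


def make_toy_phi(weight):
    """Hypothetical stand-in for phi: scales the elementwise sum of its inputs."""
    def phi(observation, state, action):
        return [weight * (o + s + a)
                for o, s, a in zip(observation, state, action)]
    return phi
```
</preformat>
<p>When the two estimators share the same parameters the bonus is zero, which matches the intuition that transitions on which <italic>&#x3c6;</italic> has stopped changing no longer attract the policies.</p>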
</sec>
<sec id="s3-2-2-3">
<title>3.2.2.3 Policy Entropy Estimation</title>
<p>
<xref ref-type="bibr" rid="B61">Ziebart et&#x20;al. (2008)</xref> and <xref ref-type="bibr" rid="B16">Haarnoja et&#x20;al. (2017)</xref> showed that optimizing policies to maximize entropy in addition to expected return improved their convergence. The formulation depends on a temperature <inline-formula id="inf13">
<mml:math id="m18">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> which is the weight of the entropy maximization term. Following <xref ref-type="bibr" rid="B17">Haarnoja et&#x20;al. (2018)</xref>, the temperature tuning is automated by formulating a different entropy objective, where the entropy is treated as a constraint. Approximating a dual gradient descent, <inline-formula id="inf14">
<mml:math id="m19">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> is adapted online by gradient steps on the following expression:<disp-formula id="e6">
<mml:math id="m20">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
<p>By default, <inline-formula id="inf15">
<mml:math id="m21">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>, the target entropy, is chosen to be equal to minus the action dimension <inline-formula id="inf16">
<mml:math id="m22">
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>. See <xref ref-type="bibr" rid="B17">Haarnoja et&#x20;al. (2018)</xref> for more details.</p>
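<p>One gradient step on the expression of <xref ref-type="disp-formula" rid="e6">Eq. 6</xref> can be sketched as below. The learning rate, the clamping of the temperature at zero, and the function name are assumptions of this sketch; practical implementations often optimize the logarithm of the temperature instead.</p>
<preformat preformat-type="code">
```python
def update_temperature(w_h, entropy_estimate, target_entropy, learning_rate=1e-3):
    """One gradient descent step on w_h * (entropy_estimate - target_entropy).

    The derivative with respect to w_h is the entropy gap itself, so the
    temperature grows when the policy entropy is below the target and
    shrinks when it is above.
    """
    gradient = entropy_estimate - target_entropy
    # Assumption: keep the temperature non-negative.
    return max(0.0, w_h - learning_rate * gradient)


def default_target_entropy(action_dim):
    """Default target entropy: minus the action dimension."""
    return -float(action_dim)
```
</preformat>
<p>For a two-dimensional action space, <monospace>default_target_entropy(2)</monospace> gives a target of &#x2212;2.</p>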
</sec>
</sec>
<sec id="s3-2-3">
<title>3.2.3 Discovery Policies</title>
<p>Now that we have detailed the three terms for computing intrinsic rewards, we explain how we train discovery policies to maximize them. In this work, we study environments with continuous action spaces. A possible approach to learn a policy in this case is to model it as a multivariate Gaussian distribution with a diagonal covariance matrix (<xref ref-type="bibr" rid="B17">Haarnoja et&#x20;al., 2018</xref>). To do this, we use a neural network with a first common part, then one head <italic>&#x3bc;</italic>
<sub>
<italic>&#x3c0;</italic>
</sub> with parameters <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3bc;</italic>
</sub> to predict a mean vector, and a second head &#x3a3;<sub>
<italic>&#x3c0;</italic>
</sub> with parameters <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>&#x3a3;</sub> to predict the diagonal elements of the covariance matrix. The outputs of these two heads, which have the same dimension as the action space, parameterize the policy as a Gaussian distribution defined as:<disp-formula id="e7">
<mml:math id="m23">
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x225c;</mml:mo>
<mml:mi mathvariant="script">N</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(7)</label>
</disp-formula>
</p>
<p>All parameters of a discovery policy <italic>&#x3c0;</italic> are gathered in a single parameter set <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3c0;</italic>
</sub> &#x3d; {<bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3bc;</italic>
</sub>, <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>&#x3a3;</sub>}. The reparametrization trick (<xref ref-type="bibr" rid="B27">Kingma and Welling, 2014</xref>) is used to sample an action from a policy (i.e. <inline-formula id="inf17">
<mml:math id="m24">
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x223c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>) so that the sampled action remains differentiable with respect to the policy parameters:<disp-formula id="e8">
<mml:math id="m25">
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x225c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3f5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="1em"/>
<mml:mo>,</mml:mo>
<mml:mspace width="1em"/>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3f5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x223c;</mml:mo>
<mml:mi mathvariant="script">N</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mn mathvariant="bold">0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(8)</label>
</disp-formula>
</p>
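<p>The sampling rule of <xref ref-type="disp-formula" rid="e8">Eq. 8</xref> can be illustrated with the Python standard library alone. In this sketch the outputs of the two heads are passed in as plain lists, and the second head is used as a per-dimension scale exactly as written in Eq. 8; the function name and the explicit random generator argument are assumptions.</p>
<preformat preformat-type="code">
```python
import random


def sample_action(mean, scale, rng):
    """Eq. 8: a = mu + eps * Sigma with eps ~ N(0, I), applied per dimension.

    mean, scale: outputs of the two policy heads for the current state
    rng: a random.Random instance supplying the standard normal noise
    """
    noise = [rng.gauss(0.0, 1.0) for _ in mean]
    action = [m + e * s for m, e, s in zip(mean, noise, scale)]
    return action, noise
```
</preformat>
<p>Because the noise is drawn independently of the policy parameters, the action is a deterministic function of the two head outputs once the noise is fixed; in an automatic differentiation framework this is what lets gradients flow back to both heads.</p>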
<p>The two discovery policies (<italic>&#x3c0;</italic> &#x2208; {<italic>&#x3c0;</italic>
<sub>1</sub>, <italic>&#x3c0;</italic>
<sub>2</sub>}) can be optimized directly from the intrinsic reward gradients. The intrinsic rewards are computed with prediction errors of an inverse model, <italic>k</italic>-step learning progress bonuses on <italic>&#x3c6;</italic>, and a policy entropy estimation, all of which use actions sampled from <italic>&#x3c0;</italic>. Thus, our discovery policy training strategy is based on stochastic gradients from batches for the minimization of the expected value of the following loss function:<disp-formula id="e9">
<mml:math id="m26">
<mml:mo>&#x2212;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>LPB</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>LPB</mml:mtext>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="script">H</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(9)</label>
</disp-formula>
</p>
<p>Minimizing this loss maximizes the intrinsic rewards. The corresponding training process is described in the next section.</p>
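<p>As a rough sketch of <xref ref-type="disp-formula" rid="e9">Eq. 9</xref>, the per-sample loss and its batch average can be written as follows. The three terms are assumed to have been computed beforehand, and the function names and the tuple layout of a sample are hypothetical.</p>
<preformat preformat-type="code">
```python
def discovery_policy_loss(r_inverse, r_lpb, entropy, w_inverse, w_lpb, w_entropy):
    """Eq. 9: negated weighted sum of the three intrinsic reward terms."""
    return -(w_inverse * r_inverse + w_lpb * r_lpb + w_entropy * entropy)


def batch_loss(samples, w_inverse, w_lpb, w_entropy):
    """Mean loss over a batch of (r_inverse, r_lpb, entropy) triples."""
    losses = [discovery_policy_loss(r_i, r_l, h, w_inverse, w_lpb, w_entropy)
              for r_i, r_l, h in samples]
    return sum(losses) / len(losses)
```
</preformat>
<p>The batch mean stands in for the expected value that the stochastic gradient steps approximate.</p>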
</sec>
</sec>
<sec id="s3-3">
<title>3.3 Optimization Process</title>
<p>Let us define the notation for the training examples used in our online training procedure. An even number <italic>B</italic>&#x20;&#x2265; 2 of agents run in parallel, indexed by <inline-formula id="inf18">
<mml:math id="m27">
<mml:mi>b</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mspace width="-0.17em"/>
<mml:mspace width="-0.17em"/>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="-0.17em"/>
<mml:mspace width="-0.17em"/>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>, each of them being initialized in the same fixed configuration. At time step <italic>t</italic>, a training example for (<italic>&#x3c6;</italic>, <italic>&#x3c9;</italic>) is an element of the form <inline-formula id="inf19">
<mml:math id="m28">
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>, composed respectively of next observation, current observation, previously estimated state representation, and executed action sampled from one of the two discovery policies as <inline-formula id="inf20">
<mml:math id="m29">
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
<mml:mo>&#x223c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> (following the sampling process defined in <xref ref-type="disp-formula" rid="e8">Eq. 8</xref>). Specifically, each half of the set of <italic>B</italic> agents follows one of the two policies <italic>&#x3c0;</italic> &#x2208; {<italic>&#x3c0;</italic>
<sub>1</sub>, <italic>&#x3c0;</italic>
<sub>2</sub>}. A state transition estimator <italic>&#x3c6;</italic> composed of three modules (<italic>&#x3b1;</italic>, <italic>&#x3b2;</italic>, <italic>&#x3b3;</italic>) estimates from the triplet input <inline-formula id="inf21">
<mml:math id="m30">
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula> the next state <inline-formula id="inf22">
<mml:math id="m31">
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>, from which <italic>&#x3c9;</italic> predicts the next observation <inline-formula id="inf23">
<mml:math id="m32">
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>.</p>
<p>The optimization problem to simultaneously train <italic>&#x3c6;</italic> and <italic>&#x3c9;</italic> is the minimization of the following objective function (based on the next observation prediction error of <xref ref-type="disp-formula" rid="e1">Eq. 1</xref>):<disp-formula id="e10">
<mml:math id="m33">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:mi>&#x3c9;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
<label>(10)</label>
</disp-formula>
</p>
<p>We compute this objective function after all <italic>B</italic> agents have executed their actions <inline-formula id="inf24">
<mml:math id="m34">
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, and let the backpropagation compute the partial derivatives of this objective function with respect to the parameter sets <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3c6;</italic>
</sub> and <bold>
<italic>&#x3b8;</italic>
</bold>
<sub>
<italic>&#x3c9;</italic>
</sub>. One gradient descent step on this loss is what we call a training step on <inline-formula id="inf25">
<mml:math id="m35">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
<sec id="s3-3-1">
<title>3.3.1 Update Interval</title>
<p>The inverse model and the two discovery policies are trained in parallel with the above training. For losses other than <inline-formula id="inf26">
<mml:math id="m36">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, a training step is not performed every time the agents execute their actions, but once per chosen update interval (<italic>T</italic>
<sub>
<italic>&#x3c0;</italic>
</sub>). Since policy optimization is much more sensitive to the i.i.d. assumption (part of the Robbins-Monro conditions for stochastic gradient descent to converge (<xref ref-type="bibr" rid="B44">Robbins and Monro, 1951</xref>)), we use the largest possible sampling period <italic>k</italic> for these two types of optimization (<italic>k</italic> also corresponds to the number of training steps by which the parameters of <italic>&#x3c6;</italic>&#x2032; are delayed). To do this, we specify an update interval <inline-formula id="inf27">
<mml:math id="m37">
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> defining the number of time steps before a training step is performed on the parameters of the inverse model and of the two discovery policies. Given a chosen batch size <inline-formula id="inf28">
<mml:math id="m38">
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and the number <italic>B</italic> of agents running in parallel, a batch of training examples is formed of <inline-formula id="inf29">
<mml:math id="m39">
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula> samplings. To maximize the independence between each of these samplings, we choose the sampling period to be <inline-formula id="inf30">
<mml:math id="m40">
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula>.</p>
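<p>The schedule described above can be illustrated with a minimal Python sketch. The function names and the example values of <italic>T</italic><sub><italic>&#x3c0;</italic></sub> and <italic>B</italic><sub><italic>&#x3c0;</italic></sub> are illustrative assumptions (only <italic>B</italic>&#x20;&#x3d; 32 is quoted in the text); the sketch simply evaluates the two floor expressions and lists the offsets <italic>ki</italic> that are subtracted from the current time step.</p>
<preformat>
```python
from typing import List, Tuple


def sampling_schedule(t_pi: int, b_pi: int, n_agents: int) -> Tuple[int, int]:
    """Return (samplings per batch, sampling period k) from the text's formulas.

    t_pi     : update interval T_pi (time steps between training steps)
    b_pi     : batch size B_pi
    n_agents : number B of agents running in parallel
    """
    if not (t_pi >= 1 and b_pi >= 1 and n_agents >= 1):
        raise ValueError("all arguments must be positive integers")
    n_samplings = b_pi // n_agents       # floor(B_pi / B)
    k = (t_pi * n_agents) // b_pi        # floor(T_pi * B / B_pi)
    return n_samplings, k


def sample_offsets(t_pi: int, b_pi: int, n_agents: int) -> List[int]:
    """Offsets k*i (i = 0 .. n_samplings - 1) subtracted from the current step t."""
    n_samplings, k = sampling_schedule(t_pi, b_pi, n_agents)
    return [k * i for i in range(n_samplings)]


# Illustrative values only: B = 32 is quoted in the text, T_pi and B_pi are assumed.
print(sampling_schedule(64, 128, 32))   # (4, 16)
print(sample_offsets(64, 128, 32))      # [0, 16, 32, 48]
```
</preformat>
<p>With these assumed values, the largest offset (48) stays below the update interval (64), so all samplings of a batch fall within the most recent interval.</p>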
<p>The optimization problem to train the inverse model is the minimization of the following objective function (based on the action prediction error of <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>):<disp-formula id="e11">
<mml:math id="m41">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msubsup>
<mml:mrow>
<mml:mfenced open="&#x2016;" close="&#x2016;">
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mspace width="0.28em"/>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="normal">b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
<label>(11)</label>
</disp-formula>
</p>
<p>Backpropagation computes the partial derivatives of this objective function only with respect to the parameter set <inline-formula id="inf31">
<mml:math id="m42">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>.</p>
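<p>As a hedged illustration of <xref ref-type="disp-formula" rid="e11">Eq. 11</xref>, the following Python sketch evaluates the objective for already computed action predictions. The nested-list layout (samplings, then agents, then action dimensions) and the function name are assumptions made for the example; the gradient computation itself is not shown.</p>
<preformat>
```python
from typing import Sequence


def inverse_model_loss(predicted: Sequence[Sequence[Sequence[float]]],
                       taken: Sequence[Sequence[Sequence[float]]],
                       b_pi: int) -> float:
    """Sum of squared L2 action prediction errors, divided by the batch size B_pi.

    predicted[i][b] : action predicted by the inverse model for sampling i, agent b
    taken[i][b]     : action actually executed for sampling i, agent b
    """
    total = 0.0
    for pred_i, act_i in zip(predicted, taken):        # samplings i
        for pred_b, act_b in zip(pred_i, act_i):       # agents b
            total += sum((p - a) ** 2 for p, a in zip(pred_b, act_b))
    return total / b_pi


# One sampling, two agents, two action dimensions (made-up numbers).
predicted = [[[1.0, 0.0], [0.0, 0.0]]]
taken = [[[0.0, 0.0], [0.0, 2.0]]]
print(inverse_model_loss(predicted, taken, b_pi=2))    # 2.5
```
</preformat>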
<p>The optimization problem to train the two discovery policies is the minimization of the following objective function (based on the loss of <xref ref-type="disp-formula" rid="e9">Eq. 9</xref>):<disp-formula id="e12">
<mml:math id="m43">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="left">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right"/>
<mml:mtd columnalign="left">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mo>&#x2212;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="normal">b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>LPB</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>LPB</mml:mtext>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="normal">b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mtext>log</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="normal">b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(12)</label>
</disp-formula>where the parameter set <inline-formula id="inf32">
<mml:math id="m44">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> is frozen, and that of <italic>&#x3c6;</italic>&#x2032; is updated every <italic>k</italic> iterations with that of <italic>&#x3c6;</italic> and kept frozen. More specifically, the backpropagation computes the partial derivatives of this objective function with respect to the parameter set of <italic>&#x3c0;</italic> &#x2208; {<italic>&#x3c0;</italic>
<sub>1</sub>, <italic>&#x3c0;</italic>
<sub>2</sub>}. This objective function is low where the inverse model fails to predict actions and where the predictions of the forward model (<italic>&#x3c6;</italic>) vary greatly.</p>
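<p>The sign convention of <xref ref-type="disp-formula" rid="e12">Eq. 12</xref> can be checked with a small Python sketch. It assumes that the two reward terms and the log-probabilities of the sampled actions have already been computed and flattened over the indices <italic>i</italic> and <italic>b</italic>; the function and argument names are hypothetical.</p>
<preformat>
```python
from typing import Sequence


def policy_loss(r_inv: Sequence[float],
                r_lpb: Sequence[float],
                log_prob: Sequence[float],
                w_inv: float, w_lpb: float, w_ent: float,
                b_pi: int) -> float:
    """Negated weighted rewards plus the entropy term, divided by B_pi.

    r_inv    : inverse-model prediction error rewards
    r_lpb    : learning progress bonus rewards
    log_prob : log-probabilities of the sampled actions under the policy
    """
    total = 0.0
    for ri, rl, lp in zip(r_inv, r_lpb, log_prob):
        total += -(w_inv * ri + w_lpb * rl - w_ent * lp)
    return total / b_pi


# Made-up numbers: larger rewards give a lower (more negative) loss.
print(policy_loss([1.0, 2.0], [0.5, 0.5], [-1.0, -2.0], 1.0, 2.0, 0.1, b_pi=2))
```
</preformat>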
<p>Finally, to automatically tune the temperature <inline-formula id="inf33">
<mml:math id="m45">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>, we minimize the following objective function:<disp-formula id="e13">
<mml:math id="m46">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mtext>log</mml:mtext>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="normal">b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(13)</label>
</disp-formula>
</p>
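<p>A minimal sketch of <xref ref-type="disp-formula" rid="e13">Eq. 13</xref> and of its derivative with respect to the temperature is given below. The names are illustrative, and <monospace>target_entropy</monospace> stands for the target entropy term of the equation; the derivative follows directly from the objective being linear in the temperature.</p>
<preformat>
```python
from typing import Sequence


def temperature_loss(log_prob: Sequence[float], w_ent: float,
                     target_entropy: float, b_pi: int) -> float:
    """Temperature times (-log-probability - target entropy), summed, divided by B_pi."""
    return sum(w_ent * (-lp - target_entropy) for lp in log_prob) / b_pi


def temperature_grad(log_prob: Sequence[float],
                     target_entropy: float, b_pi: int) -> float:
    """Derivative of temperature_loss with respect to w_ent."""
    return sum(-lp - target_entropy for lp in log_prob) / b_pi


# Made-up numbers: the entropy estimate exceeds the target, so the gradient is
# positive and a gradient descent step would decrease the temperature.
print(temperature_loss([-1.0, -3.0], 0.5, 1.5, b_pi=2))   # 0.25
print(temperature_grad([-1.0, -3.0], 1.5, b_pi=2))        # 0.5
```
</preformat>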
<p>As explained in <xref ref-type="sec" rid="s3-2-1">Section 3.2.1</xref>, we choose to simultaneously train two discovery policies to mitigate &#x201c;over-commitment&#x201d; (<xref ref-type="bibr" rid="B49">Shyam et&#x20;al., 2019</xref>). Specifically, our XSRL algorithm (as displayed in <xref ref-type="statement" rid="Algorithm_1">Algorithm 1</xref>) resets the policy with the lowest accumulation of the two main intrinsic reward terms, which are the prediction error of the inverse model (<xref ref-type="disp-formula" rid="e4">Eq. 4</xref>) and the <italic>k</italic>-step learning progress bonus (<xref ref-type="disp-formula" rid="e5">Eq. 5</xref>). This accumulation is computed by summing over the indices (<italic>b</italic>) and <italic>i</italic> as in <xref ref-type="disp-formula" rid="e12">Eq. 12</xref> and also over <italic>T</italic>
<sub>reset</sub> time steps (defined in <xref ref-type="table" rid="T3">Table&#x20;3</xref>), which results in:<disp-formula id="e14">
<mml:math id="m47">
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2217;</mml:mo>
<mml:mfenced open="&#x230a;" close="&#x230b;">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>&#x2009;reset</mml:mtext>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:munderover>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="normal">b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>LPB</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>LPB</mml:mtext>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="normal">b</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(14)</label>
</disp-formula>where <italic>T</italic>
<sub>reset</sub> &#x3e; <italic>T</italic>
<sub>
<italic>&#x3c0;</italic>
</sub>.</p>
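<p>The reset rule can be sketched as follows, assuming that the reward terms collected for each policy over the reset window are available as flat lists. The function names and the tie-breaking choice are assumptions made for the example and are not specified in the text.</p>
<preformat>
```python
from typing import Sequence


def accumulated_reward(r_inv: Sequence[float], r_lpb: Sequence[float],
                       w_inv: float, w_lpb: float) -> float:
    """Weighted sum of the two intrinsic reward terms over all collected samples."""
    return sum(w_inv * ri + w_lpb * rl for ri, rl in zip(r_inv, r_lpb))


def policy_to_reset(acc_policy_1: float, acc_policy_2: float) -> int:
    """Return 1 or 2: the policy with the lowest accumulation.

    Ties are resolved in favour of resetting policy 1 (an arbitrary assumption).
    """
    return 1 if acc_policy_2 >= acc_policy_1 else 2


# Made-up reward traces for the two discovery policies.
acc_1 = accumulated_reward([1.0, 2.0], [0.5, 0.5], w_inv=1.0, w_lpb=2.0)
acc_2 = accumulated_reward([0.5, 0.5], [0.25, 0.25], w_inv=1.0, w_lpb=2.0)
print(acc_1, acc_2, policy_to_reset(acc_1, acc_2))   # 5.0 2.0 2
```
</preformat>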
<p>
<statement content-type="algorithm" id="Algorithm_1">
<label>Algorithm 1</label>
<p>XSRL algorithm</p>
<p>
<inline-graphic xlink:href="frobt-09-762051-fx1.tif"/>
</p>
<p>In summary, our XSRL algorithm, described in <xref ref-type="statement" rid="Algorithm_1">Algorithm 1</xref>, performs four types of optimization: (i) of a state transition estimator with <xref ref-type="disp-formula" rid="e10">Eq. 10</xref>, (ii) of an inverse model with <xref ref-type="disp-formula" rid="e11">Eq. 11</xref>, (iii) of two distinct discovery policies with <xref ref-type="disp-formula" rid="e12">Eq. 12</xref>, and (iv) of an automatically tuned temperature with <xref ref-type="disp-formula" rid="e13">Eq. 13</xref>. See <xref ref-type="table" rid="T3">Table&#x20;3</xref> for more details on the hyperparameters of our XSRL implementation.</p>
<p>
<xref ref-type="fig" rid="F2">Figure&#x20;2</xref> shows the two phases of XSRL considered in this work. <bold>A</bold>: the twofold training procedure that XSRL follows in order to effectively explore the environment and to estimate state representations consistent with the true state of the system. <bold>B</bold>: the use of the trained representation model <italic>&#x3c6;</italic> in an unseen RL&#x20;task.</p>
</statement>
</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>
<bold>(A)</bold> Schematic representation of the XSRL twofold training procedure to provide compact state representations by jointly training a state transition estimator <italic>&#x3c6;</italic> with a next observation predictor <italic>&#x3c9;</italic>, guided by two discovery policies <italic>&#x3c0;</italic> &#x2208; {<italic>&#x3c0;</italic>
<sub>1</sub>, <italic>&#x3c0;</italic>
<sub>2</sub>} in an online manner (see <xref ref-type="sec" rid="s3-3">Section 3.3</xref>). Here <inline-formula id="inf47">
<mml:math id="m61">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c6;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf48">
<mml:math id="m62">
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> where <italic>&#x3c6;</italic>&#x2032; is a clone of <italic>&#x3c6;</italic> whose parameters are delayed by <italic>k</italic> iterations and kept frozen. <bold>(B)</bold> A schematic illustration of the transfer of the pretrained state representation model (<italic>&#x3c6;</italic>) to an unknown RL&#x20;task.</p>
</caption>
<graphic xlink:href="frobt-09-762051-g002.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4 Experimental Setup</title>
<p>This section describes a systematic evaluation of the criteria that the XSRL algorithm should fulfill. XSRL should learn state representations which (i) retrieve information (possibly by memorizing information from past time steps) to guarantee that their transitions are Markovian and (ii) filter unnecessary information. Furthermore, XSRL should learn discovery policies which (iii) explore efficiently even in the presence of aleatoric uncertainty. Finally, after the XSRL pretraining, the state transition estimator <italic>&#x3c6;</italic> must (iv) provide advantageous inputs to solve unseen RL&#x20;tasks.</p>
<p>We evaluate criterion (i) by measuring the average of the next observation prediction errors on a training dataset and a test dataset. While the former is made up of samples generated during the training process, the latter is carefully designed for each environment, as described in <xref ref-type="sec" rid="s4-2-6">Section 4.2.6</xref>. Although some parts of the next observations are irreducibly unpredictable, the lower the error, the more likely the transitions are to be Markovian. Furthermore, we compare the observation prediction error of XSRL with the observation reconstruction error obtained by RAE (Regularized Autoencoder (<xref ref-type="bibr" rid="B14">Ghosh et&#x20;al., 2019</xref>)). However, since it is more complicated to predict the next observation from past time step information than to reconstruct it, it is expected that the latter will perform better.</p>
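<p>The measure used for criterion (i) can be sketched as the mean squared L2 error between observed and predicted next observations over a dataset. In this illustrative Python sketch, observations are represented as flat lists of floats, which is an assumption made for brevity (the actual observations are images), and the function name is hypothetical.</p>
<preformat>
```python
from typing import Sequence


def mean_prediction_error(observations: Sequence[Sequence[float]],
                          predictions: Sequence[Sequence[float]]) -> float:
    """Average over the dataset of the squared L2 next-observation prediction error."""
    errors = [
        sum((o - p) ** 2 for o, p in zip(obs, pred))
        for obs, pred in zip(observations, predictions)
    ]
    if not errors:
        raise ValueError("the dataset must contain at least one sample")
    return sum(errors) / len(errors)


# Two made-up samples with two "pixels" each: errors 1.0 and 4.0, mean 2.5.
print(mean_prediction_error([[0.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 3.0]]))
```
</preformat>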
<p>We evaluate criterion (ii) on state representations and criterion (iii) on discovery policies by training XSRL in a TurtleBot Maze environment with artificial aleatoric uncertainty in its transitions. The aleatoric uncertainty is introduced as follows: at every time step, the color of one of the walls, initially in front of the robot, is randomly sampled (see <xref ref-type="fig" rid="F5">Figure&#x20;5</xref>). In addition, to assess criterion (iii), we perform an exploration evaluation during the state embedding pretraining of XSRL. We measure the average number of training steps on <inline-formula id="inf49">
<mml:math id="m63">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> before one of the <italic>B</italic> (<italic>B</italic>&#x20;&#x3d; 32 as detailed in <xref ref-type="table" rid="T3">Table&#x20;3</xref>) agents reaches the other end of the maze. Furthermore, since <inline-formula id="inf50">
<mml:math id="m64">
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> is a useful prediction error measure to quantitatively evaluate the generalization performance of <italic>&#x3c9;</italic> (which is directly related to the performance of discovery policies), a high error measure will indicate that the exploration strategy is not effective. To complete the evaluation of the discovery policy criterion, we also compare two XSRL ablations:<list list-type="simple">
<list-item>
<p>&#x2022; XSRL-MaxEnt: trains a policy to maximize its entropy estimation by keeping only the entropy term in <xref ref-type="disp-formula" rid="e12">Eq.&#x20;12</xref>
</p>
</list-item>
<list-item>
<p>&#x2022; XSRL-random: samples actions randomly from the action&#x20;space.</p>
</list-item>
</list>
</p>
<p>Here, XSRL-random is expected to give minimal performance, while XSRL-MaxEnt should be worse than XSRL, as it only depends on the policy distribution.</p>
<p>We evaluate criterion (iv) with the transfer of the trained state representation network <italic>&#x3c6;</italic> to unseen RL tasks. During RL, the environment provides an agent with extrinsic rewards to train an optimal policy, while <italic>&#x3c6;</italic> transforms large observations into compact state vectors as shown in <xref ref-type="fig" rid="F2">Figure&#x20;2B</xref>. To rigorously conduct this evaluation, we use a popular RL algorithm with continuous actions&#x2014;SAC (Soft Actor-Critic) (<xref ref-type="bibr" rid="B17">Haarnoja et&#x20;al., 2018</xref>)&#x2014;on each of the three environment tasks shown in <xref ref-type="fig" rid="F3">Figure&#x20;3</xref>. These continuous control tasks (presented in detail in <xref ref-type="sec" rid="s4-2">Section 4.2</xref>) are challenging because of their high-dimensional observation spaces consisting of images. In order to obtain a quantitative evaluation of our results, we compare the performance with other representation strategies detailed&#x20;below.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>High-rendered images of the three continuous control environments in PyBullet (<xref ref-type="bibr" rid="B11">Coumans and Bai, 2016&#x2013;2019</xref>). <bold>(A)</bold> The novel TurtleBot Maze environment proposed in this work, where the observation space corresponds to a first-person perspective camera view. We use a goal reaching task in this environment to quantify the exploration performance of XSRL. <bold>(B)</bold> The InvertedPendulum environment provides a swing up task. <bold>(C)</bold> The HalfCheetah environment provides a locomotion task. <bold>(B,C)</bold> are two popular torque-controlled benchmark environments where the observation space corresponds to the view of a camera tracking the agent, as in the DMControl benchmark (<xref ref-type="bibr" rid="B52">Tassa et&#x20;al., 2018</xref>).</p>
</caption>
<graphic xlink:href="frobt-09-762051-g003.tif"/>
</fig>
<sec id="s4-1">
<title>4.1 Baselines</title>
<p>We compare the performances of XSRL representations on unseen RL tasks to the following five baselines: <italic>ground truth, open-loop, position, RAE, random network</italic>.</p>
<p>Of all these baselines, only RAE (Regularized Autoencoder) (<xref ref-type="bibr" rid="B14">Ghosh et&#x20;al., 2019</xref>) is a state-of-the-art SRL method. We train it using the same three rewardless environments with fixed state initializations as for XSRL (described in <xref ref-type="sec" rid="s4-2-4">Section 4.2.4</xref>). However, since it has no associated exploration strategy to generate observations, we use either a random policy (which is defined as above for XSRL-random) as previously done by <xref ref-type="bibr" rid="B60">Yarats et&#x20;al. (2019)</xref>, or an effective exploration designed with expert knowledge (indicated by the suffix <italic>-explor</italic>). In TurtleBot Maze, this effective exploration corresponds to episodes with 50&#x20;time steps, with random actions, and random resets (i.e. random initial states anywhere in the maze). In the two torque-controlled environments, this effective exploration has 0.5 probability to take a random action and otherwise takes an action sampled from an optimal policy pretrained in the RL context (i.e. where extrinsic rewards are available) with SAC from the ground truth state&#x20;space.</p>
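<p>The expert-guided exploration used for the two torque-controlled environments can be sketched as a mixture of a random action and an expert action. In the Python sketch below, uniform sampling within per-dimension bounds and all names are assumptions made for illustration; the expert action is passed in as a plain list rather than produced by a trained SAC policy.</p>
<preformat>
```python
import random
from typing import List, Sequence


def explor_action(expert_action: Sequence[float],
                  low: Sequence[float],
                  high: Sequence[float],
                  rng: random.Random,
                  p_random: float = 0.5) -> List[float]:
    """With probability p_random return a random action, else the expert action."""
    # rng.random() is uniform on [0, 1), so this branch has probability 1 - p_random.
    if rng.random() >= p_random:
        return list(expert_action)
    # Assumed here: random actions are drawn uniformly within the action bounds.
    return [rng.uniform(lo, hi) for lo, hi in zip(low, high)]


rng = random.Random(0)
actions = [explor_action([0.3], [-1.0], [1.0], rng) for _ in range(5)]
print(actions)   # a mix of the expert action [0.3] and random actions
```
</preformat>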
<p>RAE is a deterministic alternative to the variational autoencoder (VAE) (<xref ref-type="bibr" rid="B27">Kingma and Welling, 2014</xref>), which preserves the regularizing effect of the latter. To the best of our knowledge, RAE is the only method in the SRL context that achieves state-of-the-art performance on the torque-controlled tasks of the DeepMind Control Suite (DMControl) benchmark (<xref ref-type="bibr" rid="B52">Tassa et&#x20;al., 2018</xref>) with visual observations (similar to those considered in this article). Specifically, in the DMControl benchmark, <xref ref-type="bibr" rid="B60">Yarats et&#x20;al. (2019)</xref> obtain results in which RAE with the SAC algorithm performs as well as PlaNet (<xref ref-type="bibr" rid="B18">Hafner et&#x20;al., 2018</xref>), a state-of-the-art model-based RL method.</p>
<p>We also use a random network representation in which, instead of training a network (i.e. similar to the <italic>&#x3b1;</italic> function of XSRL), its parameters are simply fixed to random values sampled from a Gaussian distribution of mean zero and standard deviation 0.02. This strategy without any training was popularized for classification problems by <xref ref-type="bibr" rid="B23">Jarrett et&#x20;al. (2009)</xref> and then for RL tasks by <xref ref-type="bibr" rid="B13">Gaier and Ha (2019)</xref>.</p>
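<p>The random network baseline can be illustrated with a short Python sketch in which weights are drawn once from a zero-mean Gaussian with standard deviation 0.02 and then left untouched. The single linear layer, its sizes and the function names are assumptions made for the example; the architecture of the actual network is not reproduced here.</p>
<preformat>
```python
import random
from typing import List, Sequence


def random_linear_layer(n_in: int, n_out: int, seed: int = 0,
                        std: float = 0.02) -> List[List[float]]:
    """Weights sampled once from N(0, std^2); they are never trained afterwards."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def apply_layer(weights: Sequence[Sequence[float]],
                x: Sequence[float]) -> List[float]:
    """Project an input vector with the fixed random weights (no bias, no nonlinearity)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


weights = random_linear_layer(n_in=4, n_out=2, seed=0)
state = apply_layer(weights, [1.0, 0.5, -0.5, 2.0])
print(len(state))   # 2
```
</preformat>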
<p>In the InvertedPendulum environment only, we use the position baseline, which corresponds to position measurements without velocities. The absence of velocities lets us show the relevance of such physical dynamic information for solving the swing-up task. To achieve good performance, XSRL must extract this information from the observations of consecutive time steps by memorizing it through the recursive&#x20;loop.</p>
<p>Finally, we use a ground truth baseline, which is a state directly extracted from the environment dynamics (see <xref ref-type="sec" rid="s4-2">Section 4.2</xref> for details in each environment), and an open-loop baseline, where the state is defined as the time step of an agent. While the ground truth baseline is expected to constitute an upper bound on RL performance, the open-loop baseline serves as a sanity check. The latter enables us to check whether the three RL tasks require closed-loop policies, that is, whether it is necessary to use the agent&#x2019;s perception and proprioceptive information to solve the task, or whether open-loop policy learning strategies may be sufficient. In particular, this gives the minimum performance to beat to show the relevance of different state representation strategies.</p>
<p>We do not include state-of-the-art end-to-end RL baselines (<xref ref-type="bibr" rid="B31">Lee et&#x20;al., 2019</xref>; <xref ref-type="bibr" rid="B28">Kostrikov et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B30">Laskin et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B50">Srinivas et&#x20;al., 2020</xref>), despite their open source implementations, because their computational cost is impractical given our hardware setting and limited computation&#x20;time.</p>
</sec>
<sec id="s4-2">
<title>4.2 Environment Details</title>
<p>We perform our experiments on the three environments presented in <xref ref-type="fig" rid="F3">Figure&#x20;3</xref>, which are all partially observable due to image observations. InvertedPendulum and HalfCheetah belong to the MuJoCo torque-controlled benchmark (<xref ref-type="bibr" rid="B53">Todorov et&#x20;al., 2012</xref>), and we chose their implementation in PyBullet (<xref ref-type="bibr" rid="B11">Coumans and Bai, 2016&#x2013;2019</xref>) for compatibility with our computers.</p>
<sec id="s4-2-1">
<title>4.2.1 TurtleBot Maze</title>
<p>We have implemented this environment as a U-shaped maze with the TurtleBot robot from PyBullet (<xref ref-type="bibr" rid="B11">Coumans and Bai, 2016&#x2013;2019</xref>), inspired by Ant Maze from OpenAI Gym (<xref ref-type="bibr" rid="B5">Brockman et&#x20;al., 2016</xref>) used by <xref ref-type="bibr" rid="B49">Shyam et&#x20;al. (2019)</xref>. The two-dimensional action applies a velocity to each of the left and right wheels of the robot. The three-dimensional ground truth state is formed by the Cartesian coordinates of the robot along the <italic>x</italic> and <italic>y</italic> axes and its orientation angle. In this environment, the task is a goal-reaching task with sparse rewards and a long horizon.<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref> Thus, it is a challenge for an RL algorithm to address the exploration/exploitation tradeoff. This task provides an RL algorithm with a sparse reward of &#x2b;1 each time the robot reaches the goal, a reward of &#x2212;1 each time it touches a wall, and 0 otherwise, within a maximum of 100&#x20;time steps before the robot and the goal are randomly reinitialized. In addition, this task provides an RL algorithm with the position of the goal, which is concatenated to the state representation. Indeed, since the goal position is task-dependent, it cannot be learned by state representations in a reward-free context.</p>
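<p>The reward structure and the goal concatenation described above can be summarized by the following sketch. The function names are hypothetical, the precedence of the goal reward over the wall penalty when both events occur in the same step is an assumption, and the goal position is taken to be two-dimensional purely for illustration.</p>
<preformat><![CDATA[
```python
def maze_reward(reached_goal, touched_wall):
    """Sparse reward: +1 at the goal, -1 on wall contact, 0 otherwise."""
    if reached_goal:
        # Assumption: the goal reward takes precedence if both events coincide.
        return 1.0
    if touched_wall:
        return -1.0
    return 0.0


def augment_state(state_representation, goal_position):
    """Concatenate the task-dependent goal position to the learned state."""
    return list(state_representation) + list(goal_position)
```
]]></preformat>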
</sec>
<sec id="s4-2-2">
<title>4.2.2 InvertedPendulum</title>
<p>The InvertedPendulum is attached to a pivot point on a cart sliding on a rail. The one-dimensional action applies a force to the cart, which is limited to linear movement on the rail. The five-dimensional ground truth state is formed by the x-axis position and velocity of the cart, the angular position in Cartesian space (i.e. cosine and sine of the angle) and the angular velocity of the pendulum. In this environment, the task is a swing-up task where the pendulum must swing several times before balancing upward (since the pendulum is initialized downwards). This task provides an RL algorithm with a reward for keeping the pendulum up vertically, within a maximum of 1,000&#x20;time steps before the pendulum is reset to a random&#x20;state.</p>
</sec>
<sec id="s4-2-3">
<title>4.2.3 HalfCheetah</title>
<p>The HalfCheetah is composed of eight rigid links: the torso, the back, and two legs each composed of three rigid and controllable links. The six-dimensional action applies torques to each of the six joints of the two legs. The 17-dimensional ground truth state is formed by the angular positions and velocities of the six joints, as well as the agent&#x2019;s Cartesian position. In this environment, the task is a locomotion task where an agent must run to progress as far as possible. This task provides an RL algorithm with a reward for moving the robot as fast as possible, within a maximum of 1,000&#x20;time steps and with a constraint that resets it as soon as it gets too close to the ground (which is not applied during XSRL and RAE training).</p>
</sec>
<sec id="s4-2-4">
<title>4.2.4 Rewardless Environments</title>
<p>We now detail the differences between the three environments used without reward in the SRL context and the three tasks described above used in the RL context. In the SRL context (i.e. during XSRL and RAE pretraining), an agent is reset after a longer horizon, and is initialized to a constant state. For TurtleBot Maze the horizon is 500&#x20;time steps, hence the need for an effective exploration to reach the other end of the maze, which is opposite the fixed initial state. For the two torque-controlled environments (InvertedPendulum and HalfCheetah), the horizon is 2,000&#x20;time steps (i.e. 500 agent steps with an action repeat of four). The remaining common hyperparameters of the three environments for the SRL and RL contexts are displayed in <xref ref-type="table" rid="T1">Table&#x20;1</xref>.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Hyperparameters used in the PyBullet environments (<xref ref-type="bibr" rid="B11">Coumans and Bai, 2016&#x2013;2019</xref>).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Hyperparameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Image rendering size</td>
<td align="center">3 &#xd7; 96 &#xd7; 96</td>
</tr>
<tr>
<td align="left">Image size after downscaling</td>
<td align="center">3 &#xd7; 64 &#xd7; 64</td>
</tr>
<tr>
<td align="left">Action repeat</td>
<td align="center">1 TurtleBot Maze</td>
</tr>
<tr>
<td align="left"/>
<td align="center">4 otherwise</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-2-5">
<title>4.2.5 Image Preprocessing</title>
<p>The image preprocessing performed in these environments follows standard state-of-the-art practice. We divide the pixel values by 255 to normalize them to [0, 1]. Then we downscale the image size to 3 &#xd7; 64&#x20;&#xd7; 64 pixels, as in <xref ref-type="bibr" rid="B38">Mnih et&#x20;al. (2013)</xref> and <xref ref-type="bibr" rid="B34">Lillicrap et&#x20;al. (2015)</xref>. When the action repeat is one (with TurtleBot Maze), an observation corresponds to the image <bold>o</bold>
<sub>
<italic>t</italic>
</sub> &#x3d; <bold>I</bold>
<sub>
<italic>t</italic>
</sub>. When it is four (with InvertedPendulum and HalfCheetah), an observation corresponds to the stack of the three consecutive images <inline-formula id="inf51">
<mml:math id="m65">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula> of size 9 &#xd7; 64&#x20;&#xd7; 64, as in <xref ref-type="bibr" rid="B34">Lillicrap et&#x20;al. (2015)</xref> and <xref ref-type="bibr" rid="B60">Yarats et&#x20;al. (2019)</xref>, where <italic>t</italic>&#x2032; corresponds to a time scale four times smaller than that of <italic>t</italic> (i.e. <italic>t</italic>&#x2032; &#x3d; 4&#x20;&#xd7; <italic>t</italic>). For our XSRL method, concatenating the images obtained while the last action is repeated avoids losing all the information from these intermediate time steps. This concatenation is thus a compromise between computational complexity and information&#x20;loss.</p>
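<p>The normalization and frame stacking described in this section can be sketched as follows. Images are represented as nested lists (channels, rows, pixels) to keep the example dependency-free, the downscaling step is omitted, and <monospace>render</monospace> is a hypothetical callable returning the image at a given environment time step; none of these names come from the released code.</p>
<preformat><![CDATA[
```python
def normalize(image):
    """Map 8-bit pixel values to [0, 1]; image is a list of channels of rows."""
    return [[[p / 255.0 for p in row] for row in channel] for channel in image]


def stack_last_three(frames):
    """Concatenate the channels of the last three frames (3 x 3 = 9 channels)."""
    stacked = []
    for frame in frames[-3:]:
        stacked.extend(frame)
    return stacked


def observe_with_action_repeat(render, t, repeat=4):
    """Build o_t from the frames at t'-2, t'-1 and t', with t' = repeat * t."""
    t_prime = repeat * t
    steps = (t_prime - 2, t_prime - 1, t_prime)
    frames = [normalize(render(k)) for k in steps]
    return stack_last_three(frames)
```
]]></preformat>
<p>With an action repeat of four, only the last three of the four intermediate images enter the observation, which matches the stack size of nine channels given above.</p>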
</sec>
<sec id="s4-2-6">
<title>4.2.6 Test Datasets</title>
<p>For quantitative performance evaluation of our XSRL algorithm, we use an error measure of the next observation prediction, and for the state-of-the-art RAE baseline, we use an error measure of the next observation reconstruction. To perform those evaluations, we need an appropriate test dataset for each of the three environments described above. To this end, we collected a dataset of 400 varied transitions formed of observation-action pairs. We generated them in two different ways. In the case of TurtleBot Maze, we hand-designed expert trajectories that follow the U-shape of the maze. In the case of InvertedPendulum and HalfCheetah, we executed a policy learned by SAC from the ground truth state&#x20;space.</p>
</sec>
</sec>
<sec id="s4-3">
<title>4.3 Implementation Details</title>
<p>We now detail the implementation of the training procedures for XSRL and SAC. The source code of our implementation is available online.<xref ref-type="fn" rid="fn3">
<sup>3</sup>
</xref> This implementation uses the deep learning library PyTorch (<xref ref-type="bibr" rid="B41">Paszke et&#x20;al., 2017</xref>). The hyperparameters for XSRL are listed in <xref ref-type="table" rid="T3">Table&#x20;3</xref>, and those for SAC that differ from the original implementation of <xref ref-type="bibr" rid="B17">Haarnoja et&#x20;al. (2018)</xref> are listed in <xref ref-type="table" rid="T2">Table&#x20;2</xref>. Preliminary experiments showed that the hyperparameters <inline-formula id="inf52">
<mml:math id="m66">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and <italic>w</italic>
<sub>LPB</sub> (to solve the tradeoff during discovery policy training between maximizing the prediction error of an inverse model and maximizing the <italic>k</italic>-step learning progress bonus on <italic>&#x3c6;</italic>) had little impact on final performance.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Hyperparameters used for SAC [Soft Actor-Critic (<xref ref-type="bibr" rid="B17">Haarnoja et&#x20;al., 2018</xref>)] experiments.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Hyperparameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Episode length of the environments</td>
<td align="center">100 TurtleBot Maze</td>
</tr>
<tr>
<td align="left"/>
<td align="center">1,000 otherwise</td>
</tr>
<tr>
<td align="left">Discount factor <italic>&#x3b3;</italic>
</td>
<td align="center">0.99</td>
</tr>
<tr>
<td align="left">Replay buffer capacity</td>
<td align="center">100,000</td>
</tr>
<tr>
<td align="left">Optimizer</td>
<td align="center">Adam (<xref ref-type="bibr" rid="B26">Kingma and Ba, 2014</xref>)</td>
</tr>
<tr>
<td align="left">Batch size</td>
<td align="center">256</td>
</tr>
<tr>
<td align="left">Update frequency for the critic target model, and actor model</td>
<td align="center">2</td>
</tr>
<tr>
<td align="left">Learning rate for the critic and actor models, and the automatic temperature tuning</td>
<td align="center">5e&#x2212;4</td>
</tr>
<tr>
<td align="left">Hidden units of critic/actor models</td>
<td align="center">128, 512, 128</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>For a fair comparison with the RAE baseline, the same architecture as <italic>&#x3b1;</italic> (a convolutional neural network) and <italic>&#x3c9;</italic> (a transposed convolutional neural network) is used for the encoder and decoder respectively. For the RAE and random network baselines, their neural networks similar to <italic>&#x3b1;</italic> output state representations, while for XSRL the neural network of the forward model (<italic>&#x3c6;</italic>) predicts next state representations. We chose a state representation of 20 dimensions for TurtleBot Maze and InvertedPendulum, and 30 dimensions for HalfCheetah. These values were selected heuristically, based on preliminary experiments, to account for the trade-off between sample efficiency and final performance (i.e. between computation time and the optimal policy performance).</p>
<p>We use the same architecture for the policy (a.k.a. actor model) and the action-value function (a.k.a. critic model) of the SAC algorithm as for the discovery policies, the inverse model and <italic>&#x3b3;</italic> of our XSRL algorithm. This architecture is made of three hidden layers (see <xref ref-type="table" rid="T3">Table&#x20;3</xref>). The total number of parameters in the corresponding neural network is less than that of the neural network architecture with fewer layers used by <xref ref-type="bibr" rid="B60">Yarats et&#x20;al. (2019)</xref>; <xref ref-type="bibr" rid="B19">Hansen et&#x20;al. (2020)</xref> on similar RL tasks, because each layer of our networks is much smaller; see <xref ref-type="bibr" rid="B43">Poggio et&#x20;al. (2017)</xref> for theoretical explanations. Like <xref ref-type="bibr" rid="B60">Yarats et&#x20;al. (2019)</xref>, we use double Q-learning (<xref ref-type="bibr" rid="B54">Van Hasselt et&#x20;al., 2015</xref>) for the critic&#x20;model.</p>
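<p>To make the parameter-count argument concrete, the sketch below counts the weights and biases of a fully connected network. The input dimension of 20, the scalar output and, in particular, the comparison network with two hidden layers of 1,024 units are assumptions chosen for illustration; they are not taken from the text above.</p>
<preformat><![CDATA[
```python
def mlp_parameter_count(n_in, hidden, n_out):
    """Number of weights and biases of a fully connected network."""
    sizes = [n_in] + list(hidden) + [n_out]
    # Each layer contributes (inputs * outputs) weights plus one bias per output.
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))


# Three narrow hidden layers (128, 512, 128), as listed in the tables.
deeper = mlp_parameter_count(20, [128, 512, 128], 1)
# Hypothetical shallower but wider network, for comparison only.
wider = mlp_parameter_count(20, [1024, 1024], 1)
print(deeper, wider)  # prints: 134529 1072129
```
]]></preformat>
<p>Under these assumptions the deeper network has roughly eight times fewer parameters, since the cost of a layer grows with the product of its input and output widths.</p>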
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Hyperparameters used for XSRL experiments.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Hyperparameter</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Episode length for all the environments (after action repeat)</td>
<td align="center">500</td>
</tr>
<tr>
<td align="left">State representation dimension <inline-formula id="inf53">
<mml:math id="m67">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>
</td>
<td align="center">20 TurtleBot Maze; InvertedPendulum</td>
</tr>
<tr>
<td align="left">(i.e. <italic>&#x3b3;</italic> output dimension)</td>
<td align="center">30 HalfCheetah</td>
</tr>
<tr>
<td align="left">
<italic>&#x3b1;</italic> output dimension</td>
<td align="center">30</td>
</tr>
<tr>
<td align="left">
<italic>&#x3b2;</italic> output dimension</td>
<td align="center">
<inline-formula id="inf54">
<mml:math id="m68">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">Intrinsic reward weight terms</td>
<td align="center">
<inline-formula id="inf55">
<mml:math id="m69">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:math>
</inline-formula>, <italic>w</italic>
<sub>LPB</sub> &#x3d; 1</td>
</tr>
<tr>
<td align="left">Optimizer</td>
<td align="center">Adam (<xref ref-type="bibr" rid="B26">Kingma and Ba, 2014</xref>)</td>
</tr>
<tr>
<td align="left">Batch size <italic>B</italic> for <italic>&#x3b1;</italic>, <italic>&#x3b2;</italic>, <italic>&#x3b3;</italic>, <italic>&#x3c9;</italic>
</td>
<td align="center">32</td>
</tr>
<tr>
<td align="left">Batch size <italic>B</italic>
<sub>
<italic>&#x3c0;</italic>
</sub> for <inline-formula id="inf56">
<mml:math id="m70">
<mml:mi mathvariant="script">I</mml:mi>
</mml:math>
</inline-formula> and <italic>&#x3c0;</italic>
</td>
<td align="center">128</td>
</tr>
<tr>
<td align="left">Update interval <italic>T</italic>
<sub>
<italic>&#x3c0;</italic>
</sub> for <inline-formula id="inf57">
<mml:math id="m71">
<mml:mi mathvariant="script">I</mml:mi>
</mml:math>
</inline-formula> and <italic>&#x3c0;</italic>
</td>
<td align="center">512</td>
</tr>
<tr>
<td align="left">Reset interval <italic>T</italic>
<sub>reset</sub>
</td>
<td align="center">4,096 for both discovery policies</td>
</tr>
<tr>
<td align="left">Learning rate for <italic>&#x3b1;</italic>, <italic>&#x3b2;</italic>, <italic>&#x3b3;</italic>, <italic>&#x3c9;</italic>, <inline-formula id="inf58">
<mml:math id="m72">
<mml:mi mathvariant="script">I</mml:mi>
</mml:math>
</inline-formula>, <italic>&#x3c0;</italic>, <inline-formula id="inf59">
<mml:math id="m73">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">H</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>
</td>
<td align="center">1e&#x2212;4</td>
</tr>
<tr>
<td align="left">Hidden units of <inline-formula id="inf60">
<mml:math id="m74">
<mml:mi mathvariant="script">I</mml:mi>
<mml:mo>,</mml:mo>
<mml:mspace width="0.17em"/>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mspace width="0.17em"/>
<mml:mi>&#x3b3;</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="center">128, 512, 128</td>
</tr>
<tr>
<td align="left">Hidden units of <italic>&#x3b2;</italic>
</td>
<td align="center">128, 512, 32</td>
</tr>
<tr>
<td align="left">Hidden units of <italic>&#x3b1;</italic>, <italic>&#x3c9;</italic>:</td>
<td/>
</tr>
<tr>
<td align="left">
<italic>&#x3b1;</italic>
</td>
<td align="center">CNN (strides and filters): (2, 32), (2, 64), (2, 128), (2, 256); MLP hidden units: 1024, 256, 32</td>
</tr>
<tr>
<td align="left">
<italic>&#x3c9;</italic>
</td>
<td align="center">MLP hidden units: 32, 256, 1024; transposed CNN (strides and filters): (1, 256), (2, 128), (2, 64), (2, 32)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The Leaky Rectified Linear Unit (Leaky ReLU) is used for the activation functions between hidden layers. It mitigates the vanishing gradients encountered with the ReLU and improves the convergence speed and stability (which we observed empirically in preliminary experiments); see <xref ref-type="bibr" rid="B59">Xu et&#x20;al. (2015)</xref> for details.</p>
<p>In our RL experiments, the SAC algorithm is only used to test the generalization of the XSRL state representation to unseen control tasks. This implies that we keep the parameters of <italic>&#x3c6;</italic> fixed. Due to memory constraints, for all experiments, we use a smaller replay buffer capacity than comparable work: 100,000 instead of the 1,000,000 used by <xref ref-type="bibr" rid="B60">Yarats et&#x20;al. (2019)</xref>.</p>
<sec id="s4-3-1">
<title>4.3.1 Hardware Details</title>
<p>All our experiments are performed on three computers, each containing 40 cores and a Titan Xp GPU provided by Nvidia.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5 Experimental Results</title>
<sec id="s5-1">
<title>5.1 Evaluations of XSRL Representations and Exploration</title>
<p>In this section, we show the results of our quantitative and qualitative evaluations to validate whether XSRL fulfills criteria (i), (ii), and (iii), which we defined in <xref ref-type="sec" rid="s4">Section&#x20;4</xref>.</p>
<p>
<xref ref-type="fig" rid="F4">Figure&#x20;4</xref> reports the results of the error measure obtained on a training dataset and a test dataset (defined in <xref ref-type="sec" rid="s4-2-6">Section 4.2.6</xref>) on each of the three environments. This error measure corresponds to the prediction error of the next observation for XSRL and the two ablations (XSRL-random and XSRL-MaxEnt); it corresponds to the reconstruction error of the next observation for RAE (<xref ref-type="bibr" rid="B14">Ghosh et&#x20;al., 2019</xref>) (following a random exploration) and RAE-explor (following an effective exploration) as defined in <xref ref-type="sec" rid="s4-1">Section&#x20;4.1</xref>.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Error measure results (the lower the better) obtained on a training dataset (top row) and a test dataset (bottom row) (which is defined for each environment in <xref ref-type="sec" rid="s4-2-6">Section 4.2.6</xref>), averaged across 5 runs (with different random seeds). This measure corresponds to the prediction of <bold>o</bold>
<sub>
<italic>t</italic>&#x2b;1</sub> with XSRL, and to the reconstruction of <bold>o</bold>
<sub>
<italic>t</italic>&#x2b;1</sub> with RAE. XSRL (w/distractor) is performed in TurtleBot Maze with a randomly sampled wall color (as defined in <xref ref-type="sec" rid="s4">Section 4</xref>). XSRL-MaxEnt and XSRL-random are XSRL ablations that follow the entropy maximization strategy and random sampling respectively. RAE-explor benefits from effective exploration (described in <xref ref-type="sec" rid="s4-1">Section 4.1</xref>) while RAE follows only random exploration.</p>
</caption>
<graphic xlink:href="frobt-09-762051-g004.tif"/>
</fig>
<p>In two of the environments, TurtleBot Maze and HalfCheetah, we observe that the error measure for XSRL is higher than that for RAE and RAE-explor on both training and test datasets. This does not reflect a poor exploration performance of XSRL but rather its objective function, which is more demanding than that of RAE. Indeed, all information in the next observation that cannot be predicted from the current time step is ignored, as is the case for random distractors or overly complex information from the transition model, which tends to increase the prediction error. Furthermore, the qualitative results in <xref ref-type="fig" rid="F5">Figure&#x20;5</xref> show that XSRL captures well what is relevant to predict the part of the observation that can be explained by agent actions, but ignores less useful or redundant information. For example, in TurtleBot Maze it predicts walls with relatively good precision despite their potentially small size (see the purple wall in <xref ref-type="fig" rid="F5">Figure&#x20;5I</xref>), but it predicts the checkerboard pattern on the floor less accurately. The former is related to the global information on the topology of the maze, while the latter is&#x20;not.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>One of the most complex transitions in each of the test datasets (defined in <xref ref-type="sec" rid="s4-2-6">Section 4.2.6</xref>) of the three environments. From left to right: TurtleBot Maze, TurtleBot Maze w/distractor, InvertedPendulum and HalfCheetah. The top row shows: (left) the locations of <bold>(A)</bold> and <bold>(E)</bold>; (right) the locations of <bold>(B,F)</bold>. In the bottom row, <bold>(I&#x2013;L)</bold> show the corresponding <bold>o</bold>
<sub>
<italic>t</italic>&#x2b;1</sub> predictions of XSRL. In <bold>(J)</bold> XSRL tends to filter the random wall color because it predicts a neutral gray color instead. For InvertedPendulum and HalfCheetah environments, as the action is repeated four times, <bold>o</bold>
<sub>
<italic>t</italic>
</sub> corresponds to the three consecutive images <inline-formula id="inf61">
<mml:math id="m75">
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula> (as defined in <xref ref-type="sec" rid="s4-2-5">Section 4.2.5</xref>).</p>
</caption>
<graphic xlink:href="frobt-09-762051-g005.tif"/>
</fig>
<p>The results tend to show that representations learned by XSRL follow Markovian transitions, which is criterion (i). Indeed, the representations learned by XSRL can predict the observation change related to robot actions from the current time step only. This is a consequence of the fact that XSRL is based on recursive state estimation predictions (see <xref ref-type="sec" rid="s3-1">Section&#x20;3.1</xref>).</p>
<p>In TurtleBot Maze with a distractor represented by a wall color that is randomly sampled after every transition (as defined in <xref ref-type="sec" rid="s4">Section 4</xref>), the gray wall predicted by <italic>&#x3c9;</italic> in <xref ref-type="fig" rid="F5">Figure&#x20;5J</xref> shows that the random colors are ignored by the forward model <italic>&#x3c6;</italic>. Using its forward model, XSRL learns state representations which filter out stochastic information and more generally information that is unnecessary to predict the motion of the system, which is criterion&#x20;(ii).</p>
<p>We evaluated XSRL discovery policies with a quantitative evaluation of maze exploration presented in <xref ref-type="fig" rid="F6">Figure&#x20;6</xref>. <xref ref-type="fig" rid="F6">Figure&#x20;6</xref> shows that XSRL discovery policies lead more quickly to episodes (of 500&#x20;time steps) in which agents reach the other end of the maze. Specifically, with XSRL-random, agents can almost never reach the other end of the maze in only 500&#x20;time steps. We observe no significant decrease in performance of XSRL with a distractor in TurtleBot Maze. Furthermore, with and without a distractor, XSRL exploration reaches the end of the maze almost twice as fast as with XSRL-MaxEnt. These results tend to confirm that XSRL discovery policies are successful in guiding agents quickly to diverse and learnable transitions, without being affected by the presence of distractors, which is criterion (iii).</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>
<bold>(A)</bold> Shows the top view of TurtleBot Maze where the robot&#x2019;s position corresponds to the constant initial state, hence the complexity of crossing the dotted red line in less than 500&#x20;time steps. <bold>(B,C)</bold> Number of training steps on <inline-formula id="inf62">
<mml:math id="m76">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> before an agent crosses the dotted red line during XSRL training (mean &#xb1; standard deviation over 10 runs; the lower the better). Remark: a training step on <inline-formula id="inf63">
<mml:math id="m77">
<mml:mi mathvariant="script">L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> corresponds to a time step for each of the 32 agents. Our XSRL exploration strategy outperforms XSRL-MaxEnt, while XSRL-random provides an upper&#x20;bound.</p>
</caption>
<graphic xlink:href="frobt-09-762051-g006.tif"/>
</fig>
<p>The video available at <ext-link ext-link-type="uri" xlink:href="https://youtu.be/IbGa-TC7wek">https://youtu.be/IbGa-TC7wek</ext-link> shows a comparative evaluation between XSRL exploration (left) and random exploration (right) in each of the three environments. It highlights that discovery policies learned by XSRL allow: in TurtleBot Maze to quickly visit transitions far from the initial state position (as shown in the previous results); in InvertedPendulum to balance the pendulum upwards while it is initialized downwards with zero velocity; in HalfCheetah to keep the robot constantly moving and exploring various kinds of postures. We can see that for random exploration: in TurtleBot Maze the robot barely moves away from its constant initial state; in InvertedPendulum the pendulum is never upwards; in HalfCheetah it is difficult for the robot to stay in motion since it ends up stuck in a lying position.</p>
<p>In addition to these qualitative and quantitative comparisons, the better performance of XSRL exploration is also confirmed by the quantitative evaluation of the prediction error measure on test datasets for XSRL and its ablations (<xref ref-type="fig" rid="F4">Figure&#x20;4</xref>). This measure reaches its lowest value with XSRL exploration, followed by XSRL-MaxEnt and finally XSRL-random which is by far the worst strategy.</p>
<p>Apart from the comparative study of our XSRL exploration, we observe that an effective exploration improves the generalization performance of RAE models, which could be expected. Indeed, the quantitative evaluation of the observation reconstruction shows a smaller error on the test dataset with RAE-explor (which is trained with an effective exploration defined in <xref ref-type="sec" rid="s4-1">Section 4.1</xref>) than with RAE (see <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>).</p>
<p>Qualitative and quantitative performance differences with respect to exploration strategies show the advantage of quickly visiting diverse transitions during state embedding pretraining to obtain better generalization performance over new transitions. However, as we see below, it is only with XSRL that the low error measure translates into good transfer performance on a new RL&#x20;task.</p>
</sec>
<sec id="s5-2">
<title>5.2 XSRL Representations Transfer</title>
<p>In this section, we show quantitative evaluations to validate whether state estimators pretrained with XSRL provide advantageous inputs to RL algorithms for solving three unseen control tasks (which is an instance of criterion (iv) defined in <xref ref-type="sec" rid="s4">Section 4</xref>). In particular, we use the deep RL algorithm SAC (Soft Actor-Critic) (<xref ref-type="bibr" rid="B17">Haarnoja et&#x20;al., 2018</xref>) which has shown promising results on the standard continuous control tasks InvertedPendulum and HalfCheetah. Throughout these experiments, all parameters of the pretrained state embeddings (with XSRL and RAE) are kept fixed: only the actor and critic neural networks of SAC are trained. We performed 10 runs with different random seeds, as in <xref ref-type="bibr" rid="B20">Henderson et&#x20;al. (2018)</xref> and <xref ref-type="bibr" rid="B60">Yarats et&#x20;al. (2019)</xref>, resulting in 10 different trained policies for each of the representation strategies. For each state embedding pretraining approach (XSRL and RAE) and for the random network, we used 5 different models trained with different random seeds, from each of which 2 SAC runs with different random seeds are executed. In addition, unlike the ground truth, open-loop and position baselines, these three approaches transform visual observations into compact state representations of 20 dimensions for TurtleBot Maze and InvertedPendulum, and 30 dimensions for HalfCheetah (as explained in <xref ref-type="sec" rid="s4-3">Section&#x20;4.3</xref>).</p>
<p>
<xref ref-type="fig" rid="F7">Figure&#x20;7</xref> shows the learning curves of the episode returns averaged over 10 episodes across 10 different runs. After training, we measured the episode returns averaged over 100 episodes for the 10 different trained policies, which are displayed in <xref ref-type="table" rid="T4">Table&#x20;4</xref>. For clarity, we normalized all episode returns between the average performance of SAC &#x2b; open-loop and that of SAC &#x2b; ground truth, except for the TurtleBot Maze task, which is evaluated with the probability of reaching the goal (from a random initial configuration) in 100&#x20;time steps or less. SAC &#x2b; ground truth is an upper bound because it has direct access to the agent&#x2019;s proprioceptive information, and SAC &#x2b; open-loop is a lower bound because it corresponds to a blind agent. These results show that only XSRL state representations perform well in all three RL tasks, unlike the other state representation baselines.</p>
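<p>As an illustrative sketch of the normalization of episode returns, and assuming it is a linear rescaling between the two reference baselines, a raw episode return <italic>R</italic> would map to a normalized score as shown below. The symbols are introduced here only for illustration and are not part of the article&#x2019;s notation: the two subscripted quantities denote the average returns of SAC &#x2b; open-loop and SAC &#x2b; ground truth on the same task.</p>
<disp-formula>
<tex-math>\bar{R} = \frac{R - R_{\mathrm{open\text{-}loop}}}{R_{\mathrm{ground\ truth}} - R_{\mathrm{open\text{-}loop}}}</tex-math>
</disp-formula>
<p>Under this reading, a score of 0 matches the open-loop average and a score of 1 matches the ground truth average, which is consistent with the corresponding rows of <xref ref-type="table" rid="T4">Table&#x20;4</xref> for InvertedPendulum and HalfCheetah.</p>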
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Learning curves of the episode returns averaged over 10 episodes (mean in lines and standard deviation in shaded areas over 10 runs; the higher the better) of SAC with different state representation strategies on three different continuous control tasks. The XSRL pretrained representations are the only ones to perform well in all three environments, while ground truth and open-loop provide an upper and lower bound respectively. A video showing the corresponding learned policies can be found at <ext-link ext-link-type="uri" xlink:href="https://youtu.be/XpRcU75i-iQ">https://youtu.be/XpRcU75i-iQ</ext-link>.</p>
</caption>
<graphic xlink:href="frobt-09-762051-g007.tif"/>
</fig>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Episode returns after convergence of the curves in <xref ref-type="fig" rid="F7">Figure&#x20;7</xref> averaged over 100 episodes (mean &#xb1; standard deviation over 10 runs; the higher the better).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Mean score</th>
<th align="center">TurtleBot Maze</th>
<th align="center">InvertedPendulum</th>
<th align="center">HalfCheetah</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">SAC &#x2b; XSRL</td>
<td align="char" char="plusmn">
<bold>0.98</bold>&#x20;&#xb1; <bold>0.02</bold>
</td>
<td align="char" char="plusmn">
<bold>1</bold>&#x20;&#xb1; <bold>0</bold>
</td>
<td align="char" char="plusmn">0.82&#x20;&#xb1; 0.03</td>
</tr>
<tr>
<td align="left">SAC &#x2b; RAE-explor</td>
<td align="char" char="plusmn">0.34&#x20;&#xb1; 0.04</td>
<td align="char" char="plusmn">0.99&#x20;&#xb1; 0</td>
<td align="char" char="plusmn">
<bold>0.87</bold>&#x20;&#xb1; <bold>0.09</bold>
</td>
</tr>
<tr>
<td align="left">SAC &#x2b; RAE</td>
<td align="char" char="plusmn">0.34&#x20;&#xb1; 0.06</td>
<td align="char" char="plusmn">0.93&#x20;&#xb1; 0.03</td>
<td align="char" char="plusmn">0.85&#x20;&#xb1; 0.08</td>
</tr>
<tr>
<td align="left">SAC &#x2b; random network</td>
<td align="char" char="plusmn">0.27&#x20;&#xb1; 0.1</td>
<td align="char" char="plusmn">0.74&#x20;&#xb1; 0.02</td>
<td align="char" char="plusmn">0.31&#x20;&#xb1; 0.05</td>
</tr>
<tr>
<td align="left">SAC &#x2b; ground truth</td>
<td align="char" char="plusmn">0.98&#x20;&#xb1; 0.02</td>
<td align="char" char="plusmn">1&#x20;&#xb1; 0</td>
<td align="char" char="plusmn">1&#x20;&#xb1; 0.1</td>
</tr>
<tr>
<td align="left">SAC &#x2b; open-loop</td>
<td align="char" char="plusmn">0.04&#x20;&#xb1; 0.03</td>
<td align="char" char="plusmn">0&#x20;&#xb1; 0.06</td>
<td align="char" char="plusmn">0&#x20;&#xb1; 0</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>
<xref ref-type="fig" rid="F7">Figure&#x20;7B</xref> shows that the position baseline does not allow SAC to learn a good policy on the InvertedPendulum task. This confirms that the InvertedPendulum and HalfCheetah tasks require information about the positions and velocities of the agent&#x2019;s joints to follow Markovian state transitions, which are only related to the local coherence of the environment (<xref ref-type="bibr" rid="B32">Lesort et&#x20;al., 2018</xref>). According to <xref ref-type="fig" rid="F7">Figure&#x20;7</xref> and <xref ref-type="table" rid="T4">Table&#x20;4</xref>, SAC &#x2b; XSRL and SAC &#x2b; RAE-explor achieve about the same performance on both torque-controlled environments (InvertedPendulum and HalfCheetah). While on InvertedPendulum they catch up to the ground truth performance, on HalfCheetah they remain slightly below it. This is because HalfCheetah is harder to control than InvertedPendulum, as the former has six degrees of freedom and the latter only&#x20;one.</p>
<p>In TurtleBot Maze, none of the state representation strategies other than XSRL was successful on the navigation task. In addition, a random network is not a viable strategy in any of the three environments, hence the need for a representation learning strategy. Furthermore, as shown by <xref ref-type="fig" rid="F7">Figure&#x20;7A</xref>, the performance of SAC &#x2b; XSRL is the same in TurtleBot Maze with a distractor (where XSRL was pretrained with the distractor). This tends to show that XSRL representations can capture information about the environment topology to encode the&#x20;orientation and position of the robot. As previously explained in <xref ref-type="sec" rid="s3-1">Section 3.1</xref>, this is related to the ability to follow Markovian state transitions despite perceptual aliasing (<xref ref-type="bibr" rid="B8">Cadena et&#x20;al., 2016</xref>).</p>
<p>Overall, these quantitative evaluations show that state estimators pretrained with XSRL provide advantageous inputs to solve unseen RL tasks with the SAC algorithm, which is an instance of criterion (iv). This confirms that by memorizing the information useful for predicting the consequences of the robot&#x2019;s action in the next observation, XSRL representations can encode the robot&#x2019;s configuration in a state space that exhibits Markovian transitions (useful to control it with RL), while filtering out unnecessary information (useful for generalization on new transitions).</p>
</sec>
</sec>
<sec id="s6">
<title>6 Discussion</title>
<p>Experimental results show that our proposed XSRL algorithm builds state representations that perform well on three unseen RL tasks. We see a link between the generalization performance of XSRL on its next observation prediction objective (see <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>) and the transfer performance of its pretrained state estimator (<italic>&#x3c6;</italic>) to a new RL task (see <xref ref-type="table" rid="T4">Table&#x20;4</xref>). Specifically, when XSRL achieves good prediction performance on a test dataset, this tends to imply good transfer performance to new RL tasks. In contrast, our results on TurtleBot Maze showed that the generalization performance of RAE did not guarantee a good transfer to an RL&#x20;task.</p>
<p>The generalization performance on the test dataset strongly depends on the exploration efficiency (see <xref ref-type="fig" rid="F4">Figure&#x20;4</xref>), which is better with XSRL than with its two ablations. Our exploration allows agents to reach transitions far away from their initial states and much faster than the policy entropy maximization and random strategies (see <xref ref-type="fig" rid="F6">Figure&#x20;6</xref>).</p>
<p>Instead of dedicated policies, the exploration strategy could rely on count-based methods (<xref ref-type="bibr" rid="B51">Tang et&#x20;al., 2017</xref>). This might lead to promising extensions of XSRL, with more direct ways to encourage the agent to visit states it has never seen before. However, this approach raises the challenge of keeping the state counts up-to-date and relevant during the whole representation learning process, which requires constantly updating state visitation statistics while the state space changes.</p>
<p>Another promising avenue for XSRL is to extend it to the case where partial observability can be handled not only with memory, but also via active perception (<xref ref-type="bibr" rid="B9">Chrisman, 1992</xref>; <xref ref-type="bibr" rid="B36">McCallum, 1993</xref>; <xref ref-type="bibr" rid="B58">Whitehead and Lin, 1995</xref>). This would require both a modification of the representation learning procedure, in order to take into account information that may be related to hidden aspects of the state, and a modification of the exploration strategy to specifically aim at discovering and exploiting information that removes ambiguity about the true state of the&#x20;agent.</p>
<p>In this work, we are interested in a state representation that makes the evolution of the system predictable. XSRL tends to filter out information that is unnecessary for this purpose. However, this can be an issue if, in a new RL task, rewards are not related to the evolution of the system. Consider, for example, a task in which an agent must respond to a color signal: since this information is not controlled by the robot, it will be unpredictable for XSRL and thus filtered from its state representations. Solving this kind of problem is outside the scope of this paper, since we are specifically interested in learning state representations <italic>before</italic> being exposed to various RL tasks and reward signals.</p>
<p>Overall, experimental results have highlighted the main advantage of XSRL in learning state embeddings that can capture both the local coherence of the environment and global information about its topology. While the state-of-the-art RAE method succeeds in encoding the former, it fails in encoding the latter, and leads to significantly worse results in the TurtleBot Maze navigation task (see <xref ref-type="table" rid="T4">Table&#x20;4</xref>).</p>
</sec>
<sec id="s7">
<title>7 Conclusion</title>
<p>We have presented an SRL algorithm (XSRL) that trains discovery policies for efficient exploration and pretrains state representations at the same time. Our experiments show that XSRL exploration provides fast maze traversal compared to random policy and policy entropy maximization strategies. Moreover, our comparative evaluation on unseen RL tasks confirms the transfer efficiency of the pretrained XSRL models. One of the most striking results is the superiority of XSRL representations over autoencoder ones, which we attribute to better representational properties, since the constructed states are constrained to follow Markovian transitions. Furthermore, these results highlight the importance of an efficient exploration strategy in state representation pretraining approaches, and more generally in the SRL framework.</p>
</sec>
</body>
<back>
<sec id="s8">
<title>Data Availability Statement</title>
<p>The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="s9">
<title>Author Contributions</title>
<p>AM: designed the proposed algorithm, implemented the experiments, and also wrote the article. SD, NP-G, and AC: supervised the project and provided guidance and feedback, and also helped with the writing of the article.</p>
</sec>
<sec id="s10">
<title>Funding</title>
<p>This work has been sponsored by the Labex SMART supported by French state funds managed by the ANR within the Investissements d&#x2019;Avenir program under references ANR-11-LABX-65 and ANR-18-CE33-0005 HUSKI, and by the project VeriDream that has received funding from the European Union&#x2019;s H2020-EU.1.2.2. research and innovation program under grant agreement No. 951&#x2009;992.</p>
</sec>
<sec sec-type="COI-statement" id="s11">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s12">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ack>
<p>We gratefully acknowledge the support of NVIDIA Corporation with the donation of one Titan Xp GPU used for this research.</p>
</ack>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>We used the 2D transposed convolution operator provided by PyTorch.</p>
</fn>
<fn id="fn2">
<label>2</label>
<p>In TurtleBot Maze, an agent must perform 47 actions of maximum amplitude to cross the&#x20;maze.</p>
</fn>
<fn id="fn3">
<label>3</label>
<p>
<ext-link ext-link-type="uri" xlink:href="https://github.com/astrid-merckling/SRL4RL">https://github.com/astrid-merckling/SRL4RL</ext-link>.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Achiam</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Sastry</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Surprise-based Intrinsic Motivation for Deep Reinforcement Learning</article-title>. <comment>arXiv preprint arXiv:1703.01732</comment> </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Assael</surname>
<given-names>J.-A. M.</given-names>
</name>
<name>
<surname>Wahlstr&#xf6;m</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Sch&#xf6;n</surname>
<given-names>T. B.</given-names>
</name>
<name>
<surname>Deisenroth</surname>
<given-names>M. P.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Data-efficient Learning of Feedback Policies from Image Pixels Using Deep Dynamical Models</article-title>. <comment>arXiv preprint arXiv:1510.02173</comment> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>B&#xf6;hmer</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Gr&#xfc;new&#xe4;lder</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Musial</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Obermayer</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Construction of Approximation Spaces for Reinforcement Learning</article-title>. <source>J.&#x20;Machine Learn. Res.</source> <volume>14</volume>, <fpage>2067</fpage>&#x2013;<lpage>2118</lpage>. </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>B&#xf6;hmer</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Springenberg</surname>
<given-names>J.&#x20;T.</given-names>
</name>
<name>
<surname>Boedecker</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Riedmiller</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Obermayer</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Autonomous Learning of State Representations for Control: An Emerging Field Aims to Autonomously Learn State Representations for Reinforcement Learning Agents from Their Real-World Sensor Observations</article-title>. <source>KI-K&#xfc;nstliche Intelligenz</source> <volume>29</volume>, <fpage>353</fpage>&#x2013;<lpage>362</lpage>. </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brockman</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Cheung</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Pettersson</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Schneider</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Schulman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>OpenAI Gym</article-title>. <comment>arXiv preprint arXiv:1606.01540</comment> </citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bubeck</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Munos</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Stoltz</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Pure Exploration in Multi-Armed Bandits Problems</article-title>. <conf-name>International conference on Algorithmic learning theory</conf-name>. <publisher-name>Springer</publisher-name>, <fpage>23</fpage>&#x2013;<lpage>37</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-04414-4_7</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Burda</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Pathak</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Storkey</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Darrell</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Efros</surname>
<given-names>A. A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Large-scale Study of Curiosity-Driven Learning</article-title>. <comment>arXiv preprint arXiv:1808.04355</comment> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cadena</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Carlone</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Carrillo</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Latif</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Scaramuzza</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Neira</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age</article-title>. <source>IEEE Trans. Robot.</source> <volume>32</volume>, <fpage>1309</fpage>&#x2013;<lpage>1332</lpage>. <pub-id pub-id-type="doi">10.1109/tro.2016.2624754</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chrisman</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>1992</year>). <article-title>Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach</article-title>. <source>AAAI (Citeseer)</source>, <fpage>183</fpage>&#x2013;<lpage>188</lpage>. </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chua</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Calandra</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>McAllister</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Deep Reinforcement Learning in a Handful of Trials Using Probabilistic Dynamics Models</article-title>. In <conf-name>Advances in Neural Information Processing Systems</conf-name>. <fpage>4759</fpage>&#x2013;<lpage>4770</lpage>. </citation>
</ref>
<ref id="B11">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Coumans</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Bai</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2016&#x2013;2019</year>). <article-title>PyBullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://pybullet.org">http://pybullet.org</ext-link>
</comment>. </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Bruin</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kober</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tuyls</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Babuska</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Integrating State Representation Learning into Deep Reinforcement Learning</article-title>. <source>IEEE Robot. Autom. Lett.</source> <volume>3</volume>, <fpage>1394</fpage>&#x2013;<lpage>1401</lpage>. <pub-id pub-id-type="doi">10.1109/lra.2018.2800101</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gaier</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Ha</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Weight Agnostic Neural Networks</article-title>. In <conf-name>Advances in Neural Information Processing Systems</conf-name>. <fpage>5364</fpage>&#x2013;<lpage>5378</lpage>. </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ghosh</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Sajjadi</surname>
<given-names>M. S.</given-names>
</name>
<name>
<surname>Vergari</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Black</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Sch&#xf6;lkopf</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>From Variational to Deterministic Autoencoders</article-title>. <comment>arXiv preprint arXiv:1903.12436</comment> </citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Glasmachers</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Limits of End-To-End Learning</article-title>,&#x201d; <conf-name>Proceedings of The 9th Asian Conference on Machine Learning, ACML 2017 of Proceedings of Machine Learning Research</conf-name>. <conf-loc>Seoul, Korea</conf-loc>. <conf-date>November 15-17, 2017</conf-date>. Editors <person-group person-group-type="editor">
<name>
<surname>Zhang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Noh</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<publisher-name>PMLR</publisher-name>), <volume>77</volume>, <fpage>17</fpage>&#x2013;<lpage>32</lpage>. </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Haarnoja</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Reinforcement Learning with Deep Energy-Based Policies</article-title>. <comment>arXiv preprint arXiv:1702.08165</comment> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Haarnoja</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hartikainen</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Tucker</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Ha</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Soft Actor-Critic Algorithms and Applications</article-title>. <comment>CoRR abs/1812.05905</comment> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hafner</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Lillicrap</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Villegas</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Ha</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Learning Latent Dynamics for Planning from Pixels</article-title>. <comment>arXiv preprint arXiv:1811.04551</comment> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hansen</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Efros</surname>
<given-names>A. A.</given-names>
</name>
<name>
<surname>Pinto</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Self-supervised Policy Adaptation during Deployment</article-title>. <comment>CoRR abs/2007.04309</comment> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Henderson</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Islam</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Bachman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Pineau</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Precup</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Meger</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Deep Reinforcement Learning that Matters</article-title>. In <conf-name>Thirty-Second AAAI Conference on Artificial Intelligence</conf-name> </citation>
</ref>
<ref id="B21">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hester</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Stone</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Intrinsically Motivated Model Learning for a Developing Curious Agent</article-title>. <conf-name>2012 IEEE international conference on development and learning and epigenetic robotics (ICDL)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>1</fpage>&#x2013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/devlrn.2012.6400802</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jaderberg</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Mnih</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Czarnecki</surname>
<given-names>W. M.</given-names>
</name>
<name>
<surname>Schaul</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Leibo</surname>
<given-names>J.&#x20;Z.</given-names>
</name>
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Reinforcement Learning with Unsupervised Auxiliary Tasks</article-title>. <conf-name>5th International Conference on Learning Representations, ICLR 2017</conf-name>, <conf-loc>Toulon, France</conf-loc>. <conf-date>April 24-26, 2017</conf-date>. <publisher-name>Conference Track Proceedings OpenReview.net</publisher-name>. </citation>
</ref>
<ref id="B23">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jarrett</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Kavukcuoglu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Ranzato</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>LeCun</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>What Is the Best Multi-Stage Architecture for Object Recognition?</article-title> <conf-name>2009 IEEE 12th international conference on computer vision</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>2146</fpage>&#x2013;<lpage>2153</lpage>. </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jonschkowski</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Brock</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Learning State Representations with Robotic Priors</article-title>. <source>Auton. Robot</source> <volume>39</volume>, <fpage>407</fpage>&#x2013;<lpage>428</lpage>. <pub-id pub-id-type="doi">10.1007/s10514-015-9459-7</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jonschkowski</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Brock</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Learning Task-specific State Representations by Maximizing Slowness and Predictability</article-title>. In <conf-name>6th international workshop on evolutionary and reinforcement learning for autonomous robot systems (ERLARS)</conf-name> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kingma</surname>
<given-names>D. P.</given-names>
</name>
<name>
<surname>Ba</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Adam: A Method for Stochastic Optimization</article-title>. <comment>arXiv preprint arXiv:1412.6980</comment> </citation>
</ref>
<ref id="B27">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kingma</surname>
<given-names>D. P.</given-names>
</name>
<name>
<surname>Welling</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Auto-encoding Variational Bayes</article-title>,&#x201d; Editors <person-group person-group-type="editor">
<name>
<surname>Bengio</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>LeCun</surname>
<given-names>Y.</given-names>
</name>
</person-group>. <conf-name>2nd International Conference on Learning Representations, ICLR 2014</conf-name>. <conf-loc>Banff, AB, Canada</conf-loc>. <conf-date>April 14-16, 2014</conf-date> </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kostrikov</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Yarats</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Fergus</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels</article-title>. <comment>CoRR abs/2004.13649</comment> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lange</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Riedmiller</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Deep Auto-Encoder Neural Networks in Reinforcement Learning</article-title>. In <conf-name>The 2010 International Joint Conference on Neural Networks (IJCNN) (IEEE)</conf-name>, <fpage>1</fpage>&#x2013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/ijcnn.2010.5596468</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Laskin</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Stooke</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pinto</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Srinivas</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Reinforcement Learning with Augmented Data</article-title>. <comment>CoRR abs/2004.14990</comment> </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>A. X.</given-names>
</name>
<name>
<surname>Nagabandi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Levine</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model</article-title>. <comment>arXiv preprint arXiv:1907.00953</comment> </citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lesort</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>D&#xed;az-Rodr&#xed;guez</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Goudou</surname>
<given-names>J.-F.</given-names>
</name>
<name>
<surname>Filliat</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>State Representation Learning for Control: An Overview</article-title>. <source>Neural Networks</source> <volume>108</volume>, <fpage>379</fpage>&#x2013;<lpage>392</lpage>. <pub-id pub-id-type="doi">10.1016/j.neunet.2018.07.006</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Deep Reinforcement Learning</article-title>. <comment>CoRR abs/1810.06339</comment> </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lillicrap</surname>
<given-names>T. P.</given-names>
</name>
<name>
<surname>Hunt</surname>
<given-names>J.&#x20;J.</given-names>
</name>
<name>
<surname>Pritzel</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Heess</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Erez</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tassa</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>Continuous Control with Deep Reinforcement Learning</article-title>. <comment>arXiv preprint arXiv:1509.02971</comment> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lopes</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Toussaint</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Oudeyer</surname>
<given-names>P.-Y.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Exploration in Model-Based Reinforcement Learning by Empirically Estimating Learning Progress</article-title>. In <conf-name>Advances in Neural Information Processing Systems</conf-name>, <fpage>206</fpage>&#x2013;<lpage>214</lpage>. </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>McCallum</surname>
<given-names>R. A.</given-names>
</name>
</person-group> (<year>1993</year>). <article-title>Overcoming Incomplete Perception with Utile Distinction Memory</article-title>. In <conf-name>Proceedings of the Tenth International Conference on Machine Learning</conf-name>. <fpage>190</fpage>&#x2013;<lpage>196</lpage>. <pub-id pub-id-type="doi">10.1016/b978-1-55860-307-3.50031-9</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Merckling</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Coninx</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Cressot</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Doncieux</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Perrin</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>State Representation Learning from Demonstration</article-title>. <conf-name>International Conference on Machine Learning, Optimization, and Data Science</conf-name>. <publisher-name>Springer</publisher-name>, <fpage>304</fpage>&#x2013;<lpage>315</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-64580-9_26</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mnih</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Kavukcuoglu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Graves</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Antonoglou</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Wierstra</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2013</year>). <article-title>Playing Atari with Deep Reinforcement Learning</article-title>. <comment>CoRR abs/1312.5602</comment> </citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Morik</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Rastogi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Jonschkowski</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Brock</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>State&#x20;Representation Learning with Robotic Priors for Partially Observable Environments</article-title>. <source>IROS</source>, <fpage>6693</fpage>&#x2013;<lpage>6699</lpage>. <pub-id pub-id-type="doi">10.1109/iros40897.2019.8967938</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Oudeyer</surname>
<given-names>P.-Y.</given-names>
</name>
<name>
<surname>Kaplan</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Hafner</surname>
<given-names>V. V.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Intrinsic Motivation Systems for Autonomous Mental Development</article-title>. <source>IEEE Trans. Evol. Computat.</source> <volume>11</volume>, <fpage>265</fpage>&#x2013;<lpage>286</lpage>. <pub-id pub-id-type="doi">10.1109/tevc.2006.890271</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Paszke</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gross</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chintala</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chanan</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>DeVito</surname>
<given-names>Z.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <source>Automatic Differentiation in PyTorch</source>. </citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pathak</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Agrawal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Efros</surname>
<given-names>A. A.</given-names>
</name>
<name>
<surname>Darrell</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Curiosity-driven Exploration by Self-Supervised Prediction</article-title>. In <conf-name>International Conference on Machine Learning (ICML)</conf-name>. vol. <volume>2017</volume>. <pub-id pub-id-type="doi">10.1109/cvprw.2017.70</pub-id> </citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Poggio</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Mhaskar</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Rosasco</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Miranda</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Liao</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Why and when Can Deep-But Not Shallow-Networks Avoid the Curse of Dimensionality: a Review</article-title>. <source>Int. J.&#x20;Autom. Comput.</source> <volume>14</volume>, <fpage>503</fpage>&#x2013;<lpage>519</lpage>. <pub-id pub-id-type="doi">10.1007/s11633-017-1054-2</pub-id> </citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Robbins</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Monro</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>1951</year>). <article-title>A Stochastic Approximation Method</article-title>. <source>Ann. Math. Statist.</source> <volume>22</volume>, <fpage>400</fpage>&#x2013;<lpage>407</lpage>. <pub-id pub-id-type="doi">10.1214/aoms/1177729586</pub-id> </citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schmidhuber</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>1991</year>). <article-title>Adaptive Confidence and Adaptive Curiosity</article-title>. <source>Tech. rep., Citeseer</source>. </citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sekar</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Rybkin</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Daniilidis</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Hafner</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Pathak</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Planning to Explore via Self-Supervised World Models</article-title>. <comment>arXiv preprint arXiv:2005.05960</comment> </citation>
</ref>
<ref id="B47">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sermanet</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Lynch</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Chebotar</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jang</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Schaal</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>Time-contrastive Networks: Self-Supervised Learning from Video</article-title>. <conf-name>2018 IEEE International Conference on Robotics and Automation (ICRA)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>1134</fpage>&#x2013;<lpage>1141</lpage>. <pub-id pub-id-type="doi">10.1109/icra.2018.8462891</pub-id> </citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shelhamer</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Mahmoudieh</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Argus</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Darrell</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Loss Is its Own Reward: Self-Supervision for Reinforcement Learning</article-title>. <comment>arXiv preprint arXiv:1612.07307</comment> </citation>
</ref>
<ref id="B49">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Shyam</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ja&#x15b;kowski</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Gomez</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Model-based Active Exploration</article-title>. <conf-name>International Conference on Machine Learning</conf-name>. <publisher-name>PMLR</publisher-name>, <fpage>5779</fpage>&#x2013;<lpage>5788</lpage>. </citation>
</ref>
<ref id="B50">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Srinivas</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Laskin</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>CURL: Contrastive Unsupervised Representations for Reinforcement Learning</article-title>. <comment>CoRR abs/2004.04136</comment> </citation>
</ref>
<ref id="B51">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Houthooft</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Foote</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Stooke</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>&#x23;Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning</article-title>,&#x201d; in <source>Advances in Neural Information Processing Systems</source>. Editors <person-group person-group-type="editor">
<name>
<surname>Guyon</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Luxburg</surname>
<given-names>U. V.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wallach</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Fergus</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Vishwanathan</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<publisher-loc>Long Beach Convention Center, Long Beach</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <volume>30</volume>. </citation>
</ref>
<ref id="B52">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tassa</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Doron</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Muldal</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Erez</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Casas</surname>
<given-names>D. d. L.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>DeepMind Control Suite</article-title>. <comment>arXiv preprint arXiv:1801.00690</comment> </citation>
</ref>
<ref id="B53">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Todorov</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Erez</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tassa</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>MuJoCo: A Physics Engine for Model-Based Control</article-title>. <conf-name>Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>5026</fpage>&#x2013;<lpage>5033</lpage>. <pub-id pub-id-type="doi">10.1109/iros.2012.6386109</pub-id> </citation>
</ref>
<ref id="B54">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Van Hasselt</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Guez</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Deep Reinforcement Learning with Double Q-Learning</article-title>. <comment>arXiv preprint arXiv:1509.06461</comment> </citation>
</ref>
<ref id="B55">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>van Hoof</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Karl</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>van der Smagt</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Peters</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Stable Reinforcement Learning with Autoencoders for Tactile and Visual Data</article-title>. <conf-name>Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>3928</fpage>&#x2013;<lpage>3934</lpage>. <pub-id pub-id-type="doi">10.1109/iros.2016.7759578</pub-id> </citation>
</ref>
<ref id="B56">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wahlstr&#xf6;m</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Sch&#xf6;n</surname>
<given-names>T. B.</given-names>
</name>
<name>
<surname>Deisenroth</surname>
<given-names>M. P.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>From Pixels to Torques: Policy Learning with Deep Dynamical Models</article-title>. <comment>arXiv preprint arXiv:1502.02251</comment> </citation>
</ref>
<ref id="B57">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Watter</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Springenberg</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Boedecker</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Riedmiller</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images</article-title>,&#x201d; in <source>Advances in Neural Information Processing Systems</source>, <fpage>2746</fpage>&#x2013;<lpage>2754</lpage>. </citation>
</ref>
<ref id="B58">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Whitehead</surname>
<given-names>S. D.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>L.-J.</given-names>
</name>
</person-group> (<year>1995</year>). <article-title>Reinforcement Learning of Non-Markov Decision Processes</article-title>. <source>Artif. Intelligence</source> <volume>73</volume>, <fpage>271</fpage>&#x2013;<lpage>306</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(94)00012-p</pub-id> </citation>
</ref>
<ref id="B59">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Empirical Evaluation of Rectified Activations in Convolutional Network</article-title>. <comment>arXiv preprint arXiv:1505.00853</comment> </citation>
</ref>
<ref id="B60">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yarats</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kostrikov</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Amos</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Pineau</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fergus</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Improving Sample Efficiency in Model-free Reinforcement Learning from Images</article-title>. <comment>CoRR abs/1910.01741</comment> </citation>
</ref>
<ref id="B61">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ziebart</surname>
<given-names>B. D.</given-names>
</name>
<name>
<surname>Maas</surname>
<given-names>A. L.</given-names>
</name>
<name>
<surname>Bagnell</surname>
<given-names>J.&#x20;A.</given-names>
</name>
<name>
<surname>Dey</surname>
<given-names>A. K.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Maximum Entropy Inverse Reinforcement Learning</article-title>. <source>AAAI (Chicago, IL, USA)</source> <volume>8</volume>, <fpage>1433</fpage>&#x2013;<lpage>1438</lpage>. </citation>
</ref>
</ref-list>
</back>
</article>