<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="brief-report" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">625125</article-id>
<article-id pub-id-type="doi">10.3389/frobt.2021.625125</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Brief Research Report</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Off-Policy Evaluation of the Performance of a Robot Swarm: Importance Sampling to Assess Potential Modifications to the Finite-State Machine That Controls the&#x20;Robots</article-title>
<alt-title alt-title-type="left-running-head">Pagnozzi and Birattari</alt-title>
<alt-title alt-title-type="right-running-head">Off-Policy Evaluation of a Robot Swarm</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Pagnozzi</surname>
<given-names>Federico</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/743025/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Birattari</surname>
<given-names>Mauro</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/219071/overview"/>
</contrib>
</contrib-group>
<aff>IRIDIA, Universit&#xe9; libre de Bruxelles, <addr-line>Brussels</addr-line>, <country>Belgium</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/160162/overview">Savvas Loizou</ext-link>, Cyprus University of Technology, Cyprus</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/514710/overview">Alan Gregory Millard</ext-link>, University of Lincoln, United&#x20;Kingdom</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/152648/overview">Heiko Hamann</ext-link>, University of L&#xfc;beck, Germany</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Federico Pagnozzi, <email>federico.pagnozzi@ulb.ac.be</email>; Mauro Birattari, <email>mbiro@ulb.ac.be</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Multi-Robot Systems, a section of the journal Frontiers in Robotics and&#x20;AI</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>29</day>
<month>04</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>8</volume>
<elocation-id>625125</elocation-id>
<history>
<date date-type="received">
<day>02</day>
<month>11</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>02</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Pagnozzi and Birattari.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Pagnozzi and Birattari</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>Due to the decentralized, loosely coupled nature of a swarm and to the lack of a general design methodology, the development of control software for robot swarms is typically an iterative process. Control software is generally modified and refined repeatedly, either manually or automatically, until satisfactory results are obtained. In this paper, we propose a technique based on off-policy evaluation to estimate how the performance of an instance of control software&#x2014;implemented as a probabilistic finite-state machine&#x2014;would be impacted by modifying its structure and the values of its parameters. The proposed technique is particularly appealing when coupled with automatic design methods belonging to the AutoMoDe family, as it can exploit the data generated during the design process. The technique can be used either to reduce the complexity of the generated control software, thereby improving its readability, or to evaluate perturbations of the parameters, which could help in prioritizing the exploration of the neighborhood of the current solution within an iterative improvement algorithm. To evaluate the technique, we apply it to control software generated with an AutoMoDe method, <inline-formula id="inf1">
<mml:math id="minf1">
<mml:mrow>
<mml:mi mathvariant="normal">Chocolate</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi mathvariant="normal">S</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>. In a first experiment, we use the proposed technique to estimate the impact of removing a state from a probabilistic finite-state machine. In a second experiment, we use it to predict the impact of changing the value of the parameters. The results show that the technique is promising and significantly better than a naive estimation. We discuss the limitations of the current implementation of the technique, and we sketch possible improvements, extensions, and generalizations.</p>
</abstract>
<kwd-group>
<kwd>swarm robotics</kwd>
<kwd>control software architecture</kwd>
<kwd>automatic design</kwd>
<kwd>reinforcement learning</kwd>
<kwd>importance sampling</kwd>
</kwd-group>
<contract-num rid="cn001">681872</contract-num>
<contract-sponsor id="cn001">H2020 European Research Council<named-content content-type="fundref-id">10.13039/100010663</named-content>
</contract-sponsor>
<contract-sponsor id="cn002">Fonds De La Recherche Scientifique - FNRS<named-content content-type="fundref-id">10.13039/501100002661</named-content>
</contract-sponsor>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>In this paper, we investigate the use of off-policy evaluation to estimate the performance of a swarm of robots. In swarm robotics (<xref ref-type="bibr" rid="B9">Dorigo et&#x20;al., 2014</xref>), a group of robots acts in coordination to perform a given mission. This engineering discipline is inspired by the principles of swarm intelligence (<xref ref-type="bibr" rid="B10">Dorigo and Birattari, 2007</xref>). The behavior of the swarm is determined by the local interactions of the robots with each other and with the environment. In a robot swarm, there is no single point of failure, and additional robots can be added to the swarm without changing the control software. Unfortunately, these same features make designing the control software of the individual robots in a swarm a complex endeavor. In fact, with the exception of some specific cases (<xref ref-type="bibr" rid="B6">Brambilla et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B26">Reina et&#x20;al., 2015</xref>; <xref ref-type="bibr" rid="B19">Lopes et&#x20;al., 2016</xref>), a general design methodology has yet to be proposed (<xref ref-type="bibr" rid="B11">Francesca and Birattari, 2016</xref>). Typically, the design of the control software of the individual robots in a swarm is an iterative improvement process that is based on trial and error and relies heavily on the experience and intuition of the designer (<xref ref-type="bibr" rid="B13">Francesca et&#x20;al., 2014</xref>). Automatic design has proven to be a valid alternative to manual design (<xref ref-type="bibr" rid="B11">Francesca and Birattari, 2016</xref>; <xref ref-type="bibr" rid="B3">Birattari et&#x20;al., 2019</xref>). Automatic design methods work by formulating the design problem as an optimization problem, which is then solved using generally available heuristic methods. The solution of the optimization problem is an instance of control software, and the solution quality is a measure of its performance. 
In other words, the optimal solution of such an optimization problem is the control software that maximizes an appropriate mission-dependent performance metric. Reviews of the swarm robotics literature can be found in <xref ref-type="bibr" rid="B14">Garattoni and Birattari (2016)</xref> and <xref ref-type="bibr" rid="B7">Brambilla et&#x20;al. (2013)</xref>, while in-depth reviews of automatic design in swarm robotics can be found in <xref ref-type="bibr" rid="B11">Francesca and Birattari (2016)</xref>, <xref ref-type="bibr" rid="B8">Bredeche et&#x20;al. (2018)</xref>, and <xref ref-type="bibr" rid="B4">Birattari et&#x20;al. (2020)</xref>.</p>
<p>In this study, we focus on control software implemented as a probabilistic finite-state machine (PFSM): a graph where each node represents a low-level behavior of the robot and each edge represents a transition from one low-level behavior to another. When the condition associated with a transition is satisfied, the transition is performed and the current state changes. In a probabilistic finite-state machine, each transition whose associated condition is satisfied takes place only with a certain probability. This control software architecture is human readable and modular&#x2014;states and transitions can be defined once and easily reused or changed. Due to these characteristics, finite-state machines have been commonly used in manual design as well as in automatic design methods such as the ones belonging to the AutoMoDe family (<xref ref-type="bibr" rid="B13">Francesca et&#x20;al., 2014</xref>).</p>
<p>In AutoMoDe, the control software is generated by combining pre-existing parametric software modules in a modular architecture, such as a probabilistic finite-state machine or a behavior tree. When considering finite-state machines, the software modules are either state modules or transition modules. The optimization algorithm designs control software by optimizing the structure of the PFSM&#x2014;the number of states and how they are connected to each other&#x2014;the behaviors, the transitions, and their parameters. In <xref ref-type="fig" rid="F1">Figure&#x20;1C</xref>, we show an example of a finite-state machine generated using AutoMoDe.</p>
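To make the architecture concrete, the following is a minimal sketch of a PFSM in which a transition whose condition is satisfied fires with a given probability. This is not the authors' implementation; the class and attribute names are hypothetical.

```python
import random

# Minimal sketch of a probabilistic finite-state machine (PFSM):
# each state has a low-level behavior and a list of outgoing transitions;
# a transition whose condition is satisfied fires with probability `p`.
class Transition:
    def __init__(self, target, condition, p):
        self.target = target        # index of the destination state
        self.condition = condition  # predicate over the sensor readings
        self.p = p                  # firing probability in [0, 1]

class PFSM:
    def __init__(self, behaviors, transitions):
        self.behaviors = behaviors      # behaviors[i]: behavior of state i
        self.transitions = transitions  # transitions[i]: edges leaving state i
        self.current = 0                # start in state 0

    def step(self, sensors, rng=random):
        # At each control step, check the outgoing transitions of the
        # current state; a satisfied transition is taken probabilistically.
        for t in self.transitions[self.current]:
            if t.condition(sensors) and rng.random() < t.p:
                self.current = t.target
                break
        return self.behaviors[self.current]

# Hypothetical two-state machine: explore until an obstacle is sensed,
# then (with probability 0.8 per step) switch to a stop behavior.
fsm = PFSM(["explore", "stop"],
           [[Transition(1, lambda s: s["obstacle"], 0.8)], []])
```

Because states and transitions are plain data, removing a state or perturbing a transition probability, the two modifications studied in the paper, amounts to editing these lists.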
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>In panel <bold>(A)</bold>, we show the workflow of the experiments. During the design process, AutoMoDe produces several finite-state machines&#x2014;such as the one composed of four states shown in panel <bold>(C)</bold>&#x2014;as well as the execution traces of each experiment executed during the design process. The finite-state machine is then modified by a designer&#x2014;either a human or an automatic procedure&#x2014;producing a modified finite-state machine, either by removing a state&#x2014;as in the one shown in panel <bold>(D)</bold>, where the removed state and its transitions are shown in light gray&#x2014;or by modifying the parameters. First, the state values of the original finite-state machine are calculated from the execution logs using the <italic>first-visit</italic> MC method. Importance sampling uses the state values, the modified finite-state machine, and the execution traces to calculate the estimated state values of the modified finite-state machine, which are then used to produce an estimation of its performance. Finally, in panel <bold>(B),</bold> we show a screenshot from the ARGoS simulator where a swarm of 20 robots is performing the foraging task using the control software shown in panel <bold>(C)</bold>.</p>
</caption>
<graphic xlink:href="frobt-08-625125-g001.tif"/>
</fig>
<p>Regardless of the design method and control software architecture, once generated, the control software is improved through an iterative process where changes are evaluated and applied if considered promising. For instance, a human designer applies this process when fine-tuning the configuration of the control software. The optimization algorithms used in automatic design methods, and in particular iterative optimization methods, also work in this way. For instance, iterative optimization methods start from one or multiple solutions and explore the solution space in an iterative fashion. At each iteration, the optimization process generates new solutions by considering modifications of the previous solution(s). Having an estimate of the performance of such modifications can save valuable resources&#x2014;access to appropriate computational resources can be expensive&#x2014;and significantly speed up the design process. Furthermore, in the context of automatic design, such an estimate could be used to reduce the complexity of the generated control software. Indeed, the automatic design process often introduces artifacts in the generated control software, that is, there may be parts of the control software that do not contribute to the performance because they either do not influence the behavior of the robots or are never executed. These artifacts are generally ignored, but they add unnecessary complexity and hinder readability.</p>
<p>Our proposal is to use off-policy evaluation to estimate the impact of a modification from data collected during the execution of the control software. Off-policy evaluation is a technique developed in the context of reinforcement learning (<xref ref-type="bibr" rid="B2">Bertsekas and Tsitsiklis, 1996</xref>; <xref ref-type="bibr" rid="B28">Sutton and Barto, 2018</xref>) to estimate the performance of a <italic>target</italic> policy from observations of the performance of a <italic>behavior</italic> policy. In a policy, the world is represented as a set of states&#x2014;possible configurations of the environment, including the robot itself&#x2014;connected by actions&#x2014;the possible interactions with the world. Each state is associated with a set of possible actions that can be taken to transition from one state to another. The target policy may be deterministic&#x2014;given a state, it always executes the same action&#x2014;or stochastic&#x2014;the action to be performed is chosen with a certain probability; the behavior policy, instead, must be stochastic (<xref ref-type="bibr" rid="B28">Sutton and Barto, 2018</xref>).</p>
<p>Almost all off-policy methods use a technique called importance sampling (<xref ref-type="bibr" rid="B15">Hammersley and Handscomb, 1964</xref>; <xref ref-type="bibr" rid="B27">Rubinstein and Kroese, 1981</xref>; <xref ref-type="bibr" rid="B28">Sutton and Barto, 2018</xref>). This technique computes a statistic of one distribution from a sample drawn from another. In this way, data generated by one policy can be used&#x2014;after being appropriately weighted through importance sampling&#x2014;to evaluate a different policy. For instance, one may execute a policy that performs actions randomly and use the observed performance to estimate the performance of a deterministic policy that always chooses the action with the highest expected reward. Given a set of actions executed by the behavior policy, the technique estimates the performance of the target policy by weighting the performance observed under the behavior policy with the ratio between the probability of executing each action under the target policy and the probability of executing it under the behavior policy. As a consequence, the behavior policy must contain all the states and actions of the target policy. Moreover, under the behavior policy, the probability of executing each action in each state must be strictly positive.</p>
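The idea can be illustrated with a minimal sketch: single-step episodes are generated by a uniform behavior policy, and their rewards are reweighted by the ratio of action probabilities to estimate the expected reward of a deterministic target policy. All names and numbers below are illustrative, not taken from the paper.

```python
import random

def is_estimate(episodes, pi, b):
    """Estimate the expected reward of target policy pi from single-step
    episodes generated by behavior policy b. Each episode is a
    (state, action, reward) triple; pi(a, s) and b(a, s) return the
    probability of taking action a in state s. Coverage is required:
    b(a, s) > 0 wherever pi(a, s) > 0."""
    total = 0.0
    for s, a, r in episodes:
        rho = pi(a, s) / b(a, s)  # importance sampling ratio for this step
        total += rho * r
    return total / len(episodes)

# Hypothetical example: the behavior policy picks actions 0/1 uniformly,
# while the target policy always picks action 1. The reward equals the
# chosen action, so the target policy's true expected reward is 1.0.
b = lambda a, s: 0.5
pi = lambda a, s: 1.0 if a == 1 else 0.0

rng = random.Random(42)
episodes = []
for _ in range(10000):
    a = rng.randrange(2)
    episodes.append((0, a, float(a)))  # reward equals the chosen action
```

Episodes where the behavior policy took the action the target policy would never take receive weight zero; the others are upweighted by a factor of two, so the estimate converges to the target policy's expected reward.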
<p>Off-policy methods based on importance sampling have been studied in reinforcement learning for a long time (<xref ref-type="bibr" rid="B15">Hammersley and Handscomb, 1964</xref>; <xref ref-type="bibr" rid="B23">Powell and Swann, 1966</xref>; <xref ref-type="bibr" rid="B27">Rubinstein and Kroese, 1981</xref>; <xref ref-type="bibr" rid="B28">Sutton and Barto, 2018</xref>). Recent works focused on combining importance sampling with temporal difference learning and approximation methods (<xref ref-type="bibr" rid="B25">Precup et&#x20;al., 2000</xref>, <xref ref-type="bibr" rid="B24">Precup et&#x20;al., 2001</xref>), as well as reducing the variance of the estimation (<xref ref-type="bibr" rid="B17">Jiang and Li, 2016</xref>) and improving the bias-variance trade-off (<xref ref-type="bibr" rid="B29">Thomas and Brunskill, 2016</xref>).</p>
<p>The control software of a robot is indeed the implementation of a policy. In fact, a robot uses its sensors to gather information about the world and executes actions by properly operating its actuators and motors. Depending on the control software architecture, the state might or might not be explicitly reconstructed, and the set of actions available in each state might be defined in a more or less explicit way. In the case of probabilistic finite-state machines, the similarity of this control software architecture to policies makes it ideal for this study. By considering additional information from the sensors of the robot, the states and transitions of a PFSM are directly translated into states and actions of a policy. In the technique we propose in this paper, the relevant data are the execution traces&#x2014;that is, the sequence of internal states traversed by the controller and the sequence of sensor readings&#x2014;collected from each robot in the swarm during multiple experimental&#x20;runs.</p>
<p>To evaluate the technique we propose, we use control software generated with a variant of Chocolate (<xref ref-type="bibr" rid="B12">Francesca et&#x20;al., 2015</xref>) that we modified to allow the generation of more complex finite-state machines composed of up to six states. In order to avoid confusion, we call this variant <inline-formula id="inf2">
<mml:math id="minf2">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi>h</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>. A further advantage of using AutoMoDe is that collecting the execution data needed for the estimation does not require additional experimental runs, because the technique we propose can operate on the data produced within the design process. In the experiments, we consider two kinds of modifications: one concerning the structure of the control software and one concerning its parameters. In the first, we estimate the performance after removing one of the states of the control software. In the second, we evaluate the impact of modifying the values of two parameters of the two most executed transitions. In both experiments, we compare the estimation provided by the proposed technique with a naive estimation made under the assumption that the applied modification would not change the performance. The results show that the proposed technique is better than the naive estimation and that the difference between the two is statistically significant.</p>
<p>The structure of the paper is the following. In <xref ref-type="sec" rid="s2">Section 2</xref>, we present off-policy evaluation and how it can be applied to finite-state machines. The experiments, their setup, and their results are presented in <xref ref-type="sec" rid="s3">Section 3</xref>. Finally, in <xref ref-type="sec" rid="s4">Section 4</xref>, we discuss the limitations of the proposed technique, possible ways to improve the estimation, and how the technique can be extended to other control software architectures.</p>
</sec>
<sec id="s2">
<title>2 Method</title>
<sec id="s2-1">
<title>2.1 Background: Off-Policy Evaluation</title>
<p>The main challenge in reinforcement learning is estimating how desirable it is for an agent to be in a state. The value <inline-formula id="inf3">
<mml:math id="minf3">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> of a state <italic>s</italic> under a policy &#x3c0; is the expected return obtainable by starting in state <italic>s</italic> and then following the policy &#x3c0;. The return is calculated over an episode <italic>e</italic>, defined as an entire interaction between an agent and the environment, from an initial to a final step. The return obtained after step <italic>t</italic> can be calculated as<disp-formula id="e1">
<mml:math id="me1">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2250;</mml:mo>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>
</p>
<p>In <xref ref-type="disp-formula" rid="e1">Eq. 1</xref>, <inline-formula id="inf4">
<mml:math id="minf4">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the reward obtained at step <inline-formula id="inf5">
<mml:math id="minf5">
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf6">
<mml:math id="minf6">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mn>0,1</mml:mn>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is a parameter called the discount rate, which allows us to model the fact that rewards obtained later in the episode contribute less and less to the return.</p>
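Unrolling Eq. 1 gives the familiar discounted sum of rewards; the following sketch (with a hypothetical reward sequence) computes it backward from the end of the episode.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * G_{t+1} (Eq. 1), computed backward over
    the rewards observed from step t+1 to the end of the episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode with rewards 1, 0, 2 and gamma = 0.5:
# G = 1 + 0.5 * (0 + 0.5 * 2) = 1.5
```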
<p>Using Monte Carlo (MC) methods, the value function can be estimated from a sample of episodes. In these methods, the value <inline-formula id="inf7">
<mml:math id="minf7">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> of a state <italic>s</italic> is calculated as the average of the returns observed following visits to state <italic>s</italic>. In the <italic>first-visit</italic> MC method, only the return following the first visit in each episode is considered, while in the <italic>every-visit</italic> MC method all visits contribute to the average. As the two methods are similar and both converge to <inline-formula id="inf8">
<mml:math id="minf8">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> when the number of visits tends to infinity, in this work we use the <italic>first-visit</italic> MC method to simplify calculations.</p>
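A minimal sketch of the first-visit MC estimate follows, assuming for illustration a discount rate of 1 and episodes given as sequences of state/reward pairs; the function name and data layout are hypothetical, not the authors' implementation.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate v(s) as the average, over episodes, of the return observed
    after the first visit to s in each episode. An episode is a list of
    (state, reward) pairs, reward being the one received after that step."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return from every step, working backward (Eq. 1).
        g = 0.0
        gains = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][1] + gamma * g
            gains[t] = g
        # Record the return only at the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(gains[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

Switching to the every-visit variant would only require dropping the `seen` bookkeeping and appending the return at every occurrence of each state.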
<p>Off-policy evaluation estimates the value function of a policy &#x3c0; from episodes generated by another policy <italic>b</italic>. To be able to use episodes from <italic>b</italic> to estimate values for &#x3c0;, <italic>b</italic> must <italic>cover</italic> &#x3c0;, that is, it must be possible&#x2014;under <italic>b</italic>&#x2014;to take every action that can be taken under &#x3c0;. Formally, if <inline-formula id="inf9">
<mml:math id="minf9">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3e;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> then <inline-formula id="inf10">
<mml:math id="minf10">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3e;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf11">
<mml:math id="minf11">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> indicates the probability under policy &#x3c0; of taking action <italic>a</italic> when in state <italic>s</italic>. Assuming that <italic>b</italic> covers &#x3c0;, importance sampling can be used to weight the returns of <italic>b</italic> considering that&#x2014;between the policies&#x2014;each action may be taken with a different probability. Defining <inline-formula id="inf12">
<mml:math id="minf12">
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> as the sequence of states and actions <inline-formula id="inf13">
<mml:math id="minf13">
<mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>&#x2014;where <inline-formula id="inf14">
<mml:math id="minf14">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the first visit to state <italic>s</italic>, <inline-formula id="inf15">
<mml:math id="minf15">
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the action taken when in <inline-formula id="inf16">
<mml:math id="minf16">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf17">
<mml:math id="minf17">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the final state&#x2014;the ratio between the different probabilities&#x2014;called importance sampling ratio&#x2014;can be expressed as<disp-formula id="e2">
<mml:math id="me2">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover>
<mml:mo>&#x220f;</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>
</p>
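Following Eq. 2, the ratio for a trajectory is the product of the per-step probability ratios. Below is a sketch with hypothetical tabular policies; the dictionaries mapping state-action pairs to probabilities are illustrative only.

```python
def importance_ratio(trajectory, pi, b):
    """Compute rho for a trajectory [(S_t, A_t), ..., (S_{T-1}, A_{T-1})]
    as the product over steps of pi(A_k | S_k) / b(A_k | S_k), as in Eq. 2.
    Coverage is required: b must assign positive probability to every
    state-action pair that appears in the trajectory."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi[(s, a)] / b[(s, a)]
    return rho

# Hypothetical tabular policies over a two-step trajectory:
pi = {("s0", "go"): 0.9, ("s1", "stop"): 1.0}
b = {("s0", "go"): 0.5, ("s1", "stop"): 0.5}
# rho = (0.9 / 0.5) * (1.0 / 0.5) = 3.6
```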
<p>In <xref ref-type="disp-formula" rid="e2">Eq. 2</xref>, <inline-formula id="inf18">
<mml:math id="minf18">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf19">
<mml:math id="minf19">
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> indicate the probability of taking action <inline-formula id="inf20">
<mml:math id="minf20">
<mml:mrow>
<mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> when in state <inline-formula id="inf21">
<mml:math id="minf21">
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> under the target policy and the behavior policy, respectively. Given a set of episodes <italic>E</italic> generated with policy <italic>b</italic>, there are two main ways of using <inline-formula id="inf22">
<mml:math id="minf22">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> to estimate <inline-formula id="inf23">
<mml:math id="minf23">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>: <italic>ordinary importance sampling</italic> and <italic>weighted importance sampling</italic> (WIS). <italic>Ordinary importance sampling</italic> is defined as<disp-formula id="e3">
<mml:math id="me3">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>&#x0020;</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>&#x7c;</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>
</p>
<p>Weighted importance sampling instead is defined as<disp-formula id="e4">
<mml:math id="me4">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>&#x0020;</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>e</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi>&#x0020;</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
<p>The main difference between <xref ref-type="disp-formula" rid="e3">Eqs. 3</xref>, <xref ref-type="disp-formula" rid="e4">4</xref> is that, in the latter equation, <inline-formula id="inf24">
<mml:math id="minf24">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> appears in the denominator. Both <italic>ordinary importance sampling</italic> and <italic>weighted importance sampling</italic> converge to the exact state value as the number of episodes grows, but the two methods offer different bias and variance trade-offs: <italic>ordinary importance sampling</italic> is unbiased but has a very high variance, whereas <italic>weighted importance sampling</italic> introduces some bias in exchange for a far lower variance (<xref ref-type="bibr" rid="B28">Sutton and Barto, 2018</xref>). We selected WIS for our implementation because, in preliminary experiments, the variance of the estimation was found to affect the results significantly.</p>
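As an illustration, Eqs. 3, 4 can be sketched in Python. Here each episode is reduced to a pair (ρ, G) of importance-sampling ratio and return; this compact representation is an assumption made for brevity, not the format used in our implementation:

```python
def importance_sampling_estimates(episodes):
    """Estimate v_pi(s) from episodes generated under the behavior policy b.

    `episodes` is a list of (rho, G) pairs, where rho is the importance-
    sampling ratio of the episode from the first visit to s and G is the
    return observed from that visit.
    Returns the ordinary (Eq. 3) and weighted (Eq. 4) estimates.
    """
    weighted_returns = sum(rho * g for rho, g in episodes)
    ordinary = weighted_returns / len(episodes)  # Eq. 3: divide by |E|
    ratio_sum = sum(rho for rho, _ in episodes)
    # Eq. 4: divide by the sum of the ratios instead of |E|
    weighted = weighted_returns / ratio_sum if ratio_sum else 0.0
    return ordinary, weighted
```

Note how a single episode with a large ratio can make the ordinary estimate leave the range of the observed returns, while the weighted estimate cannot; this is the lower-variance behavior that motivated our choice of WIS.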
</sec>
<sec id="s2-2">
<title>2.2 Our Technique: Applying Off-Policy Evaluation to Finite-State Machines</title>
<p>Applying off-policy evaluation to finite-state machines requires establishing what are states and actions in a PFSM, what is an episode and how to calculate the final reward, <inline-formula id="inf25">
<mml:math id="minf25">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, from the score attained by the whole swarm at the end of an experimental run. In finite-state machines, each state represents a behavior that is executed until an event triggers a transition to another state. Because a robot does not change its behavior until a transition is triggered, our working hypothesis is that a state of a PFSM, together with the information from the robot sensors saved in the execution traces, can be treated as a single state of a policy; consequently, transitions play the same role as actions in a policy. Under these assumptions, for a PFSM &#x3c0;, <inline-formula id="inf26">
<mml:math id="minf26">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> can be defined as the probability that transition <italic>a</italic> is triggered when in state&#x20;<italic>s</italic>.</p>
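As a sketch, π(a|s) for a PFSM and the resulting per-episode importance-sampling ratio can be computed from a logged trace as follows. The trace representation and the `target_prob`/`behavior_prob` callables are hypothetical illustrations, not the actual interfaces of our implementation:

```python
def black_floor_prob(alpha, ground_color):
    # "Black-floor" condition of Table 1: the transition fires with
    # probability alpha when the floor is black, never otherwise.
    return alpha if ground_color == "black" else 0.0

def episode_ratio(trace, target_prob, behavior_prob):
    """Product over time steps of pi(a_t | s_t) / b(a_t | s_t).

    `trace` is a list of (state, transition, sensors) tuples; the two
    callables return the probability that the logged transition fires
    under the target and behavior PFSM, respectively.
    """
    ratio = 1.0
    for state, transition, sensors in trace:
        b = behavior_prob(state, transition, sensors)
        if b == 0.0:
            raise ValueError("logged transition impossible under behavior PFSM")
        ratio *= target_prob(state, transition, sensors) / b
    return ratio
```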
<p>Naturally, an episode should correspond to an experimental run, with the reward being the final score; in swarm robotics, however, there are several robots, typically all running the same control software. For this reason, we divide an experimental run involving a swarm of <italic>n</italic> robots into a set of <italic>n</italic> parallel episodes. Accordingly, if <italic>F</italic> is the final score of the experimental run, the reward awarded to each robot corresponds to <inline-formula id="inf27">
<mml:math id="minf27">
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. In calculating <inline-formula id="inf28">
<mml:math id="minf28">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> using <xref ref-type="disp-formula" rid="e1">Eq. 1</xref>, because assigning a per-time-step reward is not always possible, we set <inline-formula id="inf29">
<mml:math id="minf29">
<mml:mrow>
<mml:msub>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>&#x2014;that is, the reward is given only at the end of the episode&#x2014;and we do not consider any discount&#x2014;that is, <inline-formula id="inf30">
<mml:math id="minf30">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>&#x2014;resulting in <inline-formula id="inf31">
<mml:math id="minf31">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. Using the <italic>first-visit</italic> method, a state will get a reward of <inline-formula id="inf32">
<mml:math id="minf32">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> if it is executed at least once during the episode. As a consequence, a state that is executed only once (for instance, the initial state) receives the same reward as a state that is executed for almost the entire episode. We therefore also considered another way of calculating <inline-formula id="inf33">
<mml:math id="minf33">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> that consists in weighting the reward by the relative execution time of each state. This proportional <inline-formula id="inf34">
<mml:math id="minf34">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is calculated per state so that a state <italic>s</italic> that has been executed for <italic>k</italic> time steps gets a reward <inline-formula id="inf35">
<mml:math id="minf35">
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mtext>steps</mml:mtext>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> where <inline-formula id="inf36">
<mml:math id="minf36">
<mml:mrow>
<mml:mtext>steps</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> is the total number of time steps in the episode. Pseudo-code for estimating the state values is given in <xref ref-type="other" rid="alg1">Algorithm&#x20;1</xref>.</p>
<p>
<statement content-type="algorithm" id="alg1">
<label>
<bold>Algorithm 1</bold> </label>
<p>Pseudo-code showing how the state values are calculated. The inputs are the finite-state machine and the execution traces generated during the design process. Each execution trace contains the recording of all the robots in the swarm, as well as the final score awarded to the swarm at the end of the experiment. The procedure iterates over the execution traces and, within each trace, considers each robot separately. For each robot, the value of each state is calculated using the <italic>first-visit</italic> MC method, considering two ways of computing the reward. In the first, every state receives the same reward: the per-robot reward. In the second, each state receives a fraction of the per-robot reward proportional to the total time the state was active.<list list-type="simple">
<list-item>
<p>&#x2009;<bold>Input</bold> A PFSM composed of <italic>S</italic> states and <italic>T</italic> transitions.</p>
</list-item>
<list-item>
<p>&#x2009;<bold>Input</bold> A set of execution traces</p>
</list-item>
<list-item>
<p>&#x2009;<bold>Output</bold> <inline-formula id="inf65">
<mml:math id="minf65">
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> state values using the full reward</p>
</list-item>
<list-item>
<p>&#x2009;<bold>Output</bold> <inline-formula id="inf66">
<mml:math id="minf66">
<mml:mrow>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>p</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> state values using the proportional reward</p>
</list-item>
<list-item>
<p>&#x2009;<bold>for</bold> each <inline-formula id="inf67">
<mml:math id="minf67">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>&#x20;<bold>do</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>&#x2009;<bold>Initialize</bold> <inline-formula id="inf68">
<mml:math id="minf68">
<mml:mrow>
<mml:mtext>Rewards</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> as an empty&#x20;list</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>&#x2009;<bold>Initialize</bold> <inline-formula id="inf69">
<mml:math id="minf69">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext>Rewards</mml:mtext>
</mml:mrow>
<mml:mi>p</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> as an empty&#x20;list</p>
</list-item>
<list-item>
<p>&#x2009;<bold>loop</bold> over each execution&#x20;trace</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>&#x2009;<inline-formula id="inf70">
<mml:math id="minf70">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x3d;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> final score of the&#x20;swarm</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>&#x2009;<inline-formula id="inf71">
<mml:math id="minf71">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x3d;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> number of robots in the&#x20;swarm</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>&#x2009;<inline-formula id="inf72">
<mml:math id="minf72">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> reward per&#x20;robot</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>&#x2009;<inline-formula id="inf73">
<mml:math id="minf73">
<mml:mrow>
<mml:mtext>steps</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> number of time steps in the execution&#x20;trace</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;</bold>&#x2009;<bold>for</bold> each robot in the swarm&#x20;<bold>do</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;</bold>&#x2009;<bold>for</bold> each state <inline-formula id="inf74">
<mml:math id="minf74">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>&#x20;<bold>do</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;&#x2003;</bold>&#x2009;<bold>if</bold> <italic>s</italic> has been executed in the episode&#x20;<bold>then</bold>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;&#x2003;&#x2003;</bold>&#x2009;<inline-formula id="inf75">
<mml:math id="minf75">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> number of time steps <italic>s</italic> has been executed</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;&#x2003;&#x2003;</bold>&#x2009;Append <inline-formula id="inf76">
<mml:math id="minf76">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> to <inline-formula id="inf77">
<mml:math id="minf77">
<mml:mrow>
<mml:mtext>Rewards</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
</list-item>
<list-item>
<p>
<bold>&#x2003;&#x2003;&#x2003;&#x2003;</bold>&#x2009;Append <inline-formula id="inf78">
<mml:math id="minf78">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo>&#x22c5;</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mtext>steps</mml:mtext>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> to <inline-formula id="inf79">
<mml:math id="minf79">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext>Rewards</mml:mtext>
</mml:mrow>
<mml:mi>p</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
</list-item>
<list-item>
<p>&#x2009;<inline-formula id="inf80">
<mml:math id="minf80">
<mml:mrow>
<mml:mi>V</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>average</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mtext>Rewards</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
</list-item>
<list-item>
<p>&#x2009;<inline-formula id="inf81">
<mml:math id="minf81">
<mml:mrow>
<mml:msub>
<mml:mi>V</mml:mi>
<mml:mi>p</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>average</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext>Rewards</mml:mtext>
</mml:mrow>
<mml:mi>p</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</p>
</list-item>
</list>
</p>
</statement>
</p>
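Algorithm 1 can be sketched in Python as follows. The trace structure (fields `F`, `n`, `steps`, and a per-robot mapping from each visited state to the number of time steps it was executed) is an illustrative simplification, not our actual trace format:

```python
def state_values(states, traces):
    """First-visit MC state values with full and proportional rewards."""
    rewards = {s: [] for s in states}    # full reward G_T = F / n
    rewards_p = {s: [] for s in states}  # proportional reward G_T * k / steps
    for trace in traces:
        g_t = trace["F"] / trace["n"]    # reward per robot
        for robot in trace["robots"]:    # robot: state -> steps executed (k)
            # first-visit: a state contributes once per robot per trace
            for s, k in robot.items():
                rewards[s].append(g_t)
                rewards_p[s].append(g_t * k / trace["steps"])
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return ({s: mean(rewards[s]) for s in states},
            {s: mean(rewards_p[s]) for s in states})
```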
<p>Given these definitions, we can use <italic>weighted importance sampling</italic> to estimate the state values of a <italic>target</italic> finite-state machine from the execution traces of a <italic>behavior</italic> finite-state machine, provided that the states of the behavior finite-state machine are a superset of those of the target one and are connected by transitions in the same way. To calculate a performance estimation for the whole swarm when executing the target control software, <inline-formula id="inf37">
<mml:math id="minf37">
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> has to be derived from the state values estimated using <italic>weighted importance sampling</italic>. Let <inline-formula id="inf38">
<mml:math id="minf38">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, with <inline-formula id="inf39">
<mml:math id="minf39">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, be the state values of the behavior finite-state machine, and <inline-formula id="inf40">
<mml:math id="minf40">
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, with <inline-formula id="inf41">
<mml:math id="minf41">
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>J</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf42">
<mml:math id="minf42">
<mml:mrow>
<mml:mi>J</mml:mi>
<mml:mo>&#x2286;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, be the state values of the target finite-state machine estimated using <italic>weighted importance sampling</italic>. When considering the reward calculated as <inline-formula id="inf43">
<mml:math id="minf43">
<mml:mrow>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>/</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, the performance <inline-formula id="inf44">
<mml:math id="minf44">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> of the target finite-state machine is<disp-formula id="e5">
<mml:math id="me5">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>J</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>b</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>J</mml:mi>
<mml:mo>&#x7c;</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x22c5;</mml:mo>
<mml:msub>
<mml:mi>G</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
<p>In <xref ref-type="disp-formula" rid="e5">Eq. 5</xref>, the estimation is calculated as the per-robot reward multiplied by the average ratio between the estimated state values of the target finite-state machine and those of the behavior finite-state machine. The proportional reward already accounts for the relative contribution of each state, so the calculation of <inline-formula id="inf45">
<mml:math id="minf45">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is defined as follows:<disp-formula id="e6">
<mml:math id="me6">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mo>&#x2211;</mml:mo>
<mml:mi>J</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>G</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mstyle>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
<p>
<inline-formula id="inf46">
<mml:math id="minf46">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represents the estimated average performance of a single robot in the swarm. The estimated performance of the whole swarm, under the assumption that all robots contribute equally, is calculated as <inline-formula id="inf47">
<mml:math id="minf47">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mi>e</mml:mi>
</mml:msub>
<mml:mo>&#x22c5;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
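Under these definitions, Eq. 5 followed by the final swarm-level estimate can be sketched as follows (function and variable names are ours, chosen for illustration):

```python
def swarm_performance_estimate(v_target, v_behavior, g_t, n):
    """Eq. 5 followed by the F_e * n swarm estimate.

    v_target: WIS state values of the target FSM (keyed by its states J),
    v_behavior: state values of the behavior FSM (J is a subset of its keys),
    g_t: per-robot reward F / n, n: number of robots in the swarm.
    """
    ratios = [v_target[j] / v_behavior[j] for j in v_target]
    f_e = sum(ratios) / len(ratios) * g_t  # Eq. 5: per-robot estimate
    return f_e * n                         # estimate for the whole swarm
```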
<p>The experiments presented in this paper are based on finite-state machines generated with <inline-formula id="inf48">
<mml:math id="minf48">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, which builds PFSMs composed of at most six states, with up to four transitions per state. Each state can assume one of six behaviors and each transition one of six conditions. The key characteristics of <inline-formula id="inf49">
<mml:math id="minf49">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, as well as a brief description of the behaviors and the conditions, are given in <xref ref-type="table" rid="T1">Table&#x20;1</xref>. To produce execution traces for each experimental run, we modified AutoMoDe so that the control software of each robot logs an execution trace containing, for each time step, the current state, the active transition(s), and the information needed to calculate the activation probability of each condition, namely the ground color and the number of neighboring robots perceived.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Description of <inline-formula id="inf50">
<mml:math id="minf50">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. This method targets the e-puck robot, specifically the reference model RM1.1, whose key features we report at the end of the table. The finite-state machines are generated by choosing from six behaviors, for which we provide a brief description, and six conditions, for which we report how the activation probability <italic>p</italic> is calculated, with &#x3b1; and &#x3b2; being parameters. <inline-formula id="inf51">
<mml:math id="minf51">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> can generate finite-state machines with up to six states, whereas Chocolate allows at most four; in both cases, each state can have at most four transitions. The optimization algorithm is iterated F-Race, which uses the ARGoS simulator to perform evaluations.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th colspan="3" align="center">
<inline-formula id="inf52">
<mml:math id="minf52">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="3" align="left">
<bold>Modules</bold>
</td>
</tr>
<tr>
<td rowspan="6" align="left">Behaviors</td>
<td align="left">Exploration</td>
<td align="left">Move randomly</td>
</tr>
<tr>
<td align="left">Stop</td>
<td align="left">Stop moving</td>
</tr>
<tr>
<td align="left">Phototaxis</td>
<td align="left">Move toward the light</td>
</tr>
<tr>
<td align="left">Anti-phototaxis</td>
<td align="left">Move away from the light</td>
</tr>
<tr>
<td align="left">Attraction</td>
<td align="left">Move toward other robots</td>
</tr>
<tr>
<td align="left">Repulsion</td>
<td align="left">Move away from other robots</td>
</tr>
<tr>
<td rowspan="6" align="left">Conditions</td>
<td align="left">Black-floor</td>
<td align="left">If the floor is black, <inline-formula id="inf53">
<mml:math id="minf53">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>; 0 otherwise</td>
</tr>
<tr>
<td align="left">White-floor</td>
<td align="left">If the floor is white, <inline-formula id="inf54">
<mml:math id="minf54">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>; 0 otherwise</td>
</tr>
<tr>
<td align="left">Gray-floor</td>
<td align="left">If the floor is gray, <inline-formula id="inf55">
<mml:math id="minf55">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>; 0 otherwise</td>
</tr>
<tr>
<td align="left">Neighbor-count</td>
<td align="left">With <italic>n</italic> neighbors <inline-formula id="inf56">
<mml:math id="minf56">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">Inverted-neighbor-count</td>
<td align="left">With <italic>n</italic> neighbors <inline-formula id="inf57">
<mml:math id="minf57">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">Fixed-probability</td>
<td align="left">
<inline-formula id="inf58">
<mml:math id="minf58">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">
<bold>Constraints</bold>
</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Number of states</td>
<td align="left">6 (max)</td>
<td align="left"/>
</tr>
<tr>
<td align="left">Transitions per state</td>
<td align="left">4 (max)</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<bold>Tools</bold>
</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">Optimization algorithm</td>
<td colspan="2" align="left">iterated F-Race implemented in the irace package (<xref ref-type="bibr" rid="B1">Balaprakash et&#x20;al., 2007</xref>; <xref ref-type="bibr" rid="B20">L&#xf3;pez-Ib&#xe1;&#xf1;ez et&#x20;al., 2016</xref>)</td>
</tr>
<tr>
<td align="left">Simulator</td>
<td colspan="2" align="left">ARGoS3 (<xref ref-type="bibr" rid="B22">Pinciroli et&#x20;al., 2012</xref>)</td>
</tr>
<tr>
<td align="left">
<bold>Robot platform</bold>
</td>
<td colspan="2" align="left">e-puck reference model RM1.1 (<xref ref-type="bibr" rid="B16">Hasselmann et&#x20;al. (2020)</xref>)</td>
</tr>
<tr>
<td rowspan="5" align="left">Input</td>
<td align="left">8 proximity sensors</td>
<td align="left"/>
</tr>
<tr>
<td align="left">8 light sensors</td>
<td align="left"/>
</tr>
<tr>
<td align="left">3 ground sensors</td>
<td align="left"/>
</tr>
<tr>
<td align="left">Number of neighboring robots perceived</td>
<td align="left"/>
</tr>
<tr>
<td align="left">Attraction vector for each perceived robot</td>
<td align="left"/>
</tr>
<tr>
<td align="left">Output</td>
<td align="left">Left and right wheel target linear velocity</td>
<td align="left"/>
</tr>
</tbody>
</table>
</table-wrap>
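For concreteness, the two neighbor-based conditions of Table 1 can be sketched as:

```python
import math

def neighbor_count(alpha, beta, n):
    # Neighbor-count condition: P = 1 / (1 + exp(beta * (alpha - n))).
    return 1.0 / (1.0 + math.exp(beta * (alpha - n)))

def inverted_neighbor_count(alpha, beta, n):
    # Inverted-neighbor-count condition: the complement of the above.
    return 1.0 - neighbor_count(alpha, beta, n)
```

With &#x3b2; &gt; 0, the activation probability of Neighbor-count grows with the number of perceived neighbors <italic>n</italic> and crosses 0.5 at <italic>n</italic> = &#x3b1;.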
</sec>
</sec>
<sec id="s3">
<title>3 Results</title>
<p>In the experiments presented here, the execution traces are collected during the generation of the control software from the executions performed by the optimization algorithm used in <inline-formula id="inf59">
<mml:math id="minf59">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, iterated F-Race implemented in the irace package (<xref ref-type="bibr" rid="B5">Birattari, 2009</xref>; <xref ref-type="bibr" rid="B20">L&#xf3;pez-Ib&#xe1;&#xf1;ez et&#x20;al., 2016</xref>). In a nutshell, I/F-race works in an iterated fashion by generating, testing, and discarding solutions, that is, finite-state machines. In each iteration, I/F-race keeps a set of solutions that are executed and compared with each other; a solution is discarded when a statistical test shows it to be worse than the others. The algorithm ends when the maximum number of executions is reached, returning the set of surviving solutions. <xref ref-type="fig" rid="F1">Figure&#x20;1A</xref> illustrates how a finite-state machine is generated and modified, as well as how the performance of the modified control software is estimated.</p>
<p>We applied off-policy evaluation to 20&#x20;finite-state machines generated with <inline-formula id="inf60">
<mml:math id="minf60">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to perform a foraging mission as defined by <xref ref-type="bibr" rid="B13">Francesca et&#x20;al. (2014)</xref>. In this mission, a swarm of 20 robots, confined in a dodecagonal arena, must retrieve as many objects as possible from two sources and transport them to the nest. A screenshot of an experimental run is shown in <xref ref-type="fig" rid="F1">Figure&#x20;1B</xref>. As the e-puck robot is not able to manipulate objects, the interaction with objects is abstracted: a robot collects an object by entering a source and deposits it by entering the nest. The sources are two black circles roughly in the middle of the arena, while the nest is a white area placed at the edge of the arena. Additionally, a light source is placed behind the nest. The performance metric is the number of objects retrieved. A video of an experimental run showcasing the mission, as well as the source code and the experimental data, are available on the supplementary page (Supplementary Material).</p>
<p>All the experiments were conducted in a simulated environment using ARGoS3 (<xref ref-type="bibr" rid="B22">Pinciroli et&#x20;al., 2012</xref>). We executed <inline-formula id="inf61">
<mml:math id="minf61">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> ten times with a budget of 10,000 evaluations, generating 93&#x20;finite-state machines. From this group, we removed the finite-state machines that had fewer than three active states according to the execution traces. From the remaining ones, we formed the final group of 20&#x20;finite-state machines by selecting those with the greatest number of active states. During the execution of <inline-formula id="inf62">
<mml:math id="minf62">
<mml:mrow>
<mml:mtext>Chocolate</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>6</mml:mn>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, these finite-state machines were executed between seven and ten times. Considering that each experimental run involves twenty robots, we collected from 140 to 200 episodes for each finite-state machine.</p>
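These logged episodes can be reused off-policy with textbook weighted importance sampling. A minimal sketch follows; the trace representation and the transition-probability callables are illustrative assumptions, not the actual module code used in the experiments:

```python
def wis_estimate(episodes, p_target, p_behavior):
    """Weighted importance sampling (WIS) estimate of the performance a
    modified finite-state machine would obtain, computed from episodes
    logged under the original (behavior) finite-state machine.

    episodes:   list of (trace, performance) pairs; a trace is a sequence
                of (state, next_state) transitions from the execution log.
    p_target:   probability of a transition under the modified FSM.
    p_behavior: probability of the same transition under the original FSM.
    """
    weights, returns = [], []
    for trace, performance in episodes:
        w = 1.0  # importance weight: product of per-transition ratios
        for state, next_state in trace:
            pb = p_behavior(state, next_state)
            if pb == 0.0:  # transition never taken by the behavior FSM
                w = 0.0
                break
            w *= p_target(state, next_state) / pb
        weights.append(w)
        returns.append(performance)
    total = sum(weights)
    if total == 0.0:  # no overlap between the two FSMs: estimate is 0
        return 0.0
    # Weighted average of the logged performances.
    return sum(w * g for w, g in zip(weights, returns)) / total
```

When the target and behavior finite-state machines coincide, every weight is 1 and the estimate reduces to the plain average of the logged performances; when no logged transition is possible under the target, the estimate collapses to 0, which is the degenerate case discussed in the results below.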
<p>We devised two experiments: one, called &#x201c;state pruning,&#x201d; in which we consider changes to the structure of the finite-state machines, and one, called &#x201c;parameter variation,&#x201d; in which we consider variations in their configuration. In state pruning, for each finite-state machine, we use the proposed technique to estimate the performance of all the finite-state machines that can be generated by removing one state&#x2014;for instance, three finite-state machines composed of two states can be generated from one composed of three states. <xref ref-type="fig" rid="F1">Figure&#x20;1D</xref> shows a finite-state machine composed of three states generated by removing state S2 from the finite-state machine in <xref ref-type="fig" rid="F1">Figure&#x20;1C</xref>. In parameter variation, for each finite-state machine, we generate four variations by changing two parameters of the two most active transitions. For each parameter, we considered two different values, generating four combinations per finite-state machine. For instance, we can generate four finite-state machines by considering two new values for the &#x3b1; parameters of the two transitions connecting the states S0 and S3 in the finite-state machine in <xref ref-type="fig" rid="F1">Figure&#x20;1C</xref>. From these 80 combinations, we discarded those that did not significantly change the performance of the control software after five experimental runs, resulting in a total of 20 parameter variations. In both experiments, we compared the performance estimates calculated with weighted importance sampling&#x2014;both with and without the proportional reward&#x2014;with a naive estimate implemented as the average performance of the unmodified finite-state machine as reported in the execution traces. In other words, the naive estimate always assumes that the changes made to the control software will not influence its performance. 
We measured the accuracy of each estimation using the normalized squared error (SE):<disp-formula id="e7">
<mml:math id="me7">
<mml:mrow>
<mml:mtext>normalized&#xa0;SE</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>
</p>
<p>In the equation, <inline-formula id="inf63">
<mml:math id="minf63">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the measured average performance of the finite-state machine <italic>i</italic> over the set of executions <italic>E</italic> and <inline-formula id="inf64">
<mml:math id="minf64">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the estimated performance calculated from the traces generated by the finite-state machine <italic>b</italic> within the set of executions <italic>E</italic>. Moreover, we tested the results for significance using the Friedman rank sum&#x20;test.</p>
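As a concrete illustration, Eq. 7 reduces to a one-line computation; the values in the example are hypothetical:

```python
def normalized_se(measured, estimated):
    """Normalized squared error (Eq. 7): the squared difference between
    the measured average performance pi_i(E) and the estimate P_{i,b}(E),
    divided by the measured performance so that errors are comparable
    across finite-state machines with different performance levels."""
    return (measured - estimated) ** 2 / measured

# Example: a measured performance of 100 objects retrieved and an
# estimate of 80 give a normalized SE of (100 - 80)**2 / 100 = 4.0.
```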
<p>The results for the two experiments are shown in <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>, where we indicate with WIS the results obtained using weighted importance sampling and with PWIS those obtained using weighted importance sampling with the proportional reward. In both cases, PWIS and WIS yield better results than the naive estimation, with PWIS being significantly better in the state pruning experiment&#x2014;shown in <xref ref-type="fig" rid="F2">Figure&#x20;2A</xref>&#x2014;and WIS being significantly better in the parameter variation experiment&#x2014;shown in <xref ref-type="fig" rid="F2">Figure&#x20;2B</xref>. The larger estimation error shown by all methods in the state pruning experiment can be explained by the fact that, in this experiment, the finite-state machines undergo substantial modifications, which may invalidate the execution traces and lead to a performance estimate of 0. This is the case, for instance, when removing a state generates a finite-state machine with no transitions.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Results of the experiments, measured as normalized mean squared error. In <bold>(A)</bold>, we report the results of estimating the impact of removing a state; in <bold>(B)</bold>, we report those of estimating the impact of changing the probabilities of the two most active transitions. Naive indicates the naive estimation, while WIS and PWIS indicate the results obtained with weighted importance sampling using, respectively, the full and the proportional per-state reward.</p>
</caption>
<graphic xlink:href="frobt-08-625125-g002.tif"/>
</fig>
<p>Overall, the results indicate that PWIS gives a better estimate when changing the structure of a finite-state machine, while WIS is better suited to estimating the effect of variations in the parameters. In the first experiment, the proportional reward used in PWIS&#x2014;which includes a measure of the relative execution time of each state&#x2014;makes it better suited to estimating how the performance would change when removing a state. By contrast, in the parameter variation experiment, the changes to the parameters directly influence the execution time of the states, making PWIS less accurate than&#x20;WIS.</p>
</sec>
<sec id="s4">
<title>4 Discussion</title>
<p>In this paper, we applied off-policy evaluation to estimate the performance of a robot swarm whose control software is represented as a finite-state machine. Although the experiments deliver promising results, further experimentation is needed, considering different missions as well as different sets of software modules. Nonetheless, the results indicate that this line of research is promising, with several developments that could be explored, such as different reward calculations and different estimators. The execution traces can be extended to also record the performance metric of the swarm, so that more complex reward calculations can be implemented. The estimation can be improved by employing importance sampling methods such as the ones proposed by <xref ref-type="bibr" rid="B17">Jiang and Li (2016)</xref> and <xref ref-type="bibr" rid="B29">Thomas and Brunskill (2016)</xref>.</p>
<p>Moreover, this technique is not necessarily limited to finite-state machines: with some modifications, it could be extended to other modular control software architectures such as, for instance, behavior trees (<xref ref-type="bibr" rid="B18">Kuckling et&#x20;al., 2018</xref>). Another interesting application of this technique would be in automatic design methods based on iterative optimization algorithms. The execution time of these methods might be reduced by running simulations only when newly generated solutions have an estimated performance that is better than the current best&#x20;one.</p>
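The gating scheme just described can be sketched as follows; `estimate` and `simulate` are hypothetical placeholders for a cheap off-policy estimator and a full simulation run, respectively:

```python
def optimize(candidates, estimate, simulate):
    """Iterative design loop that spends simulation budget only on
    candidates whose cheap off-policy estimate beats the incumbent.

    candidates: iterable of control-software variants.
    estimate:   cheap off-policy performance estimate (e.g. WIS on traces).
    simulate:   expensive evaluation returning the measured performance.
    """
    best, best_perf = None, float("-inf")
    for cand in candidates:
        if estimate(cand) <= best_perf:
            continue                  # skip the costly simulation
        perf = simulate(cand)         # only promising candidates are run
        if perf > best_perf:
            best, best_perf = cand, perf
    return best, best_perf
```

The saving depends on the estimator's accuracy: an estimate that undershoots may skip a candidate that would in fact have been the new best, so in practice the gate would trade simulation time against the risk of discarding good solutions.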
</sec>
</body>
<back>
<sec id="s5">
<title>Data Availability Statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found as follows: <ext-link ext-link-type="uri" xlink:href="http://iridia.ulb.ac.be/supp/IridiaSupp2020-012/index.html">http://iridia.ulb.ac.be/supp/IridiaSupp2020-012/index.html</ext-link>.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>MB and FP discussed and developed the concept together. FP implemented the idea and conducted the experiments. Both authors contributed to the writing of the paper. The research was directed by MB.</p>
</sec>
<sec id="s7">
<title>Funding</title>
<p>The project has received funding from the European Research Council (ERC) under the European Union&#x2019;s Horizon 2020 research and innovation programme (DEMIURGE Project, grant agreement no. 681872) and from Belgium&#x2019;s Wallonia-Brussels Federation through the ARC Advanced Project GbO-Guaranteed by Optimization. MB acknowledges support from the Belgian Fonds de la Recherche Scientifique&#x2013;FNRS.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Balaprakash</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>St&#xfc;tzle</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2007</year>). &#x201c;<article-title>Improvement strategies for the F-Race algorithm: sampling design and iterative refinement</article-title>,&#x201d; in <source>Hybrid metaheuristics, 4th international workshop, HM 2007</source>. Editors <person-group person-group-type="editor">
<name>
<surname>Bartz-Beielstein</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Blesa</surname>
<given-names>M. J.</given-names>
</name>
<name>
<surname>Blum</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Naujoks</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Roli</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Rudolph</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<publisher-loc>Berlin, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>), <volume>4771</volume>. <fpage>108</fpage>&#x2013;<lpage>122</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-540-75514-2-9</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bertsekas</surname>
<given-names>D. P.</given-names>
</name>
<name>
<surname>Tsitsiklis</surname>
<given-names>J.&#x20;N.</given-names>
</name>
</person-group> (<year>1996</year>). <source>Neuro-dynamic programming</source>. <publisher-loc>Nashua NH</publisher-loc>: <publisher-name>Athena Scientific</publisher-name>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ligot</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bozhinoski</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Brambilla</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Francesca</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Garattoni</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Automatic off-line design of robot swarms: a manifesto</article-title>. <source>Front. Robot. AI.</source> <volume>6</volume>, <fpage>59</fpage>. <pub-id pub-id-type="doi">10.3389/frobt.2019.00059</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ligot</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hasselmann</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Disentangling automatic and semi-automatic approaches to the optimization-based design of control software for robot swarms</article-title>. <source>Nat. Mach Intell.</source> <volume>2</volume>, <fpage>494</fpage>&#x2013;<lpage>499</lpage>. <pub-id pub-id-type="doi">10.1038/s42256-020-0215-0</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2009</year>). <source>Tuning metaheuristics: a machine learning perspective</source>. <publisher-loc>Berlin, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>. <pub-id pub-id-type="doi">10.1007/978-3-642-00483-4</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brambilla</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Brutschy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Dorigo</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Property-driven design for robot swarms</article-title>. <source>ACM Trans. Auton. Adapt. Syst.</source> <volume>9</volume> (<issue>4</issue>), <fpage>1</fpage>&#x2013;<lpage>28</lpage>. <pub-id pub-id-type="doi">10.1145/2700318</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brambilla</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ferrante</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Dorigo</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Swarm robotics: a review from the swarm engineering perspective</article-title>. <source>Swarm Intell.</source> <volume>7</volume>, <fpage>1</fpage>&#x2013;<lpage>41</lpage>. <pub-id pub-id-type="doi">10.1007/s11721-012-0075-2</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bredeche</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Haasdijk</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Prieto</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Embodied evolution in collective robotics: a review</article-title>. <source>Front. Robot. AI.</source> <volume>5</volume>, <fpage>12</fpage>. <pub-id pub-id-type="doi">10.3389/frobt.2018.00012</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dorigo</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Brambilla</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Swarm robotics</article-title>. <source>Scholarpedia</source> <volume>9</volume>, <fpage>1463</fpage>. <pub-id pub-id-type="doi">10.4249/scholarpedia.1463</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dorigo</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Swarm intelligence</article-title>. <source>Scholarpedia</source> <volume>2</volume>, <fpage>1462</fpage>. <pub-id pub-id-type="doi">10.4249/scholarpedia.1462</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Francesca</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Automatic design of robot swarms: achievements and challenges</article-title>. <source>Front. Robot. AI.</source> <volume>3</volume>, <fpage>1</fpage>&#x2013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.3389/frobt.2016.00029</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Francesca</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Brambilla</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Brutschy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Garattoni</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Miletitch</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Podevijn</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2015</year>). <article-title>AutoMoDe-Chocolate: automatic design of control software for robot swarms</article-title>. <source>Swarm Intell.</source> <volume>9</volume>, <fpage>125</fpage>&#x2013;<lpage>152</lpage>. <pub-id pub-id-type="doi">10.1007/s11721-015-0107-9</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Francesca</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Brambilla</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Brutschy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Trianni</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>AutoMoDe: a novel approach to the automatic design of control software for robot swarms</article-title>. <source>Swarm Intell.</source> <volume>8</volume>, <fpage>89</fpage>&#x2013;<lpage>112</lpage>. <pub-id pub-id-type="doi">10.1007/s11721-014-0092-4</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Garattoni</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Swarm robotics</article-title>,&#x201d; in <source>Wiley encyclopedia of electrical and electronics engineering</source>. Editor <person-group person-group-type="editor">
<name>
<surname>Webster</surname>
<given-names>J.&#x20;G.</given-names>
</name>
</person-group> (<publisher-loc>Hoboken, NJ, United&#x20;States</publisher-loc>: <publisher-name>John Wiley &#x26; Sons</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1002/047134608X.W8312</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hammersley</surname>
<given-names>J.&#x20;M.</given-names>
</name>
<name>
<surname>Handscomb</surname>
<given-names>D. C.</given-names>
</name>
</person-group> (<year>1964</year>). <source>Monte Carlo methods</source>. <publisher-loc>North Yorkshire, United&#x20;Kingdom</publisher-loc>: <publisher-name>Methuen</publisher-name>. </citation>
</ref>
<ref id="B16">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hasselmann</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Ligot</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Francesca</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Garz&#xf3;n Ramos</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Salman</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kuckling</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <source>Reference models for AutoMoDe. Tech. Rep. TR/IRIDIA/2018-002, IRIDIA</source>. <publisher-loc>Belgium</publisher-loc>: <publisher-name>Universit&#xe9; Libre de Bruxelles</publisher-name>.</citation>
</ref>
<ref id="B17">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Doubly robust off-policy value evaluation for reinforcement learning</article-title>,&#x201d; in <conf-name>International conference on machine learning</conf-name>. <conf-loc>New York, NY</conf-loc>, <conf-date>June 19&#x2013;24, 2016</conf-date> (<publisher-loc>Burlington, Massachusetts</publisher-loc>: <publisher-name>Morgan Kaufmann</publisher-name>), <fpage>652</fpage>&#x2013;<lpage>661</lpage>. </citation>
</ref>
<ref id="B18">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kuckling</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ligot</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bozhinoski</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Behavior trees as a control architecture in the automatic modular design of robot swarms</article-title>,&#x201d; in <source>Swarm intelligence &#x2013; ants</source>. Editors <person-group person-group-type="editor">
<name>
<surname>Dorigo</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Blum</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Christensen</surname>
<given-names>A. L.</given-names>
</name>
<name>
<surname>Reina</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Trianni</surname>
<given-names>V.</given-names>
</name>
</person-group> (<publisher-loc>Cham, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>), <volume>Vol. 11172</volume>. <fpage>30</fpage>&#x2013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-00533-7-3</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lopes</surname>
<given-names>Y. K.</given-names>
</name>
<name>
<surname>Trenkwalder</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Leal</surname>
<given-names>A. B.</given-names>
</name>
<name>
<surname>Dodd</surname>
<given-names>T. J.</given-names>
</name>
<name>
<surname>Gro&#xdf;</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Supervisory control theory applied to swarm robotics</article-title>. <source>Swarm Intell.</source> <volume>10</volume>, <fpage>65</fpage>&#x2013;<lpage>97</lpage>. <pub-id pub-id-type="doi">10.1007/s11721-016-0119-0</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>L&#xf3;pez-Ib&#xe1;&#xf1;ez</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Dubois-Lacoste</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>P&#xe9;rez C&#xe1;ceres</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>St&#xfc;tzle</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>The irace package: iterated racing for automatic algorithm configuration</article-title>. <source>Operations Res. Perspect.</source> <volume>3</volume>, <fpage>43</fpage>&#x2013;<lpage>58</lpage>. <pub-id pub-id-type="doi">10.1016/j.orp.2016.09.002</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Pagnozzi</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Birattari</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Supplementary material for the paper: Off-policy evaluation of the performance of a robot swarm: importance sampling to assess potential modifications to the finite-state machine that controls the robots.</source> <comment>IRIDIA &#x2010; Supplementary Information. ISSN: 2684&#x2013;2041. Brussels, Belgium</comment>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pinciroli</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Trianni</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>O&#x2019;Grady</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Pini</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Brutschy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Brambilla</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2012</year>). <article-title>ARGoS: a modular, parallel, multi-engine simulator for multi-robot systems</article-title>. <source>Swarm Intell.</source> <volume>6</volume>, <fpage>271</fpage>&#x2013;<lpage>295</lpage>. <pub-id pub-id-type="doi">10.1007/s11721-012-0072-5</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Powell</surname>
<given-names>M. J.</given-names>
</name>
<name>
<surname>Swann</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>1966</year>). <article-title>Weighted uniform sampling - a Monte Carlo technique for reducing variance</article-title>. <source>IMA J.&#x20;Appl. Math.</source> <volume>2</volume>, <fpage>228</fpage>&#x2013;<lpage>236</lpage>. <pub-id pub-id-type="doi">10.1093/imamat/2.3.228</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Precup</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>R. S.</given-names>
</name>
<name>
<surname>Dasgupta</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2001</year>). &#x201c;<article-title>Off-policy temporal-difference learning with function approximation</article-title>,&#x201d; in <conf-name>International conference on machine learning</conf-name>. <conf-loc>New York, NY</conf-loc>, <conf-date>June 2, 2001</conf-date> (<publisher-loc>Burlington, Massachusetts</publisher-loc>: <publisher-name>Morgan Kaufmann)</publisher-name>, <fpage>417</fpage>&#x2013;<lpage>424</lpage>. </citation>
</ref>
<ref id="B25">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Precup</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>R. S.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2000</year>). &#x201c;<article-title>Eligibility traces for off-policy policy evaluation</article-title>,&#x201d; in <conf-name>International conference on machine learning</conf-name>. <conf-loc>New York, NY</conf-loc>, <conf-date>June 13, 2000</conf-date> (<publisher-loc>Burlington, MA</publisher-loc>: <publisher-name>Morgan Kaufmann</publisher-name>). <fpage>759</fpage>&#x2013;<lpage>766</lpage>. </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reina</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Valentini</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Fern&#xe1;ndez-Oto</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Dorigo</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Trianni</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A design pattern for decentralised decision making</article-title>. <source>PLOS ONE.</source> <volume>10</volume>, <fpage>e0140950</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0140950</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rubinstein</surname>
<given-names>R. Y.</given-names>
</name>
<name>
<surname>Kroese</surname>
<given-names>D. P.</given-names>
</name>
</person-group> (<year>1981</year>). <source>Simulation and the Monte Carlo method</source>. <publisher-loc>Hoboken, New Jersey</publisher-loc>: <publisher-name>John Wiley &#x26; Sons</publisher-name>.</citation>
</ref>
<ref id="B28">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sutton</surname>
<given-names>R. S.</given-names>
</name>
<name>
<surname>Barto</surname>
<given-names>A. G.</given-names>
</name>
</person-group> (<year>2018</year>). <source>Reinforcement learning: an introduction</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>.</citation>
</ref>
<ref id="B29">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Thomas</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Brunskill</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Data-efficient off-policy policy evaluation for reinforcement learning</article-title>,&#x201d; in <conf-name>International conference on machine learning</conf-name>. <conf-loc>New York, NY, United&#x20;States</conf-loc>, <conf-date>June 19&#x2013;24, 2016</conf-date> (<publisher-loc>Burlington, MA</publisher-loc>: <publisher-name>Morgan Kaufmann</publisher-name>), <fpage>2139</fpage>&#x2013;<lpage>2148</lpage>. </citation>
</ref>
</ref-list>
</back>
</article>