<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Sustain. Cities</journal-id>
<journal-title>Frontiers in Sustainable Cities</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Sustain. Cities</abbrev-journal-title>
<issn pub-type="epub">2624-9634</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frsc.2022.756539</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Sustainable Cities</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Effects of Smart Traffic Signal Control on Air Quality</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Fazzini</surname> <given-names>Paolo</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1390538/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Torre</surname> <given-names>Marco</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Rizza</surname> <given-names>Valeria</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Petracchini</surname> <given-names>Francesco</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1198088/overview"/>
</contrib>
</contrib-group>
<aff><institution>Institute of Atmospheric Pollution Research (IIA), National Research Council</institution>, <addr-line>Rome</addr-line>, <country>Italy</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Sergio Ulgiati, University of Naples Parthenope, Italy</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Danil Prokhorov, Other, Ann Arbor, United States; Nicola Milano, Italian National Research Council, Italy</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Paolo Fazzini <email>paolo.fazzini&#x00040;iia.cnr.it</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Urban Resource Management, a section of the journal Frontiers in Sustainable Cities</p></fn></author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>02</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>4</volume>
<elocation-id>756539</elocation-id>
<history>
<date date-type="received">
<day>10</day>
<month>08</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>01</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Fazzini, Torre, Rizza and Petracchini.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Fazzini, Torre, Rizza and Petracchini</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Adaptive traffic signal control (ATSC) in urban traffic networks is a challenging task because of the complicated dynamics that arise in traffic systems. In recent years, several approaches based on multi-agent deep reinforcement learning (MARL) have been studied experimentally. These approaches propose distributed techniques in which each signalized intersection is seen as an agent in a stochastic game whose purpose is to optimize the flow of vehicles in its vicinity. In this setting, the system evolves toward an equilibrium among the agents that proves beneficial for the whole traffic network. A recently developed multi-agent variant of the well-established advantage actor-critic (A2C) algorithm, called MA2C (multi-agent A2C), exploits the promising idea of communication among the agents. In this view, the agents share their strategies with neighboring agents, thereby stabilizing the learning process even when the agents grow in number and variety. We tested MA2C in two traffic networks located in Bologna (Italy) and found that its action translates into a significant decrease in the amount of pollutants released into the environment.</p></abstract>
<kwd-group>
<kwd>multi-agent systems</kwd>
<kwd>reinforcement learning</kwd>
<kwd>vehicle flow optimization</kwd>
<kwd>traffic emissions</kwd>
<kwd>machine learning</kwd>
</kwd-group>
<counts>
<fig-count count="12"/>
<table-count count="3"/>
<equation-count count="3"/>
<ref-count count="29"/>
<page-count count="10"/>
<word-count count="4845"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>The impact of air pollution on human health, whether from vehicular traffic or from industrial sources, has been shown to be largely detrimental. According to the World Health Organization (WHO), air pollution caused 4.2 million premature deaths worldwide in 2016 (WHO, <xref ref-type="bibr" rid="B27">2021</xref>). This mortality is due to exposure to fine particulate matter of 2.5 microns or less in diameter (PM2.5), which causes cardiovascular and respiratory disease, and cancers. The WHO has included polluted air among the top 10 health risks for our species. Respiratory diseases kill more people than alcohol and drugs and rank fourth among the leading causes of death (WHO, <xref ref-type="bibr" rid="B26">2002</xref>). Congested traffic in particular poses the greatest risks (Hermes, <xref ref-type="bibr" rid="B10">2012</xref>). To avoid congestion and traffic jams, various algorithms based on artificial intelligence have been proposed. These algorithms manage traffic signal control so as to favor a smooth vehicle flow. Established approaches include fuzzy logic (Gokulan and Srinivasan, <xref ref-type="bibr" rid="B9">2010</xref>), swarm intelligence (Teodorovi&#x00107;, <xref ref-type="bibr" rid="B23">2008</xref>), and reinforcement learning (Sutton and Barto, <xref ref-type="bibr" rid="B22">1998</xref>).</p>
<p>In the present work, we employ MA2C (Chu et al., <xref ref-type="bibr" rid="B6">2019</xref>), an instance of multi-agent reinforcement learning, as a signalized intersection controller in an area located in the immediate outskirts of the city of Bologna (Italy), namely the Andrea Costa area (Fazzini et al., <xref ref-type="bibr" rid="B8">2021</xref>). Our experiments focus on evaluating the variation in vehicle emissions when signalized intersections are coordinated with MA2C. The traffic network setting we adopted is based on Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>).</p>
<sec>
<title>1.1. Related Work</title>
<p>Traffic flow is increasing constantly with economic and social growth, and road congestion is a crucial issue in growing urban areas (Marini et al., <xref ref-type="bibr" rid="B16">2015</xref>; Rizza et al., <xref ref-type="bibr" rid="B20">2017</xref>). Machine learning methods such as reinforcement learning (Kuyer et al., <xref ref-type="bibr" rid="B12">2008</xref>; El-Tantawy and Abdulhai, <xref ref-type="bibr" rid="B7">2012</xref>; Bazzan and Kl&#x000FC;gl, <xref ref-type="bibr" rid="B3">2014</xref>; Mannion et al., <xref ref-type="bibr" rid="B15">2016</xref>) and other artificial intelligence techniques such as fuzzy logic algorithms (Gokulan and Srinivasan, <xref ref-type="bibr" rid="B9">2010</xref>) and swarm intelligence (Teodorovi&#x00107;, <xref ref-type="bibr" rid="B23">2008</xref>) have been applied to improve the management of street intersections regulated with traffic lights (signalized intersections). Arel et al. (<xref ref-type="bibr" rid="B1">2010</xref>) proposed a new approach combining a multi-agent system and reinforcement learning (RL), using a Q-learning algorithm with a neural network, and demonstrated its advantages in obtaining an efficient traffic signal control policy. Recently, specific interest has been shown in the applications of agent-based technologies to traffic and transportation engineering. As an example, Liang et al. (<xref ref-type="bibr" rid="B13">2019</xref>) studied traffic signal duration with a deep reinforcement learning model. Furthermore, Nishi et al. (<xref ref-type="bibr" rid="B18">2018</xref>) developed an RL-based traffic signal control method that employs a graph convolutional neural network, analysing a six-intersection area. In addition, Rezzai et al. (<xref ref-type="bibr" rid="B19">2018</xref>) proposed a new architecture based on multi-agent systems and RL algorithms to make the signal control system more autonomous, able to learn from its environment and make decisions to optimize road traffic. Wei et al. 
(<xref ref-type="bibr" rid="B25">2020</xref>) gave a complete overview of RL-based traffic signal control approaches, including the recent advances in deep RL-based traffic signal control methods. Wang et al. (<xref ref-type="bibr" rid="B24">2018</xref>) summarized in their review some technical characteristics and the current research status of the self-adaptive control methods used so far. Yau et al. (<xref ref-type="bibr" rid="B29">2017</xref>) and Mannion et al. (<xref ref-type="bibr" rid="B15">2016</xref>), instead, provide comprehensive surveys mainly of studies that predate the recent spread of deep reinforcement learning. The present work is mainly based on Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>). For our simulations, we replicated the Andrea Costa and Pasubio areas in pseudo-random and entirely random traffic conditions. Both areas are located in the western outskirts of Bologna (Italy) (Bieker et al., <xref ref-type="bibr" rid="B4">2015</xref>).</p>
</sec>
</sec>
<sec sec-type="materials and methods" id="s2">
<title>2. Materials and Methods</title>
<sec>
<title>2.1. Overview</title>
<p>In this work, we tested a multi-agent deep reinforcement learning (MARL) algorithm called Multi-Agent Advantage Actor-Critic (MA2C) (Chu et al., <xref ref-type="bibr" rid="B6">2019</xref>, <xref ref-type="bibr" rid="B5">2020</xref>) in simulated traffic settings located in the Bologna area (Fazzini et al., <xref ref-type="bibr" rid="B8">2021</xref>). Our goal is to evaluate its performance in terms of the amount of pollutants released into the environment. More specifically, our evaluation focuses on how MA2C, by controlling the logic of traffic lights, affects the coordination among the signalized intersections and consequently influences the number of vehicles queuing in their surroundings.</p>
<p>The problem of coordinating signalized intersections can be seen as a stochastic game: every <italic>agent</italic> (i.e., every signalized intersection) aims to minimize the number of queuing vehicles (<italic>reward</italic>)<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> by observing their behavior in its neighborhood (i.e., by observing its neighborhood <italic>state</italic>) and ultimately learns how to balance its <italic>action</italic> (by controlling traffic light switching) with the other agents. Notably, in MA2C each agent couples the observation of its neighbors' policies with the observation of its own state, and restricts the environment reward to its neighborhood (<xref ref-type="fig" rid="F1">Figure 1</xref>) (Fazzini et al., <xref ref-type="bibr" rid="B8">2021</xref>), yielding a mixed cooperative-competitive stochastic game.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Agent &#x0201C;i&#x0201D; reward and observation.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0001.tif"/>
</fig>
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, our setting is organized in a nested structure: a traffic network represents our environment, which in turn includes multiple traffic signalized intersections (agents). Every intersection contains one or more crossroads, each including a number of lanes.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Traffic network: the round (translucent-purple) spots mark the signalized intersections (agents). Each signalized intersection includes one or more crossroads, which are highlighted in a dark (green) color. The intersections not controlled by any agent are highlighted in a lighter (yellow) color.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0002.tif"/>
</fig>
<p>We start by reviewing the equations of multi-agent reinforcement learning. In Section 3, we detail our experiments and show our traffic networks. Finally (Section 4), we evaluate how the MA2C action translates in terms of pollutants released in the environment.</p>
</sec>
<sec>
<title>2.2. Multi-Agent Reinforcement Learning</title>
<p>As described in Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>), we base our formalism on the framework of recurrent policy gradients (Wierstra et al., <xref ref-type="bibr" rid="B28">2007</xref>): each agent learns a limited-memory stochastic policy &#x003C0;(<italic>u</italic><sub><italic>t</italic></sub> &#x02223; <italic>h</italic><sub><italic>t</italic></sub>), mapping sufficient statistics of a sequence of states <italic>h</italic><sub><italic>t</italic></sub> to probability distributions over actions <italic>u</italic><sub><italic>t</italic></sub>; once the optimal policy has been determined, it is adopted for signalized intersection coordination.</p>
<sec>
<title>2.2.1. Neighbor Agents</title>
<p>In a network represented by a graph <inline-formula><mml:math id="M1"><mml:mi>G</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">E</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M2"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:math></inline-formula> (vertices) is the set of the agents and <inline-formula><mml:math id="M3"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">E</mml:mi></mml:mrow></mml:math></inline-formula> (edges) is the set of their connections, agent <italic>i</italic> and agent <italic>j</italic> are neighbors if the number of edges separating them is less than or equal to a predefined threshold. In the adopted formalism: (1) agents and connections refer to signalized intersections; (2) the neighborhood of agent <italic>i</italic> is denoted as <inline-formula><mml:math id="M4"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and its local region is <inline-formula><mml:math id="M5"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> = <inline-formula><mml:math id="M6"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0222A;</mml:mo><mml:mi>i</mml:mi></mml:math></inline-formula>; and (3) the distance between any two agents is denoted as <italic>d</italic>(<italic>i, j</italic>), with <italic>d</italic>(<italic>i, i</italic>) &#x0003D; 0 and <italic>d</italic>(<italic>i, j</italic>) &#x0003D; 1 for any 
<inline-formula><mml:math id="M7"><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
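<p>The neighborhood definition above can be illustrated with a minimal Python sketch. It assumes the network is given as an adjacency list and that the threshold counts edges along the shortest path; the function name <italic>neighborhood</italic>, the parameter <italic>max_hops</italic>, and the four-intersection chain are hypothetical and are not taken from the MA2C implementation.</p>
<preformat><![CDATA[
```python
from collections import deque


def neighborhood(adjacency, agent, max_hops=1):
    """Return the agents within max_hops edges of `agent`, excluding the agent itself.

    `adjacency` maps each agent to the agents it is directly connected to.
    This is an illustrative breadth-first search, not the authors' code.
    """
    distance = {agent: 0}
    queue = deque([agent])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the threshold
        for other in adjacency[node]:
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)
    return {node for node in distance if node != agent}


# Hypothetical chain of four signalized intersections: A - B - C - D
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
n_b = neighborhood(adjacency, "B")   # neighbors of B with threshold 1
local_region = n_b | {"B"}           # local region: neighborhood plus the agent
```
]]></preformat>
<p>With a threshold of one edge, the neighborhood of B in this toy chain contains A and C, and its local region adds B itself.</p>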
</sec>
</sec>
<sec>
<title>2.3. System Architecture</title>
<p><xref ref-type="fig" rid="F3">Figure 3</xref> provides an overview of the system. The goal is to minimize the vehicle queues measured at signalized intersections. To this end, an agent keeps repeating the following steps (Fazzini et al., <xref ref-type="bibr" rid="B8">2021</xref>): (1) the artificial neural network (ANN) provides a policy for the traffic simulator given the perceived state <italic>s</italic><sub><italic>t</italic></sub> of the environment; (2) given the policy, a set of consecutive actions is selected (e.g., the simulator can be instructed to switch traffic lights at signalized intersections); (3) the simulator performs a few time steps following the current policy and stores the environment rewards, corresponding to the number of queuing vehicles in proximity of signalized intersections; and (4) the ANN uses the stored rewards to update its parameters in order to improve its policy.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>MA2C general scheme.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0003.tif"/>
</fig>
<p><xref ref-type="table" rid="T1">Table 1</xref> shows formally how states, actions, rewards and policies have been defined in our setting.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Settings.</p></caption>
<table frame="hsides" rules="groups">
<tbody><tr>
<td valign="top" align="left">Agents</td>
<td valign="top" align="left">Signalized intersections</td>
</tr>
<tr>
<td valign="top" align="left">States</td>
<td valign="top" align="left">Wave and fingerprints</td>
</tr>
<tr>
<td valign="top" align="left">Actions</td>
<td valign="top" align="left">Traffic lights settings (e.g., switching from red to green)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="table" rid="T1">Table 1</xref>, <italic>fingerprints</italic> denotes the current policies of the neighboring agents, each represented as the vector of probabilities of choosing each of the available actions; <italic>wave</italic> [veh] measures the total number of approaching vehicles along each incoming lane, within 50 m of a signalized intersection. The state is defined as <italic>s</italic><sub><italic>t, i</italic></sub> &#x0003D; {<italic>wave</italic><sub><italic>t</italic></sub>[<italic>l</italic><sub><italic>ji</italic></sub>]}<sub><italic>l</italic><sub><italic>ji</italic></sub> &#x02208; <italic>L</italic><sub><italic>i</italic></sub></sub>, where <italic>L</italic><sub><italic>i</italic></sub> is the set of lanes <italic>l</italic><sub><italic>ji</italic></sub> converging at a signalized intersection (agent) <italic>i</italic>; moreover, the fingerprints of other agents are added to complete the observation set.</p>
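<p>As a rough illustration of how such an observation could be assembled, the following Python sketch counts the vehicles within 50 m on each incoming lane and appends the neighbors' policy vectors. The function names, the lane and agent identifiers, and the representation of vehicle positions as distances from the intersection are assumptions made for this example only.</p>
<preformat><![CDATA[
```python
def wave(vehicle_distances, detection_range=50.0):
    """Count, per incoming lane, the vehicles within detection_range metres.

    `vehicle_distances` maps a lane id to the distances (in metres) of its
    vehicles from the signalized intersection (hypothetical input format).
    """
    return {
        lane: sum(1 for dist in distances if dist <= detection_range)
        for lane, distances in vehicle_distances.items()
    }


def observation(wave_counts, neighbor_policies):
    """Concatenate wave measurements with the neighbors' fingerprints."""
    obs = [float(wave_counts[lane]) for lane in sorted(wave_counts)]
    for name in sorted(neighbor_policies):
        obs.extend(neighbor_policies[name])  # action probabilities of a neighbor
    return obs


# Hypothetical intersection with two incoming lanes and one neighbor
positions = {"lane_north": [12.0, 48.5, 75.0], "lane_east": [5.0]}
fingerprints = {"agent_j": [0.7, 0.3]}
obs = observation(wave(positions), fingerprints)
```
]]></preformat>
<p>Lanes and neighbors are sorted by identifier here only to keep the layout of the observation vector stable between calls.</p>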
<p>In addition to the settings in <xref ref-type="table" rid="T1">Table 1</xref>, <inline-formula><mml:math id="M10"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">U</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the set of available actions for each agent <italic>i</italic>, defined as the set of all the possible red-green-yellow transitions available to each traffic light. The reward function at time <italic>t</italic> sums the queues (number of vehicles with speed less than 0.1 m/s) on the lanes converging at a given signalized intersection, computed at time <italic>t</italic> &#x0002B; &#x00394;<italic>t</italic>:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">E</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi mathvariant="italic">queue</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mo>&#x00394;</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>l</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
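<p>A minimal Python sketch of Equation (1) is given below, assuming that the simulator exposes the speed of every vehicle on each incoming lane at time <italic>t</italic> &#x0002B; &#x00394;<italic>t</italic>. The function names, lane identifiers, and speed values are hypothetical; only the 0.1 m/s threshold comes from the text.</p>
<preformat><![CDATA[
```python
QUEUE_SPEED_THRESHOLD = 0.1  # m/s, threshold stated in the text


def queue_length(speeds):
    """Number of vehicles on one lane slower than the queue threshold."""
    return sum(1 for speed in speeds if speed < QUEUE_SPEED_THRESHOLD)


def reward(lane_speeds):
    """Sum of the queue lengths over all lanes converging at one intersection.

    `lane_speeds` maps a lane id to the vehicle speeds (m/s) sampled at
    time t + delta_t (hypothetical input format).
    """
    return sum(queue_length(speeds) for speeds in lane_speeds.values())


# Hypothetical intersection with two incoming lanes
lane_speeds = {"lane_north": [0.0, 0.05, 8.3], "lane_east": [0.0, 13.9]}
r = reward(lane_speeds)
```
]]></preformat>
<p>The sketch returns the summed queue exactly as written in Equation (1); whether this quantity is negated before being passed to a learner that maximizes its return is left open here.</p>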
</sec>
<sec>
<title>2.4. ANN Detail</title>
<p>States, actions, next states, and rewards are collected in minibatches called experience buffers, one for each agent <italic>i</italic>: <italic>B</italic><sub><italic>i</italic></sub> &#x0003D; {(<italic>s</italic><sub><italic>t</italic></sub>, <italic>u</italic><sub><italic>t</italic></sub>, <italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub>, <italic>r</italic><sub><italic>t</italic></sub>)}<sub><italic>i</italic></sub>. They are stored while the traffic simulator performs a sequence of actions. Each buffer <italic>B</italic><sub><italic>i</italic></sub> reflects the experience trajectory of agent <italic>i</italic>. <xref ref-type="fig" rid="F4">Figure 4</xref> shows MA2C&#x00027;s architecture. The graph reflects the A2C formalism (Barto et al., <xref ref-type="bibr" rid="B2">1983</xref>; Mnih et al., <xref ref-type="bibr" rid="B17">2016</xref>); it therefore represents two different networks, one for the Actor (Policy) and one for the Critic (State-Value), whose respective parameters are hereafter referred to as &#x003B8; and &#x003C8;. As shown in the graph, the wave states and the fingerprint unit are fed to separate fully connected (FC) layers with a variable number of inputs, depending on the number of lanes converging at the controlled signalized intersection. The output of the FC layer (128 units) feeds the Long Short-Term Memory (LSTM) module, equipped with 64 outputs and 64 inner states (Fazzini et al., <xref ref-type="bibr" rid="B8">2021</xref>). The output of the LSTM module is linked to the network output, which in the Actor case is a policy vector (with softmax activation function) and in the Critic case is a State-Value (with linear activation function). All the activation functions in the previous modules are Rectified Linear Units (ReLU). In <xref ref-type="fig" rid="F4">Figure 4</xref>, the network biases are not depicted although present in each layer. For ANN training, an orthogonal initializer and a gradient optimizer of type RMSprop have been used. 
To prevent gradient explosion, all normalized states are clipped to [0, 2] and each gradient is capped at 40. Rewards are clipped to [-2, 2].</p>
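<p>The clipping rules stated above can be sketched in plain Python as follows. The text does not specify how each gradient is capped at 40; capping by the Euclidean norm is an assumption of this example, as are all function names.</p>
<preformat><![CDATA[
```python
import math


def clip(value, low, high):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, value))


def clip_states(states):
    """Clip normalized state entries to [0, 2], as stated in the text."""
    return [clip(s, 0.0, 2.0) for s in states]


def clip_reward(r):
    """Clip a reward to [-2, 2], as stated in the text."""
    return clip(r, -2.0, 2.0)


def clip_gradient_by_norm(grad, max_norm=40.0):
    """Rescale a gradient vector so its Euclidean norm does not exceed max_norm.

    Norm-based capping is an assumption; the text only says the gradient
    is capped at 40.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


clipped_states = clip_states([-0.5, 1.2, 3.0])
clipped_grad = clip_gradient_by_norm([30.0, 40.0])  # original norm is 50
```
]]></preformat>
<p>Rescaling by the norm preserves the direction of the gradient, which is one common reason to prefer it over clipping each component separately.</p>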
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Scheme of the two ANNs (Actor and Critic). Apart from the output layer, the two ANNs share the same architecture but are distinct entities with different weights. The <italic>Wave</italic> and <italic>Fingerprints</italic> inputs are processed by two fully connected (FC) layers with 64 and 128 units. The outputs of the FC layers feed the LSTM unit (64 inputs), whose output is a state value in the first case (first ANN, Critic) and a policy in the second case (second ANN, Actor).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0004.tif"/>
</fig>
</sec>
<sec>
<title>2.5. Multi-Agent Advantage Actor-Critic (MA2C)</title>
<p>MA2C (Chu et al., <xref ref-type="bibr" rid="B6">2019</xref>) is characterized by a stable learning process due to communication among agents belonging to the same neighborhood: a spatial discount factor weakens the reward signals from agents other than agent <italic>i</italic> in the loss function, and agents not in <inline-formula><mml:math id="M12"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are not considered in the reward computation. The relevant expressions for the loss functions governing the training optimization algorithm are:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:mo class="qopname">log</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo class="qopname">&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi 
mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>&#x000C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo class="qopname">log</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo 
class="qopname">&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E3"><label>(3)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi 
mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In the above equations:</p>
<list list-type="bullet">
<list-item><p><inline-formula><mml:math id="M15"><mml:msub><mml:mrow><mml:mi>&#x000C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
<list-item><p><inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="true">)</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
<list-item><p><inline-formula><mml:math id="M17"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo></mml:math></inline-formula> <inline-formula><mml:math id="M18"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msup><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>-</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></p></list-item>
<list-item><p><inline-formula><mml:math id="M19"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02223;</mml:mo></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder><mml:mi>&#x003B1;</mml:mi><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
<list-item><p><inline-formula><mml:math id="M20"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x022C3;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
<list-item><p><inline-formula><mml:math id="M21"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x022C3;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
<list-item><p><inline-formula><mml:math id="M22"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
<list-item><p><inline-formula><mml:math id="M23"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:math></inline-formula></p></list-item>
<list-item><p><inline-formula><mml:math id="M24"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x022C3;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>.</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>&#x022C3;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula>&#x022C3;&#x003B1;{<italic>s</italic><sub><italic>t, j</italic></sub>}] with <inline-formula><mml:math id="M25"><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></p></list-item>
</list>
<p>Here, &#x003C0;<sub>&#x003B8;<sub><italic>i</italic></sub></sub> refers to the policy to be learned by determining the parameters &#x003B8;<sub><italic>i</italic></sub> associated with agent <italic>i</italic>; <inline-formula><mml:math id="M26"><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> are the policies of agent <italic>i</italic>&#x00027;s neighbor agents at time <italic>t</italic>; <italic>u</italic><sub><italic>t, i</italic></sub> is the action taken by agent <italic>i</italic> at time <italic>t</italic>; <inline-formula><mml:math id="M27"><mml:msubsup><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the history of the past states of agent <italic>i</italic> at time <italic>t</italic> following the policy &#x003C0;<sub>&#x003B8;<sub><italic>i</italic></sub></sub>; and <italic>r</italic><sub><italic>t, i</italic></sub> is an evaluation of the average queue at signalized intersection (agent) <italic>i</italic> at time <italic>t</italic><xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>.</p>
<p>The spatial discount factor &#x003B1; down-weights the rewards of the other agents, and <italic>D</italic><sub><italic>i</italic></sub> is the limit of agent <italic>i</italic>&#x00027;s neighborhood.</p>
<p>Equation (3) yields a stable learning process since (a) fingerprints <inline-formula><mml:math id="M28"><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> are input to <italic>V</italic><sub>&#x003C8;<sub><italic>i</italic></sub></sub> to take <inline-formula><mml:math id="M29"><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo></mml:mrow></mml:msubsup></mml:mrow></mml:msub></mml:math></inline-formula> into account, and (b) the spatially discounted return <inline-formula><mml:math id="M30"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is more correlated to local region observations <inline-formula><mml:math id="M31"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">V</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
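<p>As a minimal sketch of the definitions above (and not the implementation used in this work), the following Python snippet computes the spatially discounted reward, the bootstrapped return, and the advantage for a single agent. The function names, the dictionary of rewards, and the toy numbers are illustrative assumptions; the set <italic>V</italic><sub><italic>i</italic></sub> is assumed to contain agent <italic>i</italic> together with its neighbors.</p>
<preformat>
```python
def spatial_reward(rewards, i, neighbors, alpha):
    """Spatially discounted reward: own reward plus alpha-weighted
    neighbor rewards, divided by the size of V_i (agent i and neighbors)."""
    total = rewards[i] + sum(alpha * rewards[j] for j in neighbors)
    return total / (1 + len(neighbors))


def bootstrapped_returns(r_tilde, gamma, v_bootstrap):
    """Return R~ for every step of a batch, working backwards from the
    critic estimate at the end of the batch (time t_B)."""
    returns = []
    running = v_bootstrap
    for r in reversed(r_tilde):
        # R_t = r_t + gamma * R_{t+1}, which unrolls to the discounted sum
        # of rewards plus gamma^(t_B - t) times the bootstrap value.
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage: return minus the critic value at the same step."""
    return [ret - val for ret, val in zip(returns, values)]


# Toy example (all numbers are made up for illustration).
rewards_t = {"a": -4.0, "b": -2.0, "c": -6.0}
r_a = spatial_reward(rewards_t, "a", ["b", "c"], alpha=0.9)
batch_returns = bootstrapped_returns([1.0, 1.0], gamma=0.5, v_bootstrap=4.0)
batch_advantages = advantages(batch_returns, [2.0, 2.0])
print(r_a, batch_returns, batch_advantages)
```
</preformat>
<p>The backward recursion avoids recomputing powers of &#x003B3; for every step of the batch; with &#x003B1; = 0 the reward reduces to the agent&#x00027;s own reward divided by the neighborhood size.</p>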
</sec>
</sec>
<sec id="s3">
<title>3. Calculation</title>
<p>We trained and evaluated MA2C in two traffic environments replicating two districts in the Bologna area (Andrea Costa and Pasubio), simulated in SUMO (Lopez et al., <xref ref-type="bibr" rid="B14">2018</xref>).</p>
<sec>
<title>3.1. Training and Evaluation</title>
<p>A relevant finding in Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>) is that pseudo-random training (where the same seed is applied to the random vehicle trip generation, so that vehicles repeat the same paths across training episodes) shapes robust policies that can also cope with completely random trips (generated with different seeds in different episodes). In fact, Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>) report that all the evaluations performed with various seeds (and therefore various random sequences of trips) show consistent behavior when using MA2C, with the insertion of both 2,000 and 3,600 vehicles. Moreover, such policies have proven effective even when the total number of vehicles inserted during evaluation differs from the number inserted during training, indicating that a learned policy does not depend on this parameter. Consequently, in the experiments detailed in this work, we adopted pseudo-random training. In our setting, every episode of the SUMO simulation consists of 3,600 time steps; at each time step, a vehicle is inserted in the traffic network with a pseudo-random Origin-Destination (OD) pair until a total of 2,000 vehicles is reached. The algorithm&#x00027;s performance is measured by the vehicle queues at the intersections, which are linked to the DP reward by Equation (1). Such queues are estimated by SUMO for each crossing (reward) and then processed following the equations in Section 2.5. The algorithm is trained over 1 M training steps, each divided into 720 time steps; consequently, every SUMO episode consists of 5 training steps. For the evaluation, we adopt the same settings as in training, although the vehicle trips are generated with a different random seed.</p>
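<p>The distinction between pseudo-random training and evaluation with a different seed, together with the episode arithmetic stated above, can be illustrated with the following sketch. It is a stand-in written with the Python standard library and does not use the SUMO trip generator; the node labels, the seed values, and the function name <italic>generate_trips</italic> are hypothetical, and the number of episodes is only what the stated figures imply when taken at face value.</p>
<preformat>
```python
import random

EPISODE_TIME_STEPS = 3600            # time steps per SUMO episode (from the text)
TIME_STEPS_PER_TRAINING_STEP = 720   # time steps per training step (from the text)
TOTAL_TRAINING_STEPS = 1_000_000     # "1 M training steps" (from the text)
VEHICLES_PER_EPISODE = 2000          # one vehicle inserted per time step

# Hypothetical node labels; the real networks use SUMO identifiers.
NODES = ["n0", "n1", "n2", "n3", "n4"]


def generate_trips(seed, n_vehicles=VEHICLES_PER_EPISODE, nodes=NODES):
    """Return (insertion_step, origin, destination) tuples for one episode."""
    rng = random.Random(seed)  # a fixed seed reproduces the same trips
    trips = []
    for step in range(n_vehicles):
        origin, destination = rng.sample(nodes, 2)  # two distinct nodes
        trips.append((step, origin, destination))
    return trips


training_steps_per_episode = EPISODE_TIME_STEPS // TIME_STEPS_PER_TRAINING_STEP
implied_episodes = TOTAL_TRAINING_STEPS // training_steps_per_episode
idle_time_steps = EPISODE_TIME_STEPS - VEHICLES_PER_EPISODE

# Pseudo-random training: every episode reuses the same seed.
training_trips = [generate_trips(seed=42) for _ in range(3)]
# Evaluation: a different seed yields a different trip sequence.
evaluation_trips = generate_trips(seed=7)

print(training_steps_per_episode, implied_episodes, idle_time_steps)
print(training_trips[0] == training_trips[1])
print(training_trips[0] == evaluation_trips)
```
</preformat>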
</sec>
<sec>
<title>3.2. Initial Conditions</title>
<p>During training, since the vehicle trips are generated in a pseudo-random fashion, randomness comes only from the choice of the initial conditions for the ANN weights. Here, the only constraint is that such weights are initialized as orthogonal matrices (Saxe et al., <xref ref-type="bibr" rid="B21">2014</xref>). <xref ref-type="fig" rid="F5">Figures 5</xref>, <xref ref-type="fig" rid="F6">6</xref> show the effect of different initial conditions on the learning process in terms of the number of vehicles queuing at the controlled signalized intersections (<italic>y</italic>-axis). The opaque (green) graph shows the best learning curve among 10 training attempts, which are shown in translucent shades.</p>
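<p>To make the notion of orthogonal initialization concrete, the sketch below builds a random square orthogonal matrix by applying Gram-Schmidt orthonormalization to Gaussian random vectors. This is only an illustration in plain Python under our own assumptions: the function name and the seeds are hypothetical, and a deep learning framework would normally supply its own orthogonal initializer. Different seeds play the role of the different initial conditions compared in the learning curves.</p>
<preformat>
```python
import random


def orthogonal_matrix(n, seed=0):
    """Return an n x n matrix (list of rows) with orthonormal rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) != n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far.
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


# Ten seeds give ten different, but always orthogonal, starting points.
initial_weights = [orthogonal_matrix(4, seed=s) for s in range(10)]
print(initial_weights[0][0])
```
</preformat>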
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Andrea Costa, learning curves.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0005.tif"/>
</fig>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Pasubio, learning curves.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0006.tif"/>
</fig>
<p>In the following evaluations, the best learnt policies are adopted to coordinate the agents (signalized intersections).</p>
</sec>
<sec>
<title>3.3. Parameter Settings</title>
<p>The DP is finally instantiated with the settings listed in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Hyperparameter settings.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Par</bold>.</th>
<th valign="top" align="center"><bold>Value</bold></th>
<th valign="top" align="left"><bold>Description</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003B1;</td>
<td valign="top" align="center">0.9</td>
<td valign="top" align="left">Space weighting factor</td>
</tr>
<tr>
<td valign="top" align="left"><italic>T</italic><sub><italic>s</italic></sub></td>
<td valign="top" align="center">3600 [s]</td>
<td valign="top" align="left">Period of simulated traffic</td>
</tr>
<tr>
<td valign="top" align="left">&#x00394;t</td>
<td valign="top" align="center">5 [s]</td>
<td valign="top" align="left">Interaction time between each agent and the traffic environment</td>
</tr>
<tr>
<td valign="top" align="left"><italic>t</italic><sub><italic>y</italic></sub></td>
<td valign="top" align="center">2 [s]</td>
<td valign="top" align="left">Yellow time</td>
</tr>
<tr>
<td valign="top" align="left"><italic>N</italic><sub><italic>v</italic></sub></td>
<td valign="top" align="center">2,000 or 3,600 [veh]</td>
<td valign="top" align="left">Total number of vehicles</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B3;</td>
<td valign="top" align="center">0.99</td>
<td valign="top" align="left">Discount factor, controlling how much expected future reward is weighted</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B7;<sub>&#x003B8;</sub></td>
<td valign="top" align="center">5 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="left">Coefficient for <inline-formula><mml:math id="M8"><mml:mo>&#x02207;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, used for gradient descent optimization</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B7;<sub>&#x003C8;</sub></td>
<td valign="top" align="center">2.5 &#x000D7; 10<sup>&#x02212;4</sup></td>
<td valign="top" align="left">Coefficient for <inline-formula><mml:math id="M9"><mml:mo>&#x02207;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td valign="top" align="left">&#x02223;<italic>B</italic>&#x02223;</td>
<td valign="top" align="center">40</td>
<td valign="top" align="left">Size of the batch buffer</td>
</tr>
<tr>
<td valign="top" align="left">&#x003B2;</td>
<td valign="top" align="center">0.01</td>
<td valign="top" align="left">Parameter to balance the entropy loss of policy &#x003C0;<sub>&#x003B8;<sub><italic>i</italic></sub></sub> to encourage early-stage exploration</td>
</tr>
<tr>
<td valign="top" align="left">&#x003BE;<sub><italic>M</italic></sub></td>
<td valign="top" align="center">0.5</td>
<td valign="top" align="left">Critic loss weight</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The above values follow the implementation in Chu et al. (<xref ref-type="bibr" rid="B6">2019</xref>)</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>The size of the batch indirectly sets the <italic>n</italic> parameter of the <italic>n</italic>-step return appearing in Equations (2) and (3) and has been chosen to balance the complementary characteristics of TD and Monte-Carlo methods (Sutton and Barto, <xref ref-type="bibr" rid="B22">1998</xref>).</p>
</sec>
<sec>
<title>3.4. Traffic Networks</title>
<p>Our experiments were conducted in the following traffic networks (Fazzini et al., <xref ref-type="bibr" rid="B8">2021</xref>).</p>
<sec>
<title>3.4.1. Bologna - Andrea Costa</title>
<p><xref ref-type="fig" rid="F7">Figure 7</xref> (left) shows the Bologna&#x02014;Andrea Costa neighborhood (Bieker et al., <xref ref-type="bibr" rid="B4">2015</xref>).</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Andrea Costa.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0007.tif"/>
</fig>
<p>The round (translucent purple) spots mark the signalized intersections (agents). Each signalized intersection includes one or more crossroads, which are highlighted in a dark (green) color. The intersections not controlled by any agent are highlighted in a lighter (yellow) color. The right side of the figure shows how each agent is connected to the others, as required by MA2C fingerprint communication and reward computation. The set of all the agents connected to a single agent constitutes its neighborhood. For this pseudo-random simulation, 2,000 vehicles were inserted in the traffic network, one at each time step in the time interval [0, 2,000], while no vehicle is inserted during the 1,600 remaining episode time steps.</p>
</sec>
<sec>
<title>3.4.2. Bologna&#x02014;Pasubio</title>
<p><xref ref-type="fig" rid="F8">Figure 8</xref> (left) shows the Bologna&#x02014;Pasubio neighborhood (Bieker et al., <xref ref-type="bibr" rid="B4">2015</xref>). As in the Andrea Costa case, the right-hand side of the figure shows how the agents have been connected.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Pasubio.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0008.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec sec-type="results" id="s4">
<title>4. Results</title>
<p>In this section, we evaluate how MA2C performance translates into emissions.</p>
<p>As described in the above sections, our typical traffic simulation spans 3,600 time steps, with an interaction time of 5 s between each agent and the traffic environment (<xref ref-type="table" rid="T2">Table 2</xref>):</p>
<list list-type="bullet">
<list-item><p>In the first part of the simulation (time steps [0, 2,000]) a vehicle is pseudo-randomly inserted on the map for each time step and follows a pseudo-random path.</p></list-item>
<list-item><p>In the second part of the simulation (time steps [2,000, 3,600]) no vehicle is inserted. Eventually, all the vehicles circulating on the map leave through one of the exit lanes or end their journey by reaching their destination.</p></list-item>
</list>
<p>Training, convergence, and robustness toward random testing are detailed in Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>). In this section, we focus exclusively on the effects of traffic signal control by MA2C on vehicle circulation.</p>
<p><xref ref-type="fig" rid="F9">Figure 9</xref> shows the number of running vehicles in the time span [0, 3,600] for the cases <italic>Andrea Costa</italic> and <italic>Pasubio</italic> (Fazzini et al., <xref ref-type="bibr" rid="B8">2021</xref>).</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Running vehicles over time. <bold>(A)</bold> Andrea Costa. <bold>(B)</bold> Pasubio.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0009.tif"/>
</fig>
<p>The curve referring to the case where no coordination is performed among the agents (No Sync case) shows that, due to heavy queuing at the traffic lights, several vehicles stay on the road after time step 2,000. In the graph, the curve keeps rising while vehicles are injected and decreases only slowly afterwards. In contrast, when MA2C coordinates the agents using the learnt policy (Sync case), the number of running vehicles drops quickly toward zero after time step 2,000. This finding has a direct impact on the amount of emissions, as shown in the following sections.</p>
<sec>
<title>4.1. NO<sub>x</sub> Emissions</title>
<p>Emissions have been computed following the emission model implemented within SUMO (Krajzewicz et al., <xref ref-type="bibr" rid="B11">2014</xref>). The graphs and tables reported here come from evaluating a policy that converged to the ideal behavior during training, as shown in the training graphs. As in Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>), the evaluation of such a policy shows no dependence on the seed used to generate the random vehicle trips.</p>
<p><xref ref-type="fig" rid="F10">Figures 10</xref>, <xref ref-type="fig" rid="F11">11</xref> display the NO<sub>x</sub> emissions normalized in time and street length (g/h/km) with (Sync) and without (no Sync) synchronization<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref>.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>NO<sub>x</sub> emissions, time interval [0, 1000]. <bold>(A)</bold> Andrea Costa, no sync. <bold>(B)</bold> Pasubio, no sync. <bold>(C)</bold> Andrea Costa, sync. <bold>(D)</bold> Pasubio, sync.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0010.tif"/>
</fig>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>NO<sub>x</sub> emissions, time interval [2000, 3600]. <bold>(A)</bold> Andrea Costa, no sync. <bold>(B)</bold> Pasubio, no sync. <bold>(C)</bold> Andrea Costa, sync. <bold>(D)</bold> Pasubio, sync.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0011.tif"/>
</fig>
<p>It appears evident that, in the No Sync case, the amount of emissions stays almost constant over the time intervals considered. A closer look reveals a slight increase of the emissions with time.</p>
<p>In the Sync case, the pictures highlight that the emissions are significantly lower than in the previous case: they decrease markedly in the [2,000, 3,600] interval, when no new vehicle is injected on the road and the traffic eventually fades out. This decrease is absent in the No Sync case (<xref ref-type="fig" rid="F11">Figure 11</xref>).</p>
<p>Finally, <xref ref-type="table" rid="T3">Table 3</xref> and <xref ref-type="fig" rid="F12">Figure 12</xref> show the overall decrease in pollution and fuel consumption between the cases with No Sync and Sync for both Andrea Costa (AC) and Pasubio (P).</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Overall emissions and fuel consumption.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>CO<sub>2</sub> [kg]</bold></th>
<th valign="top" align="left"><bold>CO [kg]</bold></th>
<th valign="top" align="left"><bold>NO<sub>x</sub> [g]</bold></th>
<th valign="top" align="left"><bold>PM<sub>x</sub> [g]</bold></th>
<th valign="top" align="left"><bold>HC [g]</bold></th>
<th valign="top" align="left"><bold>Fuel [L]</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">No sync (AC)</td>
<td valign="top" align="left">4753</td>
<td valign="top" align="left">281</td>
<td valign="top" align="left">2162</td>
<td valign="top" align="left">116</td>
<td valign="top" align="left">1391</td>
<td valign="top" align="left">2043</td>
</tr>
<tr>
<td valign="top" align="left">Sync (AC)</td>
<td valign="top" align="left">770</td>
<td valign="top" align="left">22</td>
<td valign="top" align="left">324</td>
<td valign="top" align="left">15</td>
<td valign="top" align="left">120</td>
<td valign="top" align="left">331</td>
</tr>
<tr>
<td valign="top" align="left">No sync (P)</td>
<td valign="top" align="left">5129</td>
<td valign="top" align="left">306</td>
<td valign="top" align="left">2336</td>
<td valign="top" align="left">126</td>
<td valign="top" align="left">1514</td>
<td valign="top" align="left">2204</td>
</tr>
<tr>
<td valign="top" align="left">Sync (P)</td>
<td valign="top" align="left">921</td>
<td valign="top" align="left">31</td>
<td valign="top" align="left">392</td>
<td valign="top" align="left">19</td>
<td valign="top" align="left">165</td>
<td valign="top" align="left">331</td>
</tr>
</tbody>
</table>
</table-wrap>
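<p>The relative reductions implied by Table 3 can be worked out with a few lines of Python. The snippet below is only a convenience sketch: the numbers are copied from the table as printed, while the variable and function names are our own illustrative choices.</p>
<preformat>
```python
# Values copied from Table 3.
# Columns: CO2 [kg], CO [kg], NOx [g], PMx [g], HC [g], fuel [L].
QUANTITIES = ["CO2", "CO", "NOx", "PMx", "HC", "Fuel"]
TABLE_3 = {
    "Andrea Costa": {
        "no_sync": [4753, 281, 2162, 116, 1391, 2043],
        "sync": [770, 22, 324, 15, 120, 331],
    },
    "Pasubio": {
        "no_sync": [5129, 306, 2336, 126, 1514, 2204],
        "sync": [921, 31, 392, 19, 165, 331],
    },
}


def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def reductions(table):
    """Map each network and quantity to its No Sync to Sync reduction."""
    result = {}
    for network, rows in table.items():
        result[network] = {
            name: percent_reduction(b, a)
            for name, b, a in zip(QUANTITIES, rows["no_sync"], rows["sync"])
        }
    return result


for network, values in reductions(TABLE_3).items():
    for name, value in values.items():
        print(f"{network:12s} {name:4s} {value:5.1f} % lower with Sync")
```
</preformat>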
<fig id="F12" position="float">
<label>Figure 12</label>
<caption><p>Emissions and fuel consumption.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frsc-04-756539-g0012.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="discussion" id="s5">
<title>5. Discussion</title>
<p>In this work, we have evaluated a recently developed MARL approach, MA2C (Chu et al., <xref ref-type="bibr" rid="B6">2019</xref>, <xref ref-type="bibr" rid="B5">2020</xref>), in terms of emission reduction induced in a controlled traffic network. As an ATSC benchmark, we adopted digital representations of the Andrea Costa and Pasubio areas (Bologna, Italy) (Bieker et al., <xref ref-type="bibr" rid="B4">2015</xref>).</p>
<p>We showed that when signalized intersections are coordinated using MA2C, traffic emissions and fuel consumption decrease significantly with respect to the case without such coordination. In the Andrea Costa scenario, for instance, fuel consumption drops from 2043 L without coordination to 331 L with it.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>PF: project administration, conceptualization, methodology, formal analysis, investigation, software, validation, visualization, and writing (original draft preparation). MT: pollution data visualization and validation, and bibliography management. VR: &#x02018;Related work&#x02019; section. FP: supervision and resources.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s8">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec> </body>
<back>

<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arel</surname> <given-names>I.</given-names></name> <name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Urbanik</surname> <given-names>T.</given-names></name> <name><surname>Kohls</surname> <given-names>A.</given-names></name></person-group> (<year>2010</year>) <article-title>Reinforcement learning-based multi-agent system for network traffic signal control.</article-title> <source>IET Intell. Transp. Syst.</source> <volume>4</volume>, <fpage>128</fpage>. <pub-id pub-id-type="doi">10.1049/iet-its.2009.0070</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Barto</surname> <given-names>A. G.</given-names></name> <name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Anderson</surname> <given-names>C. W.</given-names></name></person-group> (<year>1983</year>). <article-title>Neuronlike adaptive elements that can solve difficult learning control problems</article-title>. <source>IEEE Trans. Syst. Man Cybern.</source> SMC-13(5), <fpage>834</fpage>&#x02013;<lpage>846</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bazzan</surname> <given-names>A. L. C.</given-names></name> <name><surname>Kl&#x000FC;gl</surname> <given-names>F.</given-names></name></person-group> (<year>2014</year>) <article-title>A review on agent-based technology for traffic and transportation.</article-title> <source>Knowl. Eng. Rev.</source> <volume>29</volume>, <fpage>375</fpage>&#x02013;<lpage>403</lpage>. <pub-id pub-id-type="doi">10.1017/S0269888913000118</pub-id><pub-id pub-id-type="pmid">30886898</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bieker</surname> <given-names>L.</given-names></name> <name><surname>Krajzewicz</surname> <given-names>D.</given-names></name> <name><surname>Morra</surname> <given-names>A. P.</given-names></name> <name><surname>Michelacci</surname> <given-names>C.</given-names></name> <name><surname>Cartolano</surname> <given-names>F.</given-names></name></person-group> (<year>2015</year>). <article-title>Traffic simulation for all: a real world traffic scenario from the city of Bologna</article-title>. <source>Lecture Notes in Control and Information Sciences</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <volume>13</volume>, <fpage>47</fpage>&#x02013;<lpage>60</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-15024-6_4</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chu</surname> <given-names>T.</given-names></name> <name><surname>Chinchali</surname> <given-names>S.</given-names></name> <name><surname>Katti</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>Multi-agent reinforcement learning for networked system control</article-title>. <source>arXiv</source>: 2004.01339v2.</citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chu</surname> <given-names>T.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Codec&#x000E0;</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name></person-group> (<year>2019</year>). <article-title>Multi-agent deep reinforcement learning for large-scale traffic signal control</article-title>. <source>IEEE Transactions on Intelligent Transportation Systems Vol. 21</source>. (<publisher-name>IEEE</publisher-name>), <fpage>1086</fpage>&#x02013;<lpage>1095</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2019.2901791</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>El-Tantawy</surname> <given-names>S.</given-names></name> <name><surname>Abdulhai</surname> <given-names>B.</given-names></name></person-group> (<year>2012</year>) <article-title>Multi-agent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC),</article-title> in <source>2012 15th International IEEE Conference on Intelligent Transportation Systems</source> (<publisher-loc>Anchorage, AK</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>319</fpage>&#x02013;<lpage>326</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fazzini</surname> <given-names>P.</given-names></name> <name><surname>Wheeler</surname> <given-names>I.</given-names></name> <name><surname>Petracchini</surname> <given-names>F.</given-names></name></person-group> (<year>2021</year>). <article-title>Traffic signal control with communicative deep reinforcement learning agents: a case study</article-title>. <source>CoRR</source>, abs/2107.01347.</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gokulan</surname> <given-names>B. P.</given-names></name> <name><surname>Srinivasan</surname> <given-names>D.</given-names></name></person-group> (<year>2010</year>) <article-title>Distributed geometric fuzzy multiagent urban traffic signal control.</article-title> <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>11</volume>, <fpage>714</fpage>&#x02013;<lpage>727</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2010.2050688</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Hermes</surname> <given-names>J.</given-names></name></person-group> (<year>2012</year>) How traffic jams affect air quality. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.environmentalleader.com/2012/01/how-traffic-jams-affect-air-quality">https://www.environmentalleader.com/2012/01/how-traffic-jams-affect-air-quality</ext-link>.</citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krajzewicz</surname> <given-names>D.</given-names></name> <name><surname>Hausberger</surname> <given-names>S.</given-names></name> <name><surname>Wagner</surname> <given-names>P.</given-names></name> <name><surname>Behrisch</surname> <given-names>M.</given-names></name> <name><surname>Krumnow</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Second generation of pollutant emission models for SUMO,</article-title> in <source>Modeling Mobility with Open Data Vol. 13</source> (<publisher-name>Springer</publisher-name>), <fpage>203</fpage>&#x02013;<lpage>221</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kuyer</surname> <given-names>L.</given-names></name> <name><surname>Whiteson</surname> <given-names>S.</given-names></name> <name><surname>Bakker</surname> <given-names>B.</given-names></name> <name><surname>Vlassis</surname> <given-names>N.</given-names></name></person-group> (<year>2008</year>) <article-title>Multiagent reinforcement learning for urban traffic control using coordination graphs.</article-title> in <source>Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science</source>, eds <person-group person-group-type="editor"><name><surname>Daelemans</surname> <given-names>W.</given-names></name> <name><surname>Goethals</surname> <given-names>B.</given-names></name> <name><surname>Morik</surname> <given-names>K.</given-names></name></person-group>, Vol. <volume>5211</volume> (<publisher-loc>Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>656</fpage>&#x02013;<lpage>671</lpage>.</citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liang</surname> <given-names>X.</given-names></name> <name><surname>Du</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Han</surname> <given-names>Z.</given-names></name></person-group> (<year>2019</year>) <article-title>A deep reinforcement learning network for traffic light cycle control.</article-title> <source>IEEE Trans. Veh. Technol.</source> <volume>68</volume>, <fpage>1243</fpage>&#x02013;<lpage>1253</lpage>. <pub-id pub-id-type="doi">10.1109/TVT.2018.2890726</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lopez</surname> <given-names>P. A.</given-names></name> <name><surname>Wiessner</surname> <given-names>E.</given-names></name> <name><surname>Behrisch</surname> <given-names>M.</given-names></name> <name><surname>Bieker-Walz</surname> <given-names>L.</given-names></name> <name><surname>Erdmann</surname> <given-names>J.</given-names></name> <name><surname>Fl&#x000F6;tter&#x000F6;d</surname> <given-names>Y.-P.</given-names></name></person-group> (<year>2018</year>) <article-title>Microscopic traffic simulation using SUMO,</article-title> in <source>2018 21st International Conference on Intelligent Transportation Systems (ITSC)</source> (<publisher-loc>Maui, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2575</fpage>&#x02013;<lpage>2582</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mannion</surname> <given-names>P.</given-names></name> <name><surname>Duggan</surname> <given-names>J.</given-names></name> <name><surname>Howley</surname> <given-names>E.</given-names></name></person-group> (<year>2016</year>) <article-title>An experimental review of reinforcement learning algorithms for adaptive traffic signal control,</article-title> in <source>Autonomic Road Transport Support Systems</source>, eds <person-group person-group-type="editor"><name><surname>McCluskey</surname> <given-names>T. L.</given-names></name> <name><surname>Kotsialos</surname> <given-names>A.</given-names></name> <name><surname>M&#x000FC;ller</surname> <given-names>J. P.</given-names></name> <name><surname>Kl&#x000FC;gl</surname> <given-names>F.</given-names></name> <name><surname>Rana</surname> <given-names>O.</given-names></name> <name><surname>Schumann</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>47</fpage>&#x02013;<lpage>66</lpage>.</citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marini</surname> <given-names>S.</given-names></name> <name><surname>Buonanno</surname> <given-names>G.</given-names></name> <name><surname>Stabile</surname> <given-names>L.</given-names></name> <name><surname>Avino</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>) <article-title>A benchmark for numerical scheme validation of airborne particle exposure in street canyons.</article-title> <source>Environ. Sci. Pollut. Res.</source> <volume>22</volume>, <fpage>2051</fpage>&#x02013;<lpage>2063</lpage>. <pub-id pub-id-type="doi">10.1007/s11356-014-3491-6</pub-id><pub-id pub-id-type="pmid">25167823</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mnih</surname> <given-names>V.</given-names></name> <name><surname>Badia</surname> <given-names>A. P.</given-names></name> <name><surname>Mirza</surname> <given-names>M.</given-names></name> <name><surname>Graves</surname> <given-names>A.</given-names></name> <name><surname>Lillicrap</surname> <given-names>T. P.</given-names></name> <name><surname>Harley</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Asynchronous methods for deep reinforcement learning</article-title>. <source>CoRR</source>, abs/1602.01783.</citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nishi</surname> <given-names>T.</given-names></name> <name><surname>Otaki</surname> <given-names>K.</given-names></name> <name><surname>Hayakawa</surname> <given-names>K.</given-names></name> <name><surname>Yoshimura</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>) <article-title>Traffic signal control based on reinforcement learning with graph convolutional neural nets,</article-title> in <source>2018 21st International Conference on Intelligent Transportation Systems (ITSC)</source> (<publisher-loc>Maui, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>877</fpage>&#x02013;<lpage>883</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rezzai</surname> <given-names>M.</given-names></name> <name><surname>Dachry</surname> <given-names>W.</given-names></name> <name><surname>Moutaouakkil</surname> <given-names>F.</given-names></name> <name><surname>Medromi</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>) <article-title>Design and realization of a new architecture based on multi-agent systems and reinforcement learning for traffic signal control,</article-title> in <source>2018 6th International Conference on Multimedia and Computing Systems (ICMCS)</source> (<publisher-loc>Rabat</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rizza</surname> <given-names>V.</given-names></name> <name><surname>Stabile</surname> <given-names>L.</given-names></name> <name><surname>Buonanno</surname> <given-names>G.</given-names></name> <name><surname>Morawska</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>) <article-title>Variability of airborne particle metrics in an urban area.</article-title> <source>Environ. Pollut.</source> <volume>220</volume>, <fpage>625</fpage>&#x02013;<lpage>635</lpage>. <pub-id pub-id-type="doi">10.1016/j.envpol.2016.10.013</pub-id><pub-id pub-id-type="pmid">27742438</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Saxe</surname> <given-names>A. M.</given-names></name> <name><surname>McClelland</surname> <given-names>J. L.</given-names></name> <name><surname>Ganguli</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,</article-title> in <source>2nd International Conference on Learning Representations</source>, ICLR 2014 (Banff, AB).</citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>1998</year>) <source>Reinforcement Learning: An Introduction</source> (Adaptive Computation and Machine Learning). (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>).</citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Teodorovi&#x00107;</surname> <given-names>D.</given-names></name></person-group> (<year>2008</year>) <article-title>Swarm intelligence systems for transportation engineering: principles and applications.</article-title> <source>Transp. Res. C Emerg. Technol.</source> <volume>16</volume>, <fpage>651</fpage>&#x02013;<lpage>667</lpage>. <pub-id pub-id-type="doi">10.1016/j.trc.2008.03.002</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Yang</surname> <given-names>X.</given-names></name> <name><surname>Liang</surname> <given-names>H.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name></person-group> <article-title>A review of the self-adaptive traffic signal control system based on future traffic environment.</article-title> <source>J. Adv. Transp.</source> (<year>2018</year>), <fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1155/2018/1096123</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>H.</given-names></name> <name><surname>Zheng</surname> <given-names>G.</given-names></name> <name><surname>Gayah</surname> <given-names>V.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>) <article-title>A survey on traffic signal control methods.</article-title> <source>arXiv</source>: 1904.08117v3.</citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><collab>WHO</collab></person-group>. (<year>2002</year>) <source>The top 10 causes of death.</source></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><collab>WHO</collab></person-group>. (<year>2021</year>) Ambient (outdoor) air pollution.</citation>
</ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wierstra</surname> <given-names>D.</given-names></name> <name><surname>F&#x000F6;rster</surname> <given-names>A.</given-names></name> <name><surname>Peters</surname> <given-names>J.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>2007</year>). <article-title>Solving deep memory POMDPs with recurrent policy gradients,</article-title> in <source>ICANN&#x02019;07</source> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Max-Planck-Gesellschaft, Springer</publisher-name>), <fpage>697</fpage>&#x02013;<lpage>706</lpage>.</citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yau</surname> <given-names>K.-L. A.</given-names></name> <name><surname>Qadir</surname> <given-names>J.</given-names></name> <name><surname>Khoo</surname> <given-names>H. L.</given-names></name> <name><surname>Ling</surname> <given-names>M. H.</given-names></name> <name><surname>Komisarczuk</surname> <given-names>P.</given-names></name></person-group> (<year>2017</year>) <article-title>A survey on reinforcement learning models and algorithms for traffic signal control.</article-title> <source>ACM Comput. Surveys</source> <volume>50</volume>, <fpage>1</fpage>&#x02013;<lpage>38</lpage>. <pub-id pub-id-type="doi">10.1145/3068287</pub-id></citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>As in Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>), to comply with the literature on the subject, in this work we call the environment feedback &#x0201C;reward&#x0201D; even though it is provided (and perceived) as a penalty.</p></fn>
<fn id="fn0002"><p><sup>2</sup>The complete list of the symbols used in the equations is reported in Fazzini et al. (<xref ref-type="bibr" rid="B8">2021</xref>).</p></fn>
<fn id="fn0003"><p><sup>3</sup>All the other pollutants we analyzed (namely CO<sub>2</sub>, CO, PM<sub>x</sub>, and HC) exhibit similar behavior.</p></fn>
</fn-group>
</back>
</article>