<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Aerosp. Eng.</journal-id>
<journal-title>Frontiers in Aerospace Engineering</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Aerosp. Eng.</abbrev-journal-title>
<issn pub-type="epub">2813-2831</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1071793</article-id>
<article-id pub-id-type="doi">10.3389/fpace.2022.1071793</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Aerospace Engineering</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Exploring online and offline explainability in deep reinforcement learning for aircraft separation assurance</article-title>
<alt-title alt-title-type="left-running-head">Guo et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fpace.2022.1071793">10.3389/fpace.2022.1071793</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Guo</surname>
<given-names>Wei</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="fn" rid="fn1">
<sup>&#x2020;</sup>
</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zhou</surname>
<given-names>Yifei</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<xref ref-type="fn" rid="fn1">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1927561/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Wei</surname>
<given-names>Peng</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Department of Computer Science</institution>, <institution>George Washington University</institution>, <addr-line>Washington</addr-line>, <addr-line>DC</addr-line>, <country>United States</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Department of Mechanical and Aerospace Engineering</institution>, <institution>George Washington University</institution>, <addr-line>Washington</addr-line>, <addr-line>DC</addr-line>, <country>United States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1635108/overview">Li Weigang</ext-link>, University of Brasilia, Brazil</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1405057/overview">Donald Sofge</ext-link>, United States Naval Research Laboratory, United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1639073/overview">Jean-Loup Farges</ext-link>, Office National d&#x2019;&#xc9;tudes et de Recherches A&#xe9;rospatiales, Palaiseau, France</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Yifei Zhou, <email>yzhou87@gwu.edu</email>
</corresp>
<fn fn-type="equal" id="fn1">
<label>
<sup>&#x2020;</sup>
</label>
<p>These authors have contributed equally to this work</p>
</fn>
<fn fn-type="other">
<p>This article was submitted to Intelligent Aerospace Systems, a section of the journal Frontiers in Aerospace Engineering</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>13</day>
<month>12</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>1</volume>
<elocation-id>1071793</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>10</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>24</day>
<month>11</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Guo, Zhou and Wei.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Guo, Zhou and Wei</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Deep Reinforcement Learning (DRL) has demonstrated promising performance in maintaining safe separation among aircraft. In this work, we focus on a specific engineering application of aircraft separation assurance in structured airspace with high-density air traffic. Despite its scalable performance, the non-transparent decision-making process of DRL hinders human users from building trust in such a learning-based decision-making tool. In order to build a trustworthy DRL-based aircraft separation assurance system, we propose a novel framework to provide stepwise explanations of DRL policies for human users. Based on the different needs of human users, our framework integrates 1) a Soft Decision Tree (SDT) as an online explanation provider to display critical information for human operators in real time; and 2) a saliency method, Linearly Estimated Gradient (LEG), as an offline explanation tool for certification agencies to conduct more comprehensive verification-time or post-event analyses. Corresponding visualization methods are proposed to illustrate the information in the SDT and LEG efficiently: 1) Online explanations are visualized with tree plots and trajectory plots; 2) Offline explanations are visualized with saliency maps and position maps. In the BlueSky air traffic simulator, we evaluate the effectiveness of our framework on case studies with complex airspace route structures. Results show that the proposed framework can provide reasonable explanations of multi-agent sequential decision-making. In addition, for more predictable and trustworthy DRL models, we investigate two specific patterns that DRL policies follow based on similar aircraft locations in the airspace.</p>
</abstract>
<kwd-group>
<kwd>multi-agent deep reinforcement learning</kwd>
<kwd>soft decision tree</kwd>
<kwd>saliency method</kwd>
<kwd>safety-critical system</kwd>
<kwd>aircraft separation assurance</kwd>
<kwd>explainable AI</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Many real-world decision-making and control tasks face challenges from dynamic environments and complex state spaces. In addition, multi-agent environments often involve competition or collaboration among agents. These practical factors make it challenging to design planning and control algorithms. Aircraft separation assurance is one example of these real-world control tasks. It aims to maintain safe separation distances among all aircraft in a given airspace region and guarantee that every aircraft exits the airspace region without conflict. In this work, we focus on aircraft separation assurance provided through speed advisories.</p>
<p>In recent years, Deep Reinforcement Learning (DRL) has been explored for aircraft separation assurance. Our group has addressed this problem in structured airspace with DRL (<xref ref-type="bibr" rid="B2">Brittain and Wei, 2019</xref>; <xref ref-type="bibr" rid="B3">Brittain and Wei, 2021</xref>; <xref ref-type="bibr" rid="B4">Brittain et al., 2021</xref>; <xref ref-type="bibr" rid="B12">Guo et al., 2021</xref>). Different DRL approaches have been applied to the aircraft separation assurance task. The Deep Q-network (<xref ref-type="bibr" rid="B25">Mnih et al., 2013</xref>; <xref ref-type="bibr" rid="B34">Van Hasselt et al., 2016</xref>) is a popular solution in this field because of its generalization ability (<xref ref-type="bibr" rid="B38">Wulfe, 2017</xref>; <xref ref-type="bibr" rid="B36">Wang et al., 2019</xref>; <xref ref-type="bibr" rid="B27">Ribeiro et al., 2020</xref>; <xref ref-type="bibr" rid="B17">Isufaj et al., 2021</xref>). Proximal Policy Optimization (PPO) (<xref ref-type="bibr" rid="B28">Schulman et al., 2017</xref>) is also widely used because of its stable and strong performance (<xref ref-type="bibr" rid="B2">Brittain and Wei, 2019</xref>; <xref ref-type="bibr" rid="B3">Brittain and Wei, 2021</xref>; <xref ref-type="bibr" rid="B4">Brittain et al., 2021</xref>; <xref ref-type="bibr" rid="B10">Ghosh et al., 2021</xref>; <xref ref-type="bibr" rid="B12">Guo et al., 2021</xref>). Although DRL models perform well in collision avoidance problems, they make decisions in a non-transparent way, which makes it hard to build trust in such safety-critical systems.</p>
<p>The non-transparency issue can be addressed by providing explanations regarding the model decisions and recommendations. For example, displaying the key factors, such as position and airspeed, that lead to a speed change advisory will help human users understand DRL policies better. In our work, users of DRL-based aircraft separation assurance systems refer to human operators such as pilots and air traffic controllers, and certification agencies such as the Federal Aviation Administration. Apart from the critical information for the decision at each time step, analysts from certification agencies also need to reason about whether the speed advisories follow any patterns given specific aircraft locations and speeds in the airspace. If such patterns are identified, DRL policies will become more predictable to users, resulting in a more trustworthy aircraft separation assurance system for human operators.</p>
<p>In order to aid human users in capturing the key factors in the decisions of DRL models, it is important to quantify the influence of input states on output policies. One popular approach is to build Soft Decision Trees (SDTs) (<xref ref-type="bibr" rid="B9">Frosst and Hinton, 2017</xref>) by distilling the knowledge from DRL models into shallow, more explainable trees. The distilled SDT works as a surrogate model. Its decision path and feature weights can be leveraged to present the critical state information behind a recommended decision. Another promising branch of methods is saliency methods (<xref ref-type="bibr" rid="B31">Simonyan et al., 2013</xref>; <xref ref-type="bibr" rid="B29">Selvaraju et al., 2017</xref>; <xref ref-type="bibr" rid="B33">Sundararajan et al., 2017</xref>). Instead of building a surrogate model, saliency methods typically leverage the neural network&#x2019;s gradients to compute saliency scores of input features. Higher scores imply more important features.</p>
<p>We propose a novel framework that integrates an SDT and a saliency method called linearly estimated gradient (LEG) (<xref ref-type="bibr" rid="B24">Luo et al., 2021</xref>), to provide stepwise explanations for human users, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. The SDT serves as an online explanation provider to display critical information for human operators in real time, while LEG is an offline explanation tool for certification agencies to conduct more comprehensive analyses. Specifically, the SDT transfers knowledge from a complex network representation to a tree-structured representation with clear decision paths. We visualize the transferred knowledge with 1) a tree plot showing feature importance and the decision path and 2) a trajectory plot highlighting critical state information for a recommended decision. In addition, LEG computes saliency scores of the DRL model&#x2019;s input features. We visualize these scores with a saliency map. Offline explanations are provided by combining the saliency map with a position map showing the airspace route structure and all aircraft. Furthermore, we utilize LEG to explain two specific patterns that speed advisories follow. We refer to our framework as &#x201c;<bold>S</bold>tepwise <bold>E</bold>xplainable <bold>S</bold>eparation <bold>A</bold>ssurance <bold>ME</bold>thod&#x201d; (SESAME).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>The structure of the SESAME explanation framework. The framework observes the state vectors of the simulation environment and generates the decision path with the SDTs in the distillation module at each time step. The decision path is fed to the visualization module, which generates a tree plot and a trajectory plot to provide online explanations for human operators. LEG computes saliency scores of input features. The scores are fed to the visualization module, which generates saliency maps and position maps to provide offline explanations for certification agencies.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g001.tif"/>
</fig>
<p>Our contributions can be summarized as follows:<list list-type="simple">
<list-item>
<p>&#x2022; We extend the application of LEG from deep learning based computer vision models to DRL-based decision-making models.</p>
</list-item>
<list-item>
<p>&#x2022; To the best of our knowledge, our work is the first to build explainable DRL-based aircraft separation assurance systems.</p>
</list-item>
<list-item>
<p>&#x2022; Our proposed framework provides online explanations for human operators <italic>via</italic> an SDT and decision path visualization, and provides offline explanations for certification agencies <italic>via</italic> LEG and decision reasoning visualization.</p>
</list-item>
</list>
</p>
</sec>
<sec id="s2">
<title>2 Related works</title>
<p>Many works on Explainable DRL have been proposed to provide model explanations. Representation learning is applied to generating compact and explainable representations (<xref ref-type="bibr" rid="B20">Jonschkowski and Brock, 2015</xref>; <xref ref-type="bibr" rid="B19">Jarrett et al., 2021</xref>). Human-understandable rules from DRL models are extracted by logic rule methods to provide behavior explanations (<xref ref-type="bibr" rid="B35">Verma et al., 2018</xref>). Neural language models are trained to generate explanations for agent behaviors in text form (<xref ref-type="bibr" rid="B5">Cideron et al., 2019</xref>; <xref ref-type="bibr" rid="B22">Kim et al., 2020</xref>).</p>
<p>SDT is a popular model for behavior explanations, which integrates the traditional hard decision trees (<xref ref-type="bibr" rid="B26">Quinlan, 1986</xref>) and perceptrons. SDTs can distill knowledge from complex DRL models and work as surrogate models. The tree structure in SDT provides a decision path and detailed feature weights, which can then be used to provide explanations. The first SDT model was implemented to solve an image classification problem (<xref ref-type="bibr" rid="B9">Frosst and Hinton, 2017</xref>). To build SDT models for DRL models, state-action pairs are first generated by DRL models and then used to train SDT models in a supervised-learning paradigm. The knowledge from a PPO network was distilled to explain the behaviors of agents playing the Super Mario game (<xref ref-type="bibr" rid="B21">Karakovskiy and Togelius, 2012</xref>). To further improve the performance of DRL-based SDTs, a linear model U-tree is proposed by approximating Q functions (<xref ref-type="bibr" rid="B23">Liu et al., 2018</xref>). Univariate nodes are introduced to discretize SDTs for better interpretability (<xref ref-type="bibr" rid="B30">Silva et al., 2020</xref>). A feature learning tree is integrated with the standard SDT to improve model expressivity (<xref ref-type="bibr" rid="B8">Ding et al., 2020</xref>). A series of novel metrics are proposed for a comprehensive evaluation (<xref ref-type="bibr" rid="B7">Dahlin et al., 2020</xref>).</p>
<p>Another promising branch of explanation methods is saliency methods. They typically leverage the neural network&#x2019;s gradients to compute saliency scores of input features. Features with higher scores are more relevant to the network prediction. Saliency methods are widely used to determine the regions in an image that contribute the most to the classification result (<xref ref-type="bibr" rid="B31">Simonyan et al., 2013</xref>; <xref ref-type="bibr" rid="B40">Zeiler and Fergus, 2014</xref>; <xref ref-type="bibr" rid="B29">Selvaraju et al., 2017</xref>; <xref ref-type="bibr" rid="B32">Smilkov et al., 2017</xref>; <xref ref-type="bibr" rid="B33">Sundararajan et al., 2017</xref>; <xref ref-type="bibr" rid="B1">Ancona et al., 2018</xref>). Similarly, they are used to capture the key factors that lead to the decisions of DRL agents. Jacobian saliency is visualized on the images of Atari games to highlight the most important pixels (<xref ref-type="bibr" rid="B37">Wang et al., 2016</xref>; <xref ref-type="bibr" rid="B39">Zahavy et al., 2016</xref>). A perturbation method is proposed to generate saliency maps that are more human-interpretable (<xref ref-type="bibr" rid="B11">Greydanus et al., 2018</xref>). Object saliency maps are explored to explain DRL-based object recognition (<xref ref-type="bibr" rid="B18">Iyer et al., 2018</xref>). Specificity and relevance are introduced as two criteria for more accurate saliency maps (<xref ref-type="bibr" rid="B14">Gupta et al., 2020</xref>). In this work, we integrate a saliency method, LEG (<xref ref-type="bibr" rid="B24">Luo et al., 2021</xref>), into our framework as an offline explanation tool for certification agencies.</p>
<p>Our work differs from previous works in the following ways: 1) Our framework provides both online explanations with SDTs and offline explanations with saliency methods. In contrast, previous papers focus on only one specific method. 2) Our work concentrates on the aircraft separation assurance task with a complex high-dimensional input space. In contrast, previous works mainly focus on tasks with simple low-dimensional input [e.g., CartPole (<xref ref-type="bibr" rid="B8">Ding et al., 2020</xref>), LunarLander (<xref ref-type="bibr" rid="B30">Silva et al., 2020</xref>)] or tasks with pixel-based input [e.g., Mario AI Benchmark (<xref ref-type="bibr" rid="B6">Coppens et al., 2019</xref>), Wildfire Tracking (<xref ref-type="bibr" rid="B15">Haksar and Schwager, 2018</xref>)]. 3) Our work focuses on a complex multi-agent problem, so the proposed methods need to consider the cooperation among all agents in the system. In contrast, previous works mainly explain tasks with only a single agent. 4) Our work explores the general patterns that DRL policies follow. In contrast, previous works on explaining DRL policies with saliency methods mainly focus on extracting important information from a particular state.</p>
</sec>
<sec id="s3">
<title>3 Background</title>
<sec id="s3-1">
<title>3.1 Markov decision process</title>
<p>Sequential decision-making problems can be formalized as Markov Decision Processes (MDPs). The key Markov assumption is that the transition between two states depends only on the current state and action, not on the history. An MDP can then be modeled as the tuple <inline-formula id="inf1">
<mml:math id="m1">
<mml:mo>&#x3c;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo>&#x3e;</mml:mo>
</mml:math>
</inline-formula>. The agent chooses an action <italic>a</italic> &#x2208; <italic>A</italic> in state <italic>s</italic> &#x2208; <italic>S</italic> and receives a reward <italic>R</italic> (<italic>s</italic>, <italic>a</italic>). Given the transition probability <italic>T</italic> (<italic>s</italic>&#x2032;&#x7c;<italic>s</italic>, <italic>a</italic>), the current state <italic>s</italic> will transition to the next state <italic>s</italic>&#x2032;.</p>
<p>The agent interacts with the environment based on the policy <italic>&#x3c0;</italic>. The solution of an MDP is an optimal policy <italic>&#x3c0;</italic>&#x2a; that maximizes the expected utility, i.e., the expected sum of rewards collected while following the policy:<disp-formula id="equ1">
<mml:math id="m2">
<mml:msup>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mi mathvariant="normal">a</mml:mi>
<mml:mi mathvariant="normal">r</mml:mi>
<mml:mi mathvariant="normal">g</mml:mi>
<mml:mi mathvariant="normal">m</mml:mi>
<mml:mi mathvariant="normal">a</mml:mi>
<mml:mi mathvariant="normal">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi>E</mml:mi>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:munderover accentunder="false" accent="true">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
</disp-formula>
</p>
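<p>As a purely illustrative sketch of the objective above, the following Python snippet enumerates all deterministic stationary policies of a small toy MDP and selects the one with the largest summed reward. The toy states, actions, transition rule, reward, and horizon are assumptions introduced here for illustration and are not part of the separation assurance problem; because the toy transitions are deterministic, the expectation reduces to a single rollout.</p>
<preformat><![CDATA[
```python
from itertools import product

# Hypothetical toy MDP (not from the article): 2 states, 2 actions,
# deterministic transitions, finite horizon.
STATES = (0, 1)
ACTIONS = (0, 1)
HORIZON = 3  # rewards are summed over t = 0..HORIZON


def transition(state, action):
    """Deterministic T: action 1 toggles the state, action 0 keeps it."""
    return 1 - state if action == 1 else state


def reward(state, action):
    """r(s, a): 1 for choosing to stay while in state 1, otherwise 0."""
    return 1.0 if (state == 1 and action == 0) else 0.0


def policy_return(policy, start_state):
    """Sum of r(s_t, a_t) for t = 0..HORIZON when following `policy`."""
    state, total = start_state, 0.0
    for _ in range(HORIZON + 1):
        action = policy[state]
        total += reward(state, action)
        state = transition(state, action)
    return total


def best_policy(start_state):
    """Brute-force argmax over all deterministic stationary policies."""
    candidates = [
        dict(zip(STATES, acts))
        for acts in product(ACTIONS, repeat=len(STATES))
    ]
    return max(candidates, key=lambda pi: policy_return(pi, start_state))
```
]]></preformat>
<p>Exhaustive enumeration is only feasible for such tiny problems; it is shown here to make the argmax over policies concrete, not as a practical solution method.</p>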
</sec>
<sec id="s3-2">
<title>3.2 Deep reinforcement learning</title>
<p>DRL solves reinforcement learning problems by representing the policy <italic>&#x3c0;</italic> as a neural network. Policy-based DRL is a family of algorithms designed to learn stochastic policies. PPO is a policy-gradient DRL method (<xref ref-type="bibr" rid="B28">Schulman et al., 2017</xref>) that uses a neural network to approximate the policy and the value function. In this work, a PPO network is first trained and then used to generate offline transitions for the explainable models.</p>
</sec>
<sec id="s3-3">
<title>3.3 Soft decision tree</title>
<p>SDT is a classification model that integrates perceptrons with traditional hard decision trees. It has recently been explored in explainable classification problems. A single-layer neural network parameterized by weight <italic>w</italic>
<sub>
<italic>k</italic>
</sub> is built in each non-leaf node <italic>k</italic>. Given all possible classes <italic>c</italic>&#x2032; and the target class <italic>c</italic>, a fixed classification distribution parameterized by vector <italic>&#x3d5;</italic>
<sup>
<italic>l</italic>
</sup> is learned by each leaf node <italic>l</italic> as:<disp-formula id="equ2">
<mml:math id="m3">
<mml:msubsup>
<mml:mrow>
<mml:mi>Q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">exp</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2061;</mml:mo>
<mml:mi mathvariant="italic">exp</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:math>
</disp-formula>
</p>
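<p>The leaf distribution above is a softmax over the leaf parameters. A minimal Python sketch, assuming the learned vector <italic>&#x3d5;</italic><sup><italic>l</italic></sup> is available as a plain list of floats (the function name <monospace>leaf_distribution</monospace> is hypothetical), could look as follows:</p>
<preformat><![CDATA[
```python
import math


def leaf_distribution(phi):
    """Softmax over a leaf's parameter vector: Q_c = exp(phi_c) / sum_c' exp(phi_c')."""
    shift = max(phi)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - shift) for value in phi]
    total = sum(exps)
    return [e / total for e in exps]
```
]]></preformat>
<p>Subtracting the maximum before exponentiating does not change the result, since the common factor cancels between numerator and denominator.</p>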
<p>The <italic>traversal probability</italic> is defined as the probability of traversing from the parent node <italic>k</italic> to its left child node. Given input state <italic>s</italic>, weight vector <italic>w</italic>
<sub>
<italic>k</italic>
</sub>, the sigmoid function <italic>&#x3c3;</italic>, and inverse temperature parameter <italic>&#x3b2;</italic>, the traversal probability can be calculated as:<disp-formula id="equ3">
<mml:math id="m4">
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>A <italic>decision path</italic> is formed by the traversal from the root node to another node (e.g., a leaf node). The decision is made hierarchically in an SDT along the decision path. The <italic>path probability</italic> <italic>p</italic> is defined as the product of all traversal probabilities along the path from the root to the last node in this traversal. Given the input state <italic>s</italic>, the traversal probabilities are calculated at each non-leaf node along the decision path. Finally, one leaf node is reached and the classification distribution <italic>&#x3d5;</italic>
<sup>
<italic>l</italic>
</sup> is used to guide the output selection.</p>
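<p>The traversal and path probabilities can be sketched in a few lines of Python. The sketch assumes a full binary tree whose non-leaf weight vectors are stored in heap order and a decision path that greedily follows the more probable branch at each node; both the storage layout and the greedy rule are assumptions made here for illustration, and the function names are hypothetical.</p>
<preformat><![CDATA[
```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def traversal_probability(state, weights, beta=1.0):
    """p_k(s) = sigmoid(beta * (s . w_k)): probability of moving to the left child."""
    return sigmoid(beta * sum(s * w for s, w in zip(state, weights)))


def greedy_decision_path(state, node_weights, depth, beta=1.0):
    """Follow the more probable branch at each non-leaf node.

    node_weights holds one weight vector per non-leaf node in heap order,
    so the children of node k are 2k + 1 (left) and 2k + 2 (right).
    Returns the visited node indices and the product of branch probabilities.
    """
    node, path, path_prob = 0, [0], 1.0
    for _ in range(depth):
        p_left = traversal_probability(state, node_weights[node], beta)
        if p_left >= 0.5:
            node, path_prob = 2 * node + 1, path_prob * p_left
        else:
            node, path_prob = 2 * node + 2, path_prob * (1.0 - p_left)
        path.append(node)
    return path, path_prob
```
]]></preformat>
<p>For a tree of depth 2, <monospace>node_weights</monospace> would contain three vectors (nodes 0 to 2), and the returned path ends at one of the leaves indexed 3 to 6.</p>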
<p>The loss function of the SDT has two main parts: a cross-entropy loss for classification and a regularization loss that penalizes unequal use of non-leaf nodes. All non-leaf and leaf nodes are optimized during the training process.</p>
</sec>
<sec id="s3-4">
<title>3.4 Linearly estimated gradient</title>
<p>Linearly Estimated Gradient (LEG) (<xref ref-type="bibr" rid="B24">Luo et al., 2021</xref>) is a saliency method that recovers the gradient of a neural network by perturbing its input features. Features that contribute more to the network output will have gradients with larger magnitudes. In our setting, we treat the DRL model as a non-linear function whose policy output is <italic>&#x3c0;</italic>(<italic>a</italic>&#x7c;<italic>s</italic>), where <italic>s</italic> is the input state and <italic>a</italic> is the action. The perturbations are sampled from a continuous distribution <italic>F</italic>. The gradients to be recovered are denoted as <italic>g</italic>. If the current state is <italic>s</italic>
<sub>0</sub>, <italic>g</italic> can be written as:<disp-formula id="equ4">
<mml:math id="m5">
<mml:mi>g</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2202;</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>LEG approximates the function <italic>&#x3c0;</italic>(<italic>a</italic>&#x7c;<italic>s</italic>) based on the first-order Taylor series expansion around the point <italic>s</italic>
<sub>0</sub>:<disp-formula id="equ5">
<mml:math id="m6">
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2248;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>LEG is defined as:<disp-formula id="equ7">
<mml:math id="m8">
<mml:mi>&#x3b3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>F</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>arg</mml:mi>
<mml:munder>
<mml:mrow>
<mml:mi>min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x223c;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
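</disp-formula>That is, LEG is the vector <italic>g</italic> that minimizes the expected squared error of the linear approximation over perturbed states <italic>s</italic> sampled from <italic>F</italic> &#x2b; <italic>s</italic><sub>0</sub>. The distribution <italic>F</italic> is chosen based on the range of points to be considered.</p>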
<p>It can be proved that LEG has an analytical solution under certain conditions [Lemma 1 in <xref ref-type="bibr" rid="B24">Luo et al. (2021)</xref>]. Specifically, let <italic>Z</italic> be a vector of random variables which obeys a centered distribution <italic>F</italic>, i.e., <italic>Z</italic> &#x223c; <italic>F</italic> and <inline-formula id="inf2">
<mml:math id="m9">
<mml:mi mathvariant="double-struck">E</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:math>
</inline-formula>. Assume that the covariance matrix of <italic>Z</italic> exists and is positive definite. Denoting it as &#x3a3; &#x3d; cov(<italic>Z</italic>), the analytical solution to LEG is:<disp-formula id="equ9">
<mml:math id="m11">
<mml:mi>&#x3b3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>F</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="double-struck">E</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>Z</mml:mi>
<mml:mo>&#x223c;</mml:mo>
<mml:mi>F</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi>Z</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>Z</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>Therefore, LEG can be approximated by an empirical mean. We randomly sample <italic>s</italic>
<sub>1</sub>, &#x2026;, <italic>s</italic>
<sub>
<italic>n</italic>
</sub> from <italic>F</italic> &#x2b; <italic>s</italic>
<sub>0</sub>, and then compute <italic>y</italic>
<sub>1</sub>, &#x2026;, <italic>y</italic>
<sub>
<italic>n</italic>
</sub>, where <italic>y</italic>
<sub>
<italic>i</italic>
</sub> &#x3d; <italic>&#x3c0;</italic>(<italic>a</italic>&#x7c;<italic>s</italic>
<sub>
<italic>i</italic>
</sub>). If we denote the difference between the perturbed and original policy values as <inline-formula id="inf4">
<mml:math id="m13">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and <italic>Z</italic>
<sub>
<italic>i</italic>
</sub> &#x3d; <italic>s</italic>
<sub>
<italic>i</italic>
</sub>&#x2212;<italic>s</italic>
<sub>0</sub>, the empirical estimate of LEG is:<disp-formula id="e1">
<mml:math id="m15">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>F</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="true">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(1)</label>
</disp-formula>
</p>
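<p>As a rough illustration of the empirical estimate in Equation 1, the following Python sketch computes it with NumPy. The function name <italic>leg_estimate</italic>, the callable <italic>policy_prob</italic> (assumed to return <italic>&#x3c0;</italic>(<italic>a</italic>&#x7c;<italic>s</italic>) for one fixed action <italic>a</italic>), and the array layout are assumptions made for this example and are not taken from the original implementation.</p>
<preformat><![CDATA[
```python
import numpy as np


def leg_estimate(policy_prob, s0, perturbations, sigma):
    """Empirical LEG estimate: Sigma^{-1} (1/n) sum_i y_hat_i Z_i.

    policy_prob   -- hypothetical callable returning pi(a | s) for one fixed action a
    s0            -- state of interest, shape (d,)
    perturbations -- sampled Z_i = s_i - s0, shape (n, d)
    sigma         -- covariance matrix of the perturbation distribution F, shape (d, d)
    """
    s0 = np.asarray(s0, dtype=float)
    Z = np.asarray(perturbations, dtype=float)
    y0 = policy_prob(s0)
    # y_hat_i = pi(a | s_i) - pi(a | s0), with s_i = s0 + Z_i
    y_hat = np.array([policy_prob(s0 + z) - y0 for z in Z], dtype=float)
    # (1/n) * sum_i y_hat_i * Z_i
    moment = (y_hat[:, None] * Z).mean(axis=0)
    # Solve Sigma * gamma = moment instead of forming the inverse explicitly
    return np.linalg.solve(np.asarray(sigma, dtype=float), moment)
```
]]></preformat>
<p>For a policy that is exactly linear in the state, this estimate approaches the true slope as <italic>n</italic> grows, which offers a simple sanity check of an implementation.</p>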
</sec>
</sec>
<sec id="s4">
<title>4 Problem formulation</title>
<p>The goal of aircraft separation assurance is to maintain a safe distance between all aircraft in an airspace region, i.e., to prevent loss of separation. We formulate it as a decentralized decision-making process in which each aircraft needs to coordinate with the other aircraft. The framework proposed in this paper provides explanations of the behavior of every aircraft. BlueSky (<xref ref-type="bibr" rid="B16">Hoekstra and Ellerbroek, 2016</xref>) is used as the air traffic simulator.</p>
<p>To evaluate the performance of the proposed framework, we implement two challenging case studies with multiple intersections and high-density air traffic. Each case study is a dynamic simulation environment in which aircraft enter the sector stochastically. Consequently, the proposed explanation framework cannot simply memorize the correct behaviors; it has to capture strategies that change with the scenario. This problem setting further increases the difficulty of explaining agent behaviors.</p>
<sec id="s4-1">
<title>4.1 Multi-agent reinforcement learning</title>
<p>The aircraft separation assurance problem is formalized as a multi-agent reinforcement learning problem. Each aircraft is treated as an agent and needs to coordinate with all other aircraft. The same DRL model and explanation framework are deployed on each aircraft to maintain safe separation and provide explanations during execution.</p>
</sec>
<sec id="s4-2">
<title>4.2 Action space</title>
<p>Since the radar surveillance system updates the en route position every 12&#xa0;s, the agent selects one action every 12&#xa0;s. The action space is simplified to three available actions: deceleration <italic>a</italic>
<sub>&#x2212;</sub>, acceleration <italic>a</italic>
<sub>&#x2b;</sub>, and maintaining the current speed <italic>a</italic>
<sub>0</sub>.</p>
</sec>
<sec id="s4-3">
<title>4.3 State space</title>
<p>The state space includes information about both the ownship and the intruders. Since this information needs to be shared among all aircraft, we propose communication and coordination functions. Specifically, the ownship state contains the following features: location, current speed, acceleration, route identifier, and distance to the sector exit. The intruder state contains additional information: the distance between the ownship and the intruder, the distance between the ownship and the intersection, and the distance between the intruder and the intersection.</p>
</sec>
<sec id="s4-4">
<title>4.4 Termination</title>
<p>In each simulation episode, aircraft are generated and fly along their routes. The episode terminates when every aircraft has either 1) exited the sector without conflict or 2) collided with another aircraft.</p>
</sec>
<sec id="s4-5">
<title>4.5 Reward</title>
<p>The same reward function is used for all aircraft. In summary, it penalizes local collisions and speed changes of all aircraft agents. The collision reward <italic>R</italic>
<sub>
<italic>c</italic>
</sub> penalizes only the aircraft involved in a conflict. The speed-change reward <italic>R</italic>
<sub>
<italic>s</italic>
</sub> is implemented because speed changes should be avoided unless necessary in the real world. Specifically, the collision reward function <italic>R</italic>
<sub>
<italic>c</italic>
</sub> and the speed-change reward function <italic>R</italic>
<sub>
<italic>s</italic>
</sub> are defined as follows:<disp-formula id="e2">
<mml:math id="m16">
<mml:msub>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="{" close="">
<mml:mrow>
<mml:mtable class="cases">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mspace width="1em"/>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mtext>if</mml:mtext>
<mml:mspace width="0.22em"/>
<mml:msubsup>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3b1;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b4;</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mspace width="1em"/>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mtext>if</mml:mtext>
<mml:mspace width="0.22em"/>
<mml:mn>3</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>10</mml:mn>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mn>0</mml:mn>
<mml:mspace width="1em"/>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mtext>otherwise</mml:mtext>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(2)</label>
</disp-formula>
<disp-formula id="e3">
<mml:math id="m17">
<mml:msub>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mfenced open="{" close="">
<mml:mrow>
<mml:mtable class="cases">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mn>0</mml:mn>
<mml:mspace width="1em"/>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mtext>if</mml:mtext>
<mml:mspace width="0.22em"/>
<mml:mi>a</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mo>&#x2212;</mml:mo>
<mml:mi>&#x3c8;</mml:mi>
<mml:mspace width="1em"/>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mtext>otherwise</mml:mtext>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(3)</label>
</disp-formula>
</p>
<p>Here <inline-formula id="inf5">
<mml:math id="m18">
<mml:msubsup>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>o</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> is the distance between the intruder and the ownship, <italic>&#x3b1;</italic> and <italic>&#x3b4;</italic> are parameters that bound the range of the reward, and <italic>&#x3c8;</italic> is a penalty that discourages unnecessary speed changes.</p>
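<p>The piecewise definitions in Equations 2 and 3 translate directly into code. In the Python sketch below, the numerical values of <italic>&#x3b1;</italic>, <italic>&#x3b4;</italic>, and <italic>&#x3c8;</italic> and the action labels are placeholders chosen only for illustration; the values used in the experiments are not stated in this section.</p>
<preformat><![CDATA[
```python
def collision_reward(d, alpha=1.0, delta=0.1):
    """Collision reward R_c of Equation 2.

    d is the distance between the intruder and the ownship.
    alpha and delta are placeholder values (assumptions, not the paper's settings).
    """
    if d < 3:
        return -1.0
    if d < 10:
        return -alpha + delta * d
    return 0.0


def speed_change_reward(action, psi=0.001, hold_action="a0"):
    """Speed-change reward R_s of Equation 3.

    psi and the label "a0" for maintaining speed are placeholder choices.
    """
    return 0.0 if action == hold_action else -psi
```
]]></preformat>
<p>With these placeholder values the collision reward rises linearly from &#x2212;0.7 at a distance of 3 toward 0 at a distance of 10, so the total stays within [&#x2212;1, 0].</p>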
</sec>
</sec>
<sec id="s5">
<title>5 Proposed methods</title>
<p>Our objective is to provide explanations of the DRL agent behaviors in aircraft separation assurance. In order to achieve this goal, we propose an explanation framework with both online and offline modules to pair with the given DRL model.</p>
<sec id="s5-1">
<title>5.1 Distillation module</title>
<p>This module distills the knowledge from a DRL model into a shallow SDT. The DRL model controls the agent to perform the aircraft separation assurance task. It also generates transitions consisting of state-action pairs (<italic>s</italic>, <italic>a</italic>), where <italic>a</italic> is the action predicted by the DRL model for state <italic>s</italic>; these transitions are used to train the SDT model.</p>
<p>SDTs are fitted to these transitions using supervised learning. During the execution phase, the SDT generates a decision path for the input state <italic>s</italic>. Each state variable (e.g., velocity, acceleration) in the DRL model&#x2019;s state <italic>s</italic> is treated as a feature in the SDT. The feature weights and the decision path in the SDT are then used by the visualization module to support behavior explanations.</p>
<p>Because the feature scales may vary greatly, we add a Batch Normalization (BN) layer in front of the SDT to normalize the input. During training, the BN operator subtracts the mini-batch mean from the input and then divides the centered input by the mini-batch standard deviation, with both statistics computed per dimension. In the evaluation phase, the estimated population statistics are used for normalization instead. BN therefore puts all features on a comparable scale, which makes the learned feature weights easier to interpret.</p>
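<p>The normalization step can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the class name <italic>InputBatchNorm</italic>, the exponential running average used to estimate the population statistics, the momentum and epsilon values, and the omission of a learnable scale and shift are all choices made for this example rather than details given in the text.</p>
<preformat><![CDATA[
```python
import numpy as np


class InputBatchNorm:
    """Per-dimension input normalization in the spirit of a BN layer (illustrative)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        # Running estimates of the population statistics (assumed update scheme)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        x = np.asarray(x, dtype=float)  # shape (batch, num_features)
        if training:
            # Training: use the statistics of the current mini-batch
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Evaluation: use the estimated population statistics
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)
```
]]></preformat>
<p>After the training-mode call, features with very different ranges (for example, a speed and a distance) both have approximately zero mean and unit variance within the batch.</p>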
</sec>
<sec id="s5-2">
<title>5.2 Visualization of distillation module</title>
<p>The distillation module alone is not sufficient to help users understand the agent behaviors because the SDT model contains too much redundant information. Therefore, we implement a visualization module that presents the explanation information extracted from the distillation module to human users through a graphical interface. Specifically, the visualization module contains 1) a tree plot showing the feature weights of each node and the decision path in a tree-structured image and 2) a trajectory plot showing the trajectories of all aircraft in the structured airspace with precise explanations. A sample tree plot and trajectory plot are shown in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Sample tree plot and trajectory plot. <bold>(A)</bold> Sample tree plot, <bold>(B)</bold> Sample trajectory plot.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g002.tif"/>
</fig>
<sec id="s5-2-1">
<title>5.2.1 Tree plot</title>
<p>Each non-leaf node in the SDT processes all input features with a one-layer network, so its feature weights indicate how the features influence the decision at that node. The feature weights and output values of the nodes along the decision path therefore explain how the hierarchical decisions are made.</p>
<p>The tree plot illustrates the feature weights of all non-leaf SDT nodes in a tree-structured image. The feature weights of each node are visualized as a heatmap. The decision path is indicated by dense arrows connecting the nodes along the path.</p>
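<p>To make the notion of a decision path concrete, the following Python sketch routes one input through a small tree. It assumes the common soft decision tree formulation in which each non-leaf node applies a sigmoid to a linear function of the input and the path follows the more probable branch. The heap-style node indexing, the convention that the sigmoid output is the probability of the right child, and all names are assumptions of this example and may differ from the implementation used here.</p>
<preformat><![CDATA[
```python
import numpy as np


def decision_path(x, weights, biases, depth):
    """Follow the more probable branch at each non-leaf node (illustrative).

    weights -- one weight vector per non-leaf node, shape (2**depth - 1, d)
    biases  -- one bias per non-leaf node, shape (2**depth - 1,)
    Nodes are stored heap-style: the children of node i are 2*i + 1 and 2*i + 2.
    Returns the visited (node, p_right) pairs and the index of the reached leaf.
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    biases = np.asarray(biases, dtype=float)
    node, path = 0, []
    for _ in range(depth):
        # One-layer network followed by a sigmoid (assumed form of the node)
        p_right = 1.0 / (1.0 + np.exp(-(weights[node] @ x + biases[node])))
        path.append((node, float(p_right)))
        node = 2 * node + 2 if p_right >= 0.5 else 2 * node + 1
    leaf = node - (2 ** depth - 1)  # leaves are numbered from 0, left to right
    return path, leaf
```
]]></preformat>
<p>Under these assumptions a tree of depth <italic>m</italic> visits exactly <italic>m</italic> non-leaf nodes, and the weight vectors of those nodes are the ones highlighted along the decision path in the tree plot.</p>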
<p>The explanation information is projected into the tree plot following these rules:<list list-type="simple">
<list-item>
<p>&#x2022; Each feature weight is represented as a colored square in the heatmap.</p>
</list-item>
<list-item>
<p>&#x2022; Feature weights of the same aircraft are drawn in the same row of the heatmap.</p>
</list-item>
<list-item>
<p>&#x2022; The larger the absolute value of a weight, the larger its square and the deeper its color, which indicates that the feature is more influential on the decision in the current layer.</p>
</list-item>
<list-item>
<p>&#x2022; Red and blue represent positive and negative influences, respectively.</p>
</list-item>
</list>
</p>
</sec>
<sec id="s5-2-2">
<title>5.2.2 Trajectory plot</title>
<p>While the tree plot gives a comprehensive explanation of behaviors, the trajectory plot shows only the features that are most influential for decision-making, using both visual symbols and text. Its simple structure gives users the most important information needed to understand the agent behaviors.</p>
<p>There are three main components in a trajectory plot: 1) all aircraft flying along routes, 2) highlighted influential factors, and 3) text boxes showing action information and ownship behavior explanations. The following rules are applied to present vital information in the trajectory plot:<list list-type="simple">
<list-item>
<p>&#x2022; For each node on the decision path, the feature whose weight has the largest absolute value is selected as the important feature. For an SDT with depth <italic>m</italic>, there are therefore <italic>m</italic> important features, whose icons are emphasized in the trajectory plot.</p>
</list-item>
<list-item>
<p>&#x2022; Different symbols are used to emphasize different features. For example, a distance feature is drawn as a solid green line.</p>
</list-item>
</list>
</p>
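<p>The selection rule for important features described above can be sketched as follows. The function name, the argument layout, and the feature labels in the test data are hypothetical; the sketch only illustrates picking, for each node on the decision path, the feature whose weight has the largest absolute value.</p>
<preformat><![CDATA[
```python
import numpy as np


def important_features(path_weights, feature_names):
    """Return one (feature name, weight) pair per node on the decision path.

    path_weights  -- weight vectors of the nodes on the path, one per node
    feature_names -- hypothetical labels, one per feature, in weight order
    """
    selected = []
    for w in path_weights:
        w = np.asarray(w, dtype=float)
        idx = int(np.argmax(np.abs(w)))  # largest absolute weight
        selected.append((feature_names[idx], float(w[idx])))
    return selected
```
]]></preformat>
<p>Keeping the sign of the selected weight allows the plot to distinguish positive from negative influence in addition to naming the feature.</p>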
</sec>
</sec>
<sec id="s5-3">
<title>5.3 Saliency module</title>
<p>LEG works as the saliency module to recover the gradient of a neural network by perturbing its input features (in our case, input states). In this work, we utilize two state variables, location and speed, for the purpose of offline explanations. The location of an aircraft is represented by its longitude given the assigned route<xref ref-type="fn" rid="fn2">
<sup>1</sup>
</xref>, and we assume that latitude or longitude is monotonic all along the route. Therefore, the gradient of the DRL model is recovered by perturbing the longitude and speed of each aircraft. We reduce the input state space to save computational cost. Since all the distance features can be computed from location, we keep location as a high-level feature of each aircraft. We also keep speed, which carries the dynamics information of the aircraft. In contrast, the route identifier of each aircraft is a fixed value, and acceleration only reflects the decision at the previous time step and has a fixed absolute value. The route identifier and acceleration are therefore excluded from the reduced state space.</p>
<p>As mentioned above, the gradient of the DRL model is recovered by perturbing the longitude and speed of each aircraft, so we need to determine the perturbation structure, namely <italic>F</italic> in the definition of LEG. According to <xref ref-type="bibr" rid="B2">Brittain and Wei (2019)</xref> and <xref ref-type="bibr" rid="B4">Brittain et al. (2021)</xref>, intruders of an ownship are selected based on intersections and conflicting routes. If the ownship or any intruders pass an intersection after perturbation, the dimension of the ownship&#x2019;s state space may change. To avoid this change, we define the following rules of perturbation:<list list-type="simple">
<list-item>
<p>&#x2022; The ownship must not pass an intersection after perturbation.</p>
</list-item>
<list-item>
<p>&#x2022; Intruders on conflicting routes must not pass any intersection after perturbation.</p>
</list-item>
<list-item>
<p>&#x2022; Aircraft must stay on their routes.</p>
</list-item>
</list>Considering the first two rules, we need the values of state features to be bounded, so we independently sample perturbations of each feature from uniform distributions. Since aircraft must stay on their routes, we are then able to compute latitudes from the perturbed longitudes.</p>
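<p>A minimal sketch of this sampling scheme is given below. It assumes that the caller has already chosen, for every perturbed feature, a half-width small enough that no aircraft passes an intersection, and that each route is available as a list of waypoints with increasing longitude. Both assumptions, as well as all function and parameter names, are illustrative and not taken from the original implementation.</p>
<preformat><![CDATA[
```python
import numpy as np


def sample_perturbations(half_widths, n, rng):
    """Draw n independent, centered, bounded perturbations (illustrative).

    half_widths -- hypothetical per-feature bounds h_j; feature j is drawn
                   from the uniform distribution on [-h_j, h_j)
    Returns the samples Z with shape (n, d) and the covariance matrix of the
    sampling distribution, which is diagonal with entries h_j**2 / 3.
    """
    h = np.asarray(half_widths, dtype=float)
    Z = rng.uniform(-h, h, size=(n, h.size))
    sigma = np.diag(h ** 2 / 3.0)
    return Z, sigma


def latitude_on_route(lon, route_lon, route_lat):
    """Latitude of the point on the route at a given longitude.

    Assumes a piecewise-linear route whose waypoint longitudes increase.
    """
    return np.interp(lon, route_lon, route_lat)
```
]]></preformat>
<p>Because the perturbations are symmetric around zero, the sampling distribution is centered, and its covariance matrix is known in closed form rather than estimated from the samples.</p>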
</sec>
<sec id="s5-4">
<title>5.4 Visualization of saliency module</title>
<p>We visualize LEG with saliency maps that show the importance of locations and speeds to DRL policies. Additionally, position maps showing airspace and aircraft are used to pair with saliency maps. A sample saliency map and position map are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Sample saliency map and position map. <bold>(A)</bold> Sample saliency map, <bold>(B)</bold> Sample position map.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g003.tif"/>
</fig>
<sec id="s5-4-1">
<title>5.4.1 Saliency map</title>
<p>A saliency map consists of three heatmaps that illustrate the importance of locations and speeds to the three actions: acceleration, deceleration, and maintaining the current speed, respectively. Each heatmap follows similar rules to tree plots:<list list-type="simple">
<list-item>
<p>&#x2022; The saliency score of each feature is displayed in a colored square.</p>
</list-item>
<list-item>
<p>&#x2022; Saliency scores of the same aircraft are drawn in the same row.</p>
</list-item>
<list-item>
<p>&#x2022; Normalized speed values are displayed alongside corresponding saliency scores.</p>
</list-item>
<list-item>
<p>&#x2022; Higher absolute values of saliency scores imply more important features.</p>
</list-item>
<list-item>
<p>&#x2022; The red and blue colors represent positive and negative influences respectively.</p>
</list-item>
</list>
</p>
</sec>
<sec id="s5-4-2">
<title>5.4.2 Position map</title>
<p>For online explanations, trajectory plots integrate the most influential features with the information on airspace and aircraft. Similarly, for offline explanations, we need a graphical interface showing airspace and aircraft to pair with the saliency map. We implement a position map for this purpose. While a trajectory plot shows one ownship and its intruders, a position map shows all aircraft in the current airspace to provide a more comprehensive view. Analysts may take different aircraft as the ownship based on their needs.</p>
<p>Position maps follow these rules:<list list-type="simple">
<list-item>
<p>&#x2022; Each aircraft is represented as a triangle.</p>
</list-item>
<list-item>
<p>&#x2022; Each aircraft is indexed by order of entering the airspace.</p>
</list-item>
<list-item>
<p>&#x2022; Green represents the ownship that the current analysis focuses on, red represents intruders, and blue represents the other aircraft in the airspace.</p>
</list-item>
</list>
</p>
</sec>
</sec>
<sec id="s5-5">
<title>5.5 Integration of modules</title>
<p>To provide explanations of agent behaviors for aircraft separation assurance, we integrate the distillation module, saliency module, and their visualization modules together. We illustrate the architecture of the integrated framework in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<p>At each time step, the SDTs in the distillation module execute one forward pass for the input state <italic>s</italic>. The resulting decision path <italic>p</italic> is passed to the visualization module. Based on the feature weights and the decision path, the visualization module draws tree plots and trajectory plots to provide online explanations of agent behaviors.</p>
<p>At the same time, LEG in the saliency module estimates the gradient <italic>&#x3b3;</italic> of the DRL model by perturbing the input state <italic>s</italic>. Based on <italic>&#x3b3;</italic>, the visualization module draws saliency maps and position maps to provide offline explanations of agent behaviors.</p>
</sec>
</sec>
<sec id="s6">
<title>6 Experiments</title>
<sec id="s6-1">
<title>6.1 Settings</title>
<p>In this work, the SESAME framework distills knowledge for aircraft separation assurance from two DRL models: D2MAV-A (<xref ref-type="bibr" rid="B4">Brittain et al., 2021</xref>) and D2MAV-NC (<xref ref-type="bibr" rid="B2">Brittain and Wei, 2019</xref>). This further increases the difficulty because SESAME has to generalize across different DRL models. The number of intruders in D2MAV-A is fixed to make the performance of the two models comparable. All SDTs in the SESAME framework are trained with transitions from the same number of episodes. No other validation phase is implemented.</p>
<p>We use BlueSky (<xref ref-type="bibr" rid="B16">Hoekstra and Ellerbroek, 2016</xref>) as the air traffic simulator. In BlueSky, we evaluate the SESAME framework on two challenging case studies, A and B, with multiple intersections and high-density air traffic. Each of them is a dynamic simulation environment where aircraft enter the airspace stochastically. <xref ref-type="fig" rid="F4">Figure 4</xref> shows these two case studies.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Case studies for evaluation in the BlueSky air traffic simulator. The green triangles, dotted lines, and solid lines represent the aircraft, flight routes, and sector boundaries, respectively. Symbols <italic>R</italic>
<sub>
<italic>i</italic>
</sub> and <italic>I</italic>
<sub>
<italic>j</italic>
</sub> stand for the <italic>i</italic>
<sub>
<italic>th</italic>
</sub> route and <italic>j</italic>
<sub>
<italic>th</italic>
</sub> intersection. The green numbers are the aircraft IDs. <bold>(A)</bold> Case study A, <bold>(B)</bold> Case study B.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g004.tif"/>
</fig>
</sec>
<sec id="s6-2">
<title>6.2 Fidelity of soft decision trees</title>
<p>Since the SDTs in the SESAME framework serve as surrogate models for the DRL models, one important property is how well the predictions of the SDT models match those of the original DRL models. Specifically, the <italic>fidelity score</italic> is defined as:<disp-formula id="equ11">
<mml:math id="m19">
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mn mathvariant="double-struck">1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">SDT</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>Y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">DRL</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mfenced open="&#x7c;" close="&#x7c;">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:math>
</disp-formula>
</p>
<p>Here <italic>S</italic> is the set of all states resulting from evaluated transitions. <italic>Y</italic>
<sub>
<italic>SDT</italic>
</sub> and <italic>Y</italic>
<sub>
<italic>DRL</italic>
</sub> are the actions output by the SDTs and the original DRL models, respectively. Therefore, a higher fidelity score means that the behaviors of the SDT and DRL models match more closely. The training batch size is 1,280, and transitions from 100 episodes are generated for evaluation.</p>
<p>The fidelity scores of different models for case A and case B are reported in <xref ref-type="table" rid="T1">Table 1</xref> and <xref ref-type="table" rid="T2">Table 2</xref>, respectively. SDTs trained with D2MAV-A and D2MAV-NC are named SDT-A and SDT-NC, respectively. SDTs with a BN layer in front of the tree carry the additional suffix -BN. For a comprehensive comparison, traditional hard decision trees are also trained as baseline models and named HDTs. The other baseline is a random policy, whose fidelity score is 33.33%.</p>
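<p>The fidelity score is the fraction of evaluated states on which the two models choose the same action. A minimal Python sketch, in which the function name and the integer encoding of actions are assumptions of this example:</p>
<preformat><![CDATA[
```python
import numpy as np


def fidelity_score(sdt_actions, drl_actions):
    """Fraction of states on which the SDT and the DRL model pick the same action.

    Both arguments list one action per evaluated state, in the same order.
    """
    sdt = np.asarray(sdt_actions)
    drl = np.asarray(drl_actions)
    if sdt.shape != drl.shape or sdt.size == 0:
        raise ValueError("need two non-empty action arrays of equal shape")
    return float(np.mean(sdt == drl))
```
]]></preformat>
<p>Multiplying the returned value by 100 gives a percentage on the same scale as the scores reported in the tables.</p>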
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Fidelity scores (%) in case A.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="6" align="center">Depth</th>
</tr>
<tr>
<th align="left">1</th>
<th align="left">2</th>
<th align="left">3</th>
<th align="left">4</th>
<th align="left">5</th>
<th align="left">6</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">SDT-A</td>
<td align="left">66.71</td>
<td align="left">73.74</td>
<td align="left">72.59</td>
<td align="left">71.22</td>
<td align="left">72.34</td>
<td align="left">72.00</td>
</tr>
<tr>
<td align="left">SDT-A-BN</td>
<td align="left">68.49</td>
<td align="left">76.34</td>
<td align="left">78.55</td>
<td align="left">78.35</td>
<td align="left">76.71</td>
<td align="left">78.17</td>
</tr>
<tr>
<td align="left">HDT-A</td>
<td align="left">69.34</td>
<td align="left">67.67</td>
<td align="left">74.33</td>
<td align="left">75.53</td>
<td align="left">73.05</td>
<td align="left">76.42</td>
</tr>
<tr>
<td align="left">SDT-NC</td>
<td align="left">67.77</td>
<td align="left">87.50</td>
<td align="left">89.47</td>
<td align="left">90.91</td>
<td align="left">90.96</td>
<td align="left">92.00</td>
</tr>
<tr>
<td align="left">SDT-NC-BN</td>
<td align="left">46.61</td>
<td align="left">87.76</td>
<td align="left">90.75</td>
<td align="left">94.15</td>
<td align="left">95.10</td>
<td align="left">95.80</td>
</tr>
<tr>
<td align="left">HDT-NC</td>
<td align="left">42.87</td>
<td align="left">72.64</td>
<td align="left">76.58</td>
<td align="left">82.55</td>
<td align="left">85.81</td>
<td align="left">85.83</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Fidelity scores (%) in case B.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Model</th>
<th colspan="6" align="center">Depth</th>
</tr>
<tr>
<th align="left">1</th>
<th align="left">2</th>
<th align="left">3</th>
<th align="left">4</th>
<th align="left">5</th>
<th align="left">6</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">SDT-A</td>
<td align="left">58.26</td>
<td align="left">71.71</td>
<td align="left">74.42</td>
<td align="left">76.73</td>
<td align="left">77.72</td>
<td align="left">78.86</td>
</tr>
<tr>
<td align="left">SDT-A-BN</td>
<td align="left">57.50</td>
<td align="left">71.64</td>
<td align="left">78.96</td>
<td align="left">82.63</td>
<td align="left">84.66</td>
<td align="left">88.18</td>
</tr>
<tr>
<td align="left">HDT-A</td>
<td align="left">44.69</td>
<td align="left">48.43</td>
<td align="left">59.30</td>
<td align="left">68.63</td>
<td align="left">71.15</td>
<td align="left">71.30</td>
</tr>
<tr>
<td align="left">SDT-NC</td>
<td align="left">88.92</td>
<td align="left">91.72</td>
<td align="left">92.99</td>
<td align="left">93.37</td>
<td align="left">93.53</td>
<td align="left">94.01</td>
</tr>
<tr>
<td align="left">SDT-NC-BN</td>
<td align="left">88.93</td>
<td align="left">91.72</td>
<td align="left">94.72</td>
<td align="left">96.07</td>
<td align="left">97.26</td>
<td align="left">97.67</td>
</tr>
<tr>
<td align="left">HDT-NC</td>
<td align="left">71.44</td>
<td align="left">88.96</td>
<td align="left">88.96</td>
<td align="left">90.70</td>
<td align="left">90.50</td>
<td align="left">91.66</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Comparing the fidelity scores in the same column, we see that, for a given model depth, the SDTs achieve higher fidelity scores than the baseline models in almost all cases. The proposed SDT models thus outperform both the hard decision trees and the random policy baseline, except in some cases based on D2MAV-A, and their outputs match those of the DRL models. Comparing the results in the same row, we notice that a tree with more layers does not always achieve higher fidelity. A possible reason is that deeper models have more parameters to learn and the training process does not fit all of them well. We also notice that the SESAME framework is not tied to a specific DRL model because the SDTs reach high fidelity scores for both D2MAV-A and D2MAV-NC. Comparing SDTs trained with the same DRL model, we find that SDTs with a BN layer have higher scores in most settings, with exceptions at shallow depths (e.g., SDT-NC-BN at depth 1 in case A), which indicates that batch normalization helps improve the fidelity of SDTs. Fidelity has also been evaluated on other case studies in our previous paper (<xref ref-type="bibr" rid="B13">Guo and Wei, 2022</xref>), and those results are consistent with the analysis in this paper.</p>
</sec>
<sec id="s6-3">
<title>6.3 Tree plots for explanation</title>
<p>In this subsection, we demonstrate how the tree plot can be used to explain the agent behaviors: the feature weights of the non-leaf nodes along the decision path show how each feature influences the agent decisions. The tree plot for one transition in case A with model SDT-NC-BN is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Tree plot using model SDT-NC-BN with depth 3 for case A in one step. The same state is used as input in all non-leaf nodes. The heatmaps are drawn based on the feature weights of all non-leaf nodes. Specifically, each feature is represented as a cell in the heatmap. The information of ownship <italic>O</italic> is in the first row. The information of five intruders <italic>I</italic>
<sub>1</sub>, &#x2026;, <italic>I</italic>
<sub>5</sub> is in the second to the sixth rows. Each column shows the values of the same feature for all six aircraft: distance to goal <italic>dg</italic>, current speed <italic>v</italic>, route identifier <italic>r</italic>, current acceleration <italic>ac</italic>, distance between ownship and the intruder <italic>da</italic>, distance between ownship and the intersection <italic>do</italic>, and distance between intruder and intersection <italic>di</italic>. The dense orange arrows represent the decision path. The leaf nodes show the predicted actions. <italic>Hold</italic> stands for maintaining the current speed. <italic>Acc</italic> stands for acceleration. <italic>Dec</italic> stands for deceleration.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g005.tif"/>
</fig>
<p>Starting from the root node, we find that the node concentrates on the feature at cell (1, 1), which is the distance from ownship to the destination. This makes sense because the ultimate goal of the aircraft is to reach the destination and it should focus on the goal. When this feature value becomes larger, the SDT tends to traverse to its left child node.</p>
<p>In the second layer, we find that the left node focuses on the distance from the third intruder to the destination. This shows that our SDT models have a comprehensive understanding of the case and do not concentrate only on the closest intruder. At the same time, the right node focuses on the distance between the intruders and the ownship. This shows that the child nodes can concentrate on different features given the output of the parent node.</p>
<p>Along the decision path, the second non-leaf node on the third layer focuses on the distance between the closest intruder and the ownship. The distance between the intruders and the ownship is of great importance because a collision is very likely if the distance is small. Among all intruders, the closest intruder is the most urgent. Finally, the deceleration action is selected.</p>
</sec>
<sec id="s6-4">
<title>6.4 Trajectory plots for explanation</title>
<p>We demonstrate how precise behavior explanations can be provided by trajectory plots in this subsection. Here a trajectory plot is drawn based on the SDT-NC-BN model for case A in <xref ref-type="fig" rid="F6">Figure 6</xref>. <xref ref-type="fig" rid="F5">Figures 5</xref>, <xref ref-type="fig" rid="F6">6</xref> show the explanations for the same state.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Trajectory plot using model SDT-NC-BN with depth 3 for case A. The light green ownship decelerates to avoid the potential collision at the near intersection. Three important features are emphasized in this case: 1) The distance from ownship <italic>O</italic> to goal, 2) the distance from the third intruder <italic>I</italic>
<sub>3</sub> to goal, and 3) the distance between ownship <italic>O</italic> and the closest intruder <italic>I</italic>
<sub>1</sub>. The distance information is highlighted as orange dense lines. The explanations and the advisory speed are also provided in text.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g006.tif"/>
</fig>
<p>In this case, the ownship near the intersection decides to decelerate to avoid a potential collision at the nearby intersection. The important features are defined as the features with the largest absolute weight in each node along the decision path. Here the important features are 1) the distance from ownship to goal, 2) the distance from the third intruder to goal, and 3) the distance between ownship and the closest intruder. The importance of these features was discussed in the previous subsection. Solid green lines are used to emphasize the distance information.</p>
<p>Compared with the tree plot, the trajectory plot shows only the most influential factors that explain the agent behaviors. By integrating the tree plot and the trajectory plot, our proposed SESAME framework can provide precise explanations in the trajectory plot and supplemental details in the tree plot.</p>
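<p>The selection of important features along the decision path can be illustrated with a short Python sketch. It assumes that each non-leaf node applies a sigmoid gate to a linear function of the state and that the tree follows the more likely branch at every layer; the rule of going left when the gate is at least 0.5, the heap-style node indexing, the function name trace_decision_path and the toy weights are all hypothetical choices made for illustration and are not the trained SDT-NC-BN model.</p>
<preformat>
```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def trace_decision_path(state, weights, biases, depth=3):
    """Follow the more likely branch at every non-leaf node and record the
    index of the feature with the largest absolute weight in each node.

    Nodes are stored in heap order: the children of node i are 2*i + 1
    (left) and 2*i + 2 (right). Going left when the gate is at least 0.5
    is an assumed convention for this sketch.
    """
    node = 0
    visited, important = [], []
    for _ in range(depth):
        visited.append(node)
        important.append(int(np.argmax(np.abs(weights[node]))))
        gate = sigmoid(weights[node] @ state + biases[node])
        go_left = bool(np.greater_equal(gate, 0.5))
        node = 2 * node + 1 if go_left else 2 * node + 2
    n_inner = 2 ** depth - 1
    return visited, important, node - n_inner  # last value: leaf index


# Hypothetical toy tree: 7 non-leaf nodes, 4 state features.
weights = np.zeros((7, 4))
biases = np.zeros(7)
weights[0] = [2.0, 0.0, 0.0, 0.0]
biases[0] = -1.0
weights[1] = [0.0, 0.0, -3.0, 0.5]
weights[4] = [0.0, 1.5, 0.0, 0.2]

state = np.array([0.9, 0.1, 0.5, 0.2])
visited, important, leaf = trace_decision_path(state, weights, biases)
print(visited, important, leaf)
```
</preformat>
<p>With a depth of 3, the sketch returns one important feature per layer, which mirrors the three features highlighted in the trajectory plot.</p>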
</sec>
<sec id="s6-5">
<title>6.5 Saliency map and position map for explanation</title>
<p>While human operators need real-time and precise explanations, certification agencies are responsible for conducting more comprehensive analyses. In this subsection, we demonstrate how saliency maps and position maps can be used to provide offline explanations. The position map and saliency map for a transition in case B with model D2MAV-A are shown in <xref ref-type="fig" rid="F7">Figures 7</xref>, <xref ref-type="fig" rid="F8">8</xref>, respectively. They show the explanations for the same state. In this subsection and the following subsection, <italic>Hold</italic> stands for maintaining the current speed, <italic>Acc</italic> stands for acceleration, and <italic>Dec</italic> stands for deceleration. The sample size used for computing LEG is 1,000, namely <italic>n</italic> &#x3d; 1,000 in <xref ref-type="disp-formula" rid="e1">Eq. (1)</xref>.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Position map using model D2MAV-A for case B in one transition. There are 27 aircraft in the airspace, each with an index indicating the order of entering the airspace. In parentheses are the speed advisories provided by the D2MAV-A model: <italic>H</italic> stands for maintaining the current speed; <italic>A</italic> stands for acceleration; <italic>D</italic> stands for deceleration. We focus on aircraft 18.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g007.tif"/>
</fig>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>Saliency map using normalized LEG values for the decision of aircraft 18. The intruders are sorted by distance to the ownship in ascending order: 14, 22, 19, 24, 15, 27. Larger absolute values imply more important features, and the sign implies the direction of influence. In parentheses are min-max normalized speed values.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g008.tif"/>
</fig>
<p>First, we focus on the most important location (i.e., longitude) features<xref ref-type="fn" rid="fn3">
<sup>2</sup>
</xref>, namely the locations of aircraft 18 and aircraft 22. Aircraft 18&#x2019;s location has a negative influence on <italic>Acc</italic>. So if the longitude value of aircraft 18 decreases, which means that it moves closer to aircraft 22, the speed advisory for it will be more likely to be acceleration in order to maintain a safe distance. In contrast, aircraft 22&#x2019;s location has a positive influence on <italic>Acc</italic>. So if the longitude value of aircraft 22 increases, which also means that the distance between aircraft 22 and aircraft 18 becomes smaller, the speed advisory for aircraft 18 will be more likely to be acceleration. Therefore, we can conclude from these consistent observations that one major cause of the <italic>Hold</italic> decision is the safe distance between aircraft 18 and aircraft 22.</p>
<p>Second, we focus on aircraft 14, the closest intruder to aircraft 18. Although aircraft 14 is closest to the ownship, the absolute values of its saliency scores are relatively small. A possible reason is that the speed of the ownship is lower than that of all its intruders. In this circumstance, aircraft behind the ownship (aircraft 22) should be given more attention than those ahead (aircraft 14). We also notice that the speed advisory for aircraft 14 is acceleration, which will further assure safe separation. This may imply that the D2MAV-A model learns a certain cooperation scheme in order to make reasonable decisions.</p>
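<p>The role of the sample size <italic>n</italic> and of the sign of a saliency value can be illustrated with a generic perturbation-based linear estimate. The Python sketch below is not a reproduction of Eq. (1): it draws <italic>n</italic> Gaussian perturbations around a state, queries a stand-in scoring function, and fits the saliency vector by ordinary least squares. The function name estimate_linear_saliency, the noise scale and the toy linear score are hypothetical and serve only as an illustration.</p>
<preformat>
```python
import numpy as np


def estimate_linear_saliency(f, x0, n=1000, sigma=0.1, seed=0):
    """Least-squares fit of g in f(x0 + z) - f(x0) ~ z @ g over n random
    perturbations z. A generic sketch, not the exact estimator of Eq. (1)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma, size=(n, x0.size))    # random perturbations
    dy = np.array([f(x0 + zi) - f(x0) for zi in z])  # resulting output changes
    g, *_ = np.linalg.lstsq(z, dy, rcond=None)
    return g


# Stand-in for one action score of a policy (hypothetical linear model).
true_w = np.array([-0.8, 0.0, 0.5, 0.1])


def toy_score(x):
    return float(true_w @ x)


x0 = np.array([0.3, 0.6, 0.2, 0.9])
saliency = estimate_linear_saliency(toy_score, x0)
print(np.round(saliency, 3))
```
</preformat>
<p>For this linear stand-in, the fitted vector recovers the underlying weights. A negative entry means that increasing the corresponding feature lowers the score of the action, and a positive entry means that it raises the score, which is the reading of the sign used in the saliency maps.</p>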
</sec>
<sec id="s6-6">
<title>6.6 Decision patterns of deep reinforcement learning models</title>
<p>Apart from the critical state features for the decision at each time step, analysts from certification agencies also need to determine whether the speed advisories follow any patterns given specific aircraft locations in the airspace. In this subsection, we study two kinds of relative aircraft locations and identify similar decisions and their corresponding explanations.</p>
<p>First, we focus on three adjacent aircraft on the same route, with the ownship in the middle. In the transition shown in <xref ref-type="fig" rid="F7">Figure 7</xref>, aircraft 18, aircraft 14 and aircraft 22 are adjacent and all located on route <italic>R</italic>
<sub>1</sub>. We still take aircraft 18 as the ownship. According to the previous discussion, if the distance between aircraft 18 and aircraft 22 decreases, the speed advisory for aircraft 18 will be more likely to be acceleration. In contrast, we can conclude from <xref ref-type="fig" rid="F8">Figure 8</xref> that the speed advisory will be more likely to be deceleration if the distance between aircraft 18 and aircraft 14 decreases. Furthermore, the location of aircraft 14 has more influence on <italic>Dec</italic> than <italic>Acc</italic>, while the location of aircraft 22 has more influence on <italic>Acc</italic> than <italic>Dec</italic>. The difference shows that the D2MAV-A model is able to encode the order information and thus attribute different importance to the intruders behind and in front of the ownship.</p>
<p>The same conclusions can be drawn from aircraft 13 (ownship), aircraft 16 and aircraft 9. The saliency map shown in <xref ref-type="fig" rid="F9">Figure 9</xref> suggests that the speed advisory tends toward acceleration when the ownship becomes closer to the intruder behind it (aircraft 16) and toward deceleration when the ownship becomes closer to the intruder in front of it (aircraft 9). Additionally, the location of aircraft 16 has substantially more influence on <italic>Acc</italic> than <italic>Dec</italic>, while the location of aircraft 9 has substantially more influence on <italic>Dec</italic> than <italic>Acc</italic>. These consistent results provide insights into the decisions of aircraft that are on the same route.</p>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>Saliency map using normalized LEG values for the decision of aircraft 13. The intruders are sorted by distance to the ownship in ascending order: 16, 9. Larger absolute values imply more important features, and the sign implies the direction of influence. In parentheses are min-max normalized speed values.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g009.tif"/>
</fig>
<p>Second, we focus on two aircraft that are on different routes and about to pass the same intersection. One of them is closer to the intersection than the other. Aircraft pairs (22, 21), (3, 17), (4, 12), and (15, 14) are four examples. In the first three pairs, the aircraft closer to the intersection (22, 3, 4) decide to accelerate, while the others decide to maintain the current speed or decelerate. These decisions show cooperation between the two aircraft, in which the closer one passes the intersection first. However, aircraft 15 and aircraft 14 both select the acceleration action. The reason why aircraft 14 does not choose to maintain the current speed or decelerate is that its surrounding air traffic situation is more complex. <xref ref-type="fig" rid="F10">Figure 10</xref> shows that besides aircraft 15, aircraft 18 and aircraft 19 are also highly important to aircraft 14&#x2019;s decision. Therefore, aircraft 14 chooses to accelerate in order to keep ahead of aircraft 18 and aircraft 19, which again suggests the ability of the D2MAV-A model to encode order information.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>Saliency map using normalized LEG values for the decision of aircraft 14. The intruders are sorted by distance to the ownship in ascending order: 15, 18, 19, 10, 24, 22. Larger absolute values imply more important features, and the sign implies the direction of influence. In parentheses are min-max normalized speed values.</p>
</caption>
<graphic xlink:href="fpace-01-1071793-g010.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="conclusion" id="s7">
<title>7 Conclusion</title>
<p>In this paper, we propose a novel framework to explain DRL-based aircraft separation assurance models. Our framework provides both online explanations to human operators and offline explanations to certification agencies. Through numerical experiments in the BlueSky air traffic simulator, our results show that the proposed framework is able to capture crucial factors in the decisions of DRL models. In addition, we explain two specific patterns that DRL policies follow using the proposed framework, which suggests that the decision-making processes of DRL models can be more predictable and trustworthy. The promising results encourage us to further explore the effectiveness of the framework and its extensions for other safety-critical applications in the future.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s8">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s9">
<title>Author contributions</title>
<p>WG and YZ implemented the algorithms, conducted the experiments, and wrote the manuscript. PW led the project and helped revise the manuscript.</p>
</sec>
<sec id="s10">
<title>Funding</title>
<p>This project is supported by NASA Grant 80NSSC21M0087 under the NASA System-Wide Safety (SWS) program.</p>
</sec>
<ack>
<p>This work is an extension of a previous work published in WG and PW, Explainable Deep Reinforcement Learning for Aircraft Separation Assurance. 2022 IEEE/AIAA 41st Digital Avionics Systems Conference (DASC).</p>
</ack>
<sec sec-type="COI-statement" id="s11">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s12">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn id="fn2">
<label>1</label>
<p>The location of an aircraft can also be represented by its latitude.</p>
</fn>
<fn id="fn3">
<label>2</label>
<p>When we analyze the change of one feature, other features are assumed to be fixed.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ancona</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ceolini</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>&#xd6;ztireli</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Gross</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Towards better understanding of gradient-based attribution methods for deep neural networks</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>, <conf-loc>Vancouver, BC, Canada</conf-loc>, <conf-date>April 30&#x2013;May 3, 2018</conf-date>.</citation>
</ref>
<ref id="B2">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Brittain</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Autonomous separation assurance in an high-density en route sector: A deep multi-agent reinforcement learning approach</article-title>,&#x201d; in <conf-name>2019 IEEE Intelligent Transportation Systems Conference (ITSC)</conf-name>, <conf-loc>Auckland, New Zealand</conf-loc>, <conf-date>October 27&#x2013;30, 2019</conf-date> (<publisher-name>IEEE</publisher-name>), <fpage>3256</fpage>&#x2013;<lpage>3262</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Brittain</surname>
<given-names>M. W.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>One to any: Distributed conflict resolution with deep multi-agent reinforcement learning and long short-term memory</article-title>,&#x201d; in <conf-name>AIAA Scitech 2021 Forum. 1952</conf-name>, <conf-date>January 11&#x2013;15 &#x26; 19&#x2013;21, 2021</conf-date>. <comment>Virtual event</comment>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brittain</surname>
<given-names>M. W.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Autonomous separation assurance with deep multi-agent reinforcement learning</article-title>. <source>J. Aerosp. Inf. Syst.</source> <volume>18</volume>, <fpage>890</fpage>&#x2013;<lpage>905</lpage>. <pub-id pub-id-type="doi">10.2514/1.i010973</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Cideron</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Seurin</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Strub</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Pietquin</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Self-educated language agent with hindsight experience replay for instruction following</article-title>,&#x201d; in <conf-name>Visually Grounded Interaction and Language (ViGIL), NeurIPS 2019 Workshop</conf-name>, <conf-loc>Vancouver, Canada</conf-loc>, <conf-date>December 13, 2019</conf-date>.</citation>
</ref>
<ref id="B6">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Coppens</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Efthymiadis</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Lenaerts</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Now&#xe9;</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Weber</surname>
<given-names>R.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). &#x201c;<article-title>Distilling deep reinforcement learning policies in soft decision trees</article-title>,&#x201d; in <conf-name>Proceedings of the IJCAI 2019 workshop on explainable artificial intelligence</conf-name>, <conf-loc>Macao, China</conf-loc>, <conf-date>August 10&#x2013;16, 2019</conf-date>, <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Dahlin</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Kalagarla</surname>
<given-names>K. C.</given-names>
</name>
<name>
<surname>Naik</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Jain</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Nuzzo</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Designing interpretable approximations to deep reinforcement learning with soft decision trees</source>. <comment>arXiv preprint arXiv:2010.14785</comment>.</citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ding</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Hernandez-Leal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>G. W.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Cdt: Cascading decision trees for explainable reinforcement learning</source>. <comment>arXiv preprint arXiv:2011.07553</comment>.</citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Frosst</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Distilling a neural network into a soft decision tree</source>. <comment>arXiv preprint arXiv:1711.09784</comment>.</citation>
</ref>
<ref id="B10">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ghosh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Laguna</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lim</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Wynter</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Poonawala</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>A deep ensemble method for multi-agent reinforcement learning: A case study on air traffic control</article-title>,&#x201d; in <conf-name>Proceedings of the Thirty-First International Conference on Automated Planning and Scheduling (ICAPS 2021)</conf-name>, <conf-loc>Guangzhou, China</conf-loc> <comment>(virtual)</comment>, <conf-date>August 2&#x2013;13, 2021</conf-date> (<publisher-name>AAAI Press</publisher-name>), <fpage>468</fpage>&#x2013;<lpage>476</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Greydanus</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Koul</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Dodge</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fern</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Visualizing and understanding atari agents</article-title>,&#x201d; in <conf-name>International Conference on Machine Learning</conf-name>, <conf-loc>Stockholm, Sweden</conf-loc>, <conf-date>July 10&#x2013;15, 2018</conf-date> (<publisher-name>PMLR</publisher-name>), <fpage>1792</fpage>&#x2013;<lpage>1801</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Brittain</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Safety enhancement for deep reinforcement learning in autonomous separation assurance</article-title>,&#x201d; in <conf-name>2021 IEEE International Intelligent Transportation Systems Conference (ITSC)</conf-name>, <conf-loc>Indianapolis, IN, United States</conf-loc>, <conf-date>September 19&#x2013;22, 2021</conf-date>, <fpage>348</fpage>&#x2013;<lpage>354</lpage>. <pub-id pub-id-type="doi">10.1109/ITSC48978.2021.9564466</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Explainable deep reinforcement learning for aircraft separation assurance</article-title>,&#x201d; in <conf-name>2022 IEEE/AIAA 41st Digital Avionics Systems Conference (DASC)</conf-name>, <conf-loc>Portsmouth, VA, United States</conf-loc>, <conf-date>September 18&#x2013;22, 2022</conf-date> (<publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>10</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Gupta</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Puri</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Verma</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kayastha</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Deshmukh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Krishnamurthy</surname>
<given-names>B.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>Explain your move: understanding agent actions using specific and relevant feature attribution</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations (ICLR)</conf-name>, <conf-loc>Addis Ababa, Ethiopia</conf-loc>, <conf-date>April 26&#x2013;30, 2020</conf-date>.</citation>
</ref>
<ref id="B15">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Haksar</surname>
<given-names>R. N.</given-names>
</name>
<name>
<surname>Schwager</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Distributed deep reinforcement learning for fighting forest fires with a network of aerial robots</article-title>,&#x201d; in <conf-name>2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>, <conf-loc>Madrid, Spain</conf-loc>, <conf-date>October 1&#x2013;5, 2018</conf-date> (<publisher-name>IEEE</publisher-name>), <fpage>1067</fpage>&#x2013;<lpage>1074</lpage>.</citation>
</ref>
<ref id="B16">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Hoekstra</surname>
<given-names>J. M.</given-names>
</name>
<name>
<surname>Ellerbroek</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Bluesky ATC simulator project: an open data and open source approach</article-title>,&#x201d; in <conf-name>Proceedings of the 7th International Conference on Research in Air Transportation (ICRAT)</conf-name>, <conf-loc>Philadelphia, PA</conf-loc>, <conf-date>June 20&#x2013;24, 2016</conf-date> (<publisher-name>FAA/Eurocontrol USA/Europe</publisher-name>), <fpage>132</fpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Isufaj</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Aranega Sebastia</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Piera</surname>
<given-names>M. A.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Towards conflict resolution with deep multi-agent reinforcement learning</article-title>,&#x201d; in <conf-name>Proceedings of the 14th USA/Europe Air Traffic Management Research and Development Seminar (ATM2021)</conf-name>, <conf-loc>New Orleans, LA, USA</conf-loc>, <conf-date>September 20&#x2013;23, 2021</conf-date>, <fpage>20</fpage>&#x2013;<lpage>24</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Iyer</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Sundar</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Sycara</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Transparency and explanation in deep reinforcement learning neural networks</article-title>,&#x201d; in <conf-name>Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society</conf-name>, <conf-loc>New Orleans, LA, United States</conf-loc>, <conf-date>February 02&#x2013;03, 2018</conf-date>, <fpage>144</fpage>&#x2013;<lpage>150</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Jarrett</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>H&#xfc;y&#xfc;k</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Van Der Schaar</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Inverse decision modeling: Learning interpretable representations of behavior</article-title>,&#x201d; in <conf-name>International Conference on Machine Learning</conf-name> (<publisher-name>PMLR</publisher-name>), <fpage>4755</fpage>&#x2013;<lpage>4771</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jonschkowski</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Brock</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Learning state representations with robotic priors</article-title>. <source>Auton. Robots</source> <volume>39</volume>, <fpage>407</fpage>&#x2013;<lpage>428</lpage>. <pub-id pub-id-type="doi">10.1007/s10514-015-9459-7</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karakovskiy</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Togelius</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>The mario ai benchmark and competitions</article-title>. <source>IEEE Trans. Comput. Intell. AI Games</source> <volume>4</volume>, <fpage>55</fpage>&#x2013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1109/tciaig.2012.2188528</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Moon</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Rohrbach</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Darrell</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Canny</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Advisable learning for self-driving vehicles by internalizing observation-to-action rules</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Seattle, WA, United States</conf-loc>, <conf-date>June 13&#x2013;19, 2020</conf-date>, <fpage>9661</fpage>&#x2013;<lpage>9670</lpage>.</citation>
</ref>
<ref id="B23">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Schulte</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Toward interpretable deep reinforcement learning with linear model u-trees</article-title>,&#x201d; in <conf-name>Joint European Conference on Machine Learning and Knowledge Discovery in Databases</conf-name>, <conf-loc>Dublin, Ireland</conf-loc>, <conf-date>September 10&#x2013;14, 2018</conf-date> (<publisher-name>Springer</publisher-name>), <fpage>414</fpage>&#x2013;<lpage>429</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Luo</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Barut</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Jin</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Statistically consistent saliency estimation</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</conf-name>, <conf-loc>Montreal, QC, Canada</conf-loc>, <conf-date>October 10&#x2013;17, 2021</conf-date>, <fpage>745</fpage>&#x2013;<lpage>753</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Mnih</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Kavukcuoglu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Graves</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Antonoglou</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Wierstra</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2013</year>). &#x201c;<article-title>Playing Atari with deep reinforcement learning</article-title>,&#x201d; in <conf-name>NIPS Deep Learning Workshop</conf-name>.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Quinlan</surname>
<given-names>J. R.</given-names>
</name>
</person-group> (<year>1986</year>). <article-title>Induction of decision trees</article-title>. <source>Mach. Learn.</source> <volume>1</volume>, <fpage>81</fpage>&#x2013;<lpage>106</lpage>. <pub-id pub-id-type="doi">10.1007/bf00116251</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ribeiro</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ellerbroek</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hoekstra</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Determining optimal conflict avoidance manoeuvres at high densities with reinforcement learning</article-title>,&#x201d; in <conf-name>Proceedings of the Tenth SESAR Innovation Days</conf-name>, <conf-date>December 7&#x2013;10, 2020</conf-date>, <fpage>7</fpage>&#x2013;<lpage>10</lpage>. <comment>Virtual Conference</comment>.</citation>
</ref>
<ref id="B28">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Schulman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wolski</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Dhariwal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Radford</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Klimov</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Proximal policy optimization algorithms</source>. <comment>arXiv preprint arXiv:1707.06347</comment>.</citation>
</ref>
<ref id="B29">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Selvaraju</surname>
<given-names>R. R.</given-names>
</name>
<name>
<surname>Cogswell</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Das</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Vedantam</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Parikh</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Batra</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Grad-CAM: Visual explanations from deep networks via gradient-based localization</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>, <conf-loc>Venice, Italy</conf-loc>, <conf-date>October 22&#x2013;29, 2017</conf-date>, <fpage>618</fpage>&#x2013;<lpage>626</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Silva</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gombolay</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Killian</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Jimenez</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Son</surname>
<given-names>S.-H.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Optimization methods for interpretable differentiable decision trees applied to reinforcement learning</article-title>,&#x201d; in <conf-name>International Conference on Artificial Intelligence and Statistics</conf-name>, <conf-loc>Palermo, Sicily, Italy (Online)</conf-loc>, <conf-date>August 26&#x2013;28, 2020</conf-date> (<publisher-name>PMLR</publisher-name>), <fpage>1855</fpage>&#x2013;<lpage>1865</lpage>.</citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Simonyan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Vedaldi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2013</year>). <source>Deep inside convolutional networks: Visualising image classification models and saliency maps</source>. <comment>arXiv preprint arXiv:1312.6034</comment>.</citation>
</ref>
<ref id="B32">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Smilkov</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Thorat</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Vi&#xe9;gas</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Wattenberg</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2017</year>). <source>SmoothGrad: Removing noise by adding noise</source>. <comment>arXiv preprint arXiv:1706.03825</comment>.</citation>
</ref>
<ref id="B33">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sundararajan</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Taly</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Axiomatic attribution for deep networks</article-title>,&#x201d; in <conf-name>International Conference on Machine Learning</conf-name>, <conf-loc>Sydney, NSW, Australia</conf-loc>, <conf-date>August 6&#x2013;11, 2017</conf-date> (<publisher-name>PMLR</publisher-name>), <fpage>3319</fpage>&#x2013;<lpage>3328</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Van Hasselt</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Guez</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Silver</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Deep reinforcement learning with double Q-learning</article-title>,&#x201d; in <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>, <conf-loc>Phoenix, AZ, United States</conf-loc>, <conf-date>February 12&#x2013;17, 2016</conf-date>.</citation>
</ref>
<ref id="B35">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Verma</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Murali</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Kohli</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Chaudhuri</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Programmatically interpretable reinforcement learning</article-title>,&#x201d; in <conf-name>International Conference on Machine Learning</conf-name>, <conf-loc>Stockholm, Sweden</conf-loc>, <conf-date>July 10&#x2013;15, 2018</conf-date> (<publisher-name>PMLR</publisher-name>), <fpage>5045</fpage>&#x2013;<lpage>5054</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Deep reinforcement learning based conflict detection and resolution in air traffic control</article-title>. <source>IET Intell. Transp. Syst.</source> <volume>13</volume>, <fpage>1041</fpage>&#x2013;<lpage>1047</lpage>. <pub-id pub-id-type="doi">10.1049/iet-its.2018.5357</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Schaul</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Hessel</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Van Hasselt</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lanctot</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>de Freitas</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Dueling network architectures for deep reinforcement learning</article-title>,&#x201d; in <conf-name>International Conference on Machine Learning</conf-name> (<publisher-name>PMLR</publisher-name>), <fpage>1995</fpage>&#x2013;<lpage>2003</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wulfe</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2017</year>). <source>UAV collision avoidance policy optimization with deep reinforcement learning</source>. <comment>arXiv preprint paper [Dataset]</comment>.</citation>
</ref>
<ref id="B39">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zahavy</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ben-Zrihem</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Mannor</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Graying the black box: Understanding DQNs</article-title>,&#x201d; in <conf-name>International Conference on Machine Learning</conf-name>, <conf-loc>New York City, NY, United States</conf-loc>, <conf-date>June 19&#x2013;24, 2016</conf-date> (<publisher-name>PMLR</publisher-name>), <fpage>1899</fpage>&#x2013;<lpage>1908</lpage>.</citation>
</ref>
<ref id="B40">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zeiler</surname>
<given-names>M. D.</given-names>
</name>
<name>
<surname>Fergus</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Visualizing and understanding convolutional networks</article-title>,&#x201d; in <conf-name>European Conference on Computer Vision</conf-name>, <conf-loc>Zurich, Switzerland</conf-loc>, <conf-date>September 6&#x2013;12, 2014</conf-date> (<publisher-name>Springer</publisher-name>), <fpage>818</fpage>&#x2013;<lpage>833</lpage>.</citation>
</ref>
</ref-list>
</back>
</article>