<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2022.1072887</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Intelligent air defense task assignment based on hierarchical reinforcement learning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>Jia-yi</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/2109668/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Gang</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Guo</surname> <given-names>Xiang-ke</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Si-yuan</given-names></name>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Fu</surname> <given-names>Qiang</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2057869/overview"/>
</contrib>
</contrib-group>
<aff><institution>Air and Missile Defense College, Air Force Engineering University</institution>, <addr-line>Xi&#x2019;an</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Hong Qiao, University of Chinese Academy of Sciences, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Chunlin Chen, Nanjing University, China; Gaganpreet Singh, Institut Sup&#x00E9;rieur de l&#x2019;A&#x00E9;ronautique et de l&#x2019;Espace (ISAE-SUPAERO), France</p></fn>
<corresp id="c001">&#x002A;Correspondence: Qiang Fu, <email>fuqiang_66688@163.com</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>01</day>
<month>12</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>1072887</elocation-id>
<history>
<date date-type="received">
<day>18</day>
<month>10</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>11</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Liu, Wang, Guo, Wang and Fu.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Liu, Wang, Guo, Wang and Fu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Modern air defense battlefield situations are complex and varied, requiring high-speed computing capabilities and real-time situational processing for task assignment. Current methods struggle to balance the quality and speed of assignment strategies. This paper proposes a hierarchical reinforcement learning architecture for ground-to-air confrontation (HRL-GC) and an algorithm combining model predictive control with proximal policy optimization (MPC-PPO), which effectively combines the advantages of centralized and distributed approaches to improve training efficiency while ensuring the quality of the final decision. In a large-scale area air defense scenario, this paper validates the effectiveness and superiority of the HRL-GC architecture and MPC-PPO algorithm, proving that the method can meet the needs of large-scale air defense task assignment in terms of quality and speed.</p>
</abstract>
<kwd-group>
<kwd>air defense task assignment</kwd>
<kwd>hierarchical reinforcement learning</kwd>
<kwd>model predictive control</kwd>
<kwd>proximal policy optimization</kwd>
<kwd>agent</kwd>
</kwd-group>
<counts>
<fig-count count="8"/>
<table-count count="0"/>
<equation-count count="16"/>
<ref-count count="32"/>
<page-count count="14"/>
<word-count count="7753"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>Introduction</title>
<p>Modern air defense operations are becoming more complex with the rapid development of long-range, elemental, and intelligent processes. The rational planning of interception plans for incoming air targets to maximize operational effectiveness has become a massive challenge for the defenders in modern air defense operations (<xref ref-type="bibr" rid="B25">Yang et al., 2019</xref>). Task assignment changes the weapon target assignment (WTA) fire unit-target model to a task-target assignment model. This improves the ability to coordinate the various components, and the assignment scheme is more flexible, providing fundamental assurance of maximum operational effectiveness (<xref ref-type="bibr" rid="B21">Wang et al., 2019</xref>). With the continuous adoption of new technologies on both sides of the battlefield, the combat process is becoming increasingly complex, involving many elements. The battlefield environment and the adversary&#x2019;s strategy change rapidly and are difficult to quantify. Relying on human judgment and decision-making can no longer meet the requirements of fast-paced, high-intensity confrontation, and traditional analytical models cannot adapt to the needs of complex and changing scenarios. Reinforcement learning (RL) does not require an accurate mathematical model of the environment and the task and is less dependent on external guidance information. Therefore, some scholars have investigated the task assignment problem through intelligent methods such as single-agent reinforcement learning, multi-agent reinforcement learning (MARL), and deep reinforcement learning (DRL). <xref ref-type="bibr" rid="B28">Zhang et al. (2020)</xref> proposed an imitation-augmented deep reinforcement learning (IADRL) model to enable unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to form a complementary and cooperative alliance to accomplish tasks that they cannot do alone. 
<xref ref-type="bibr" rid="B24">Wu et al. (2022)</xref> proposed a dynamic multi-UAV task assignment algorithm based on reinforcement learning and a deep neural network, which effectively solves the problem of poor mission execution quality in complex dynamic environments. <xref ref-type="bibr" rid="B32">Zhao et al. (2019)</xref> proposed a Q-learning-based fast task assignment (FTA) algorithm for solving the task assignment problem of heterogeneous UAVs.</p>
<p>In modern air defense operations, the threat to the defense can be either a large-scale air attack or a small-scale contingency, so task assignment methods must balance effectiveness and dynamism. A centralized assignment solution is not fast enough, while a fully distributed assignment method does not respond effectively to unexpected events (<xref ref-type="bibr" rid="B9">Lee et al., 2012</xref>). The one-general agent with multiple narrow agents (OGMN) architecture (<xref ref-type="bibr" rid="B11">Liu J. Y. et al., 2022</xref>), which divides agents into general and narrow agents, improves computational speed and coordination ability. However, the narrow agents in OGMN are entirely rule-driven and lack autonomy, so they cannot fully adapt to the complex and changing battlefield environment. Therefore, this paper proposes the hierarchical reinforcement learning architecture for ground-to-air confrontation (HRL-GC) based on the OGMN architecture, which layers the agents into scheduling and execution levels. The scheduling agent assigns targets to the execution agents, which make the final decisions based on their own states; both types of agents are data-driven. Considering the inefficiency of the initial phase of agent training, this paper proposes a model-based model predictive control with proximal policy optimization (MPC-PPO) algorithm to train the execution agents and reduce inefficient exploration. Finally, HRL-GC is compared with two other architectures in a large-scale air defense scenario, and the effectiveness of the MPC-PPO algorithm is verified. Experimental results show that the HRL-GC architecture and MPC-PPO algorithm are suitable for large-scale air defense problems, effectively balancing the effectiveness and dynamism of task assignment.</p>
</sec>
<sec id="S2">
<title>Related work</title>
<sec id="S2.SS1">
<title>Deep reinforcement learning</title>
<p>Reinforcement learning was first introduced in the 1950s (<xref ref-type="bibr" rid="B13">Minsky, 1954</xref>) with the central idea of allowing an agent to learn in its environment, continuously refining its behavioral strategies through constant interaction with the environment and trial-and-error exploration (<xref ref-type="bibr" rid="B14">Moos et al., 2022</xref>). With the continuous development of RL, algorithms such as Q-learning (<xref ref-type="bibr" rid="B22">Watkins and Dayan, 1992</xref>) and SARSA (<xref ref-type="bibr" rid="B5">Chen et al., 2008</xref>) have been proposed. However, when faced with large-scale, high-dimensional decision-making environments, the computation and storage space that traditional RL methods require grow rapidly.</p>
<p>Deep reinforcement learning is a combination of RL and deep learning (DL). DL enables reinforcement learning to be extended to previously intractable decision problems and has led to significant results in areas such as drone surveys (<xref ref-type="bibr" rid="B29">Zhang et al., 2022</xref>), recommender search systems (<xref ref-type="bibr" rid="B18">Shen et al., 2021</xref>), and natural language processing (<xref ref-type="bibr" rid="B10">Li et al., 2022</xref>), particularly in the area of continuous end-to-end control (<xref ref-type="bibr" rid="B30">Zhao J. et al., 2021</xref>). In the problem studied in this paper, the decisions DRL shapes for the agents must be temporally correlated, enabling the air defense task assignment strategy to maximize future gains and more easily take the lead on the battlefield.</p>
</sec>
<sec id="S2.SS2">
<title>Hierarchical reinforcement learning</title>
<p>Hierarchical reinforcement learning (HRL) was proposed to address the curse of dimensionality in reinforcement learning. The idea is to decompose a whole task into multi-level subtasks by introducing mechanisms such as state space decomposition (<xref ref-type="bibr" rid="B20">Takahashi, 2001</xref>), state abstraction (<xref ref-type="bibr" rid="B1">Abel, 2019</xref>), and temporal abstraction (<xref ref-type="bibr" rid="B4">Bacon and Precup, 2018</xref>), so that each subtask can be solved in a small-scale state space, thus speeding up the solution of the whole task. To model these abstraction mechanisms, researchers introduced the semi-Markov Decision Process (SMDP) (<xref ref-type="bibr" rid="B3">Ascione and Cuomo, 2022</xref>) model to handle actions that span multiple time steps. The state space decomposition approach splits the state space into different subsets and adopts a divide-and-conquer strategy so that each solution is performed in a smaller subspace. Based on this idea, this paper divides the task assignment problem into two levels, scheduling and execution, and proposes the HRL-GC architecture to effectively combine the advantages of centralized and distributed assignment.</p>
</sec>
<sec id="S2.SS3">
<title>Model-based reinforcement learning</title>
<p>Model-free RL does not require environmental models (e.g., state transition probability models and reward function models) but is trained directly to obtain high-performance policies (<xref ref-type="bibr" rid="B2">Abouheaf et al., 2015</xref>). Model-based RL, on the other hand, first learns a model during the learning process and then searches for an optimized policy based on that model knowledge (<xref ref-type="bibr" rid="B31">Zhao T. et al., 2021</xref>). Model-free RL is less computationally intensive at each iteration because it does not need to learn a model, but too much invalid exploration makes the agent&#x2019;s learning inefficient. Model-based RL methods can learn complex behaviors from a minimal number of samples by using the collected data to fit a model. The model is then used to generate a large amount of simulation data to learn a &#x201C;state-action&#x201D; value function, reducing the interaction between the system and the environment and improving sampling efficiency (<xref ref-type="bibr" rid="B8">Gu et al., 2016</xref>). In air defense scenarios, the sampling cost is high, and it is difficult to collect large numbers of data samples. Therefore, this paper uses a model-based RL approach to build a neural network model from a small amount of collected sample data. The agent interacts with the model to obtain data, reducing the sampling cost and improving training efficiency.</p>
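<p>As a minimal sketch of the first step of model-based RL, a one-step dynamics model can be fitted to a small batch of collected transitions. This is our own toy illustration, not the paper&#x2019;s network: a linear least-squares model stands in for the neural network, and the function names are illustrative.</p>

```python
import numpy as np

def fit_dynamics_model(states, actions, next_states):
    """Least-squares fit of the state *change* (s_{t+1} - s_t)
    from the concatenated (state, action) input, as in Eq. 2."""
    X = np.hstack([states, actions])            # (N, ds + da)
    Y = next_states - states                    # predict deltas
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (ds + da, ds)
    return W

def predict_next_state(W, state, action):
    x = np.concatenate([state, action])
    return state + x @ W                        # add predicted delta back

# Toy check: transitions generated by a known linear system are recovered.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
S = rng.normal(size=(200, 2))
U = rng.normal(size=(200, 1))
S_next = S @ A.T + U @ B.T
W = fit_dynamics_model(S, U, S_next)
pred = predict_next_state(W, S[0], U[0])
```

<p>The agent can then roll out in this fitted model instead of the real environment, which is exactly where the reduced sampling cost comes from.</p>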
</sec>
<sec id="S2.SS4">
<title>Model predictive control</title>
<p>Model predictive control (MPC) is a branch of optimal control (<xref ref-type="bibr" rid="B12">Liu S. et al., 2022</xref>), and the idea of MPC is widely used in model-based RL algorithms due to its efficiency in unconstrained planning problems. It is based on the specific idea of using the collected data to train a model and obtain an optimal sequence of actions by solving an unconstrained optimization problem (<xref ref-type="bibr" rid="B26">Yang and Lucia, 2021</xref>), as shown in Eq. 1.</p>
<disp-formula id="S2.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>&#x002A;</mml:mo>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>&#x002A;</mml:mo>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
<mml:mo>&#x002A;</mml:mo>
</mml:msubsup>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mi>arg</mml:mi>
<mml:mo movablelimits="false">&#x2061;</mml:mo>
<mml:mi>max</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mstyle displaystyle="false">
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>H</mml:mi>
</mml:msubsup>
</mml:mstyle>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mpadded lspace="77.8pt" width="+77.8pt">
<mml:mi mathvariant="normal">s</mml:mi>
</mml:mpadded>
<mml:mo>.</mml:mo>
<mml:mi mathvariant="normal">t</mml:mi>
<mml:mo rspace="5.3pt">.</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">^</mml:mo>
</mml:mover>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
<p>Where <inline-formula><mml:math id="INEQ1"><mml:mrow><mml:mover accent="true"><mml:mi>f</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x2219;</mml:mo><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> is the learned model, typically a parametric neural network whose inputs are the current action <italic>a</italic><sub><italic>t</italic></sub> and the current state <italic>s</italic><sub><italic>t</italic></sub>, and whose output is the predicted state <inline-formula><mml:math id="INEQ2"><mml:msub><mml:mover accent="true"><mml:mi>s</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> for the next moment; the loss function of the neural network can be constructed as (<xref ref-type="bibr" rid="B27">Yaqi, 2021</xref>)</p>
<disp-formula id="S2.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mi>&#x03B5;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>&#x03B8;</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi class="ltx_font_mathcaligraphic">&#x1D49F;</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi class="ltx_font_mathcaligraphic">&#x1D49F;</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>2</mml:mn>
</mml:mfrac>
<mml:mo>&#x2062;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo fence="true">||</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">^</mml:mo>
</mml:mover>
<mml:mi>&#x03B8;</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo fence="true">||</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where &#x1D49F; is the collected demonstration dataset. It is obtained by generating random action sequences that interact with the model, calculating the cumulative reward of each sequence, and selecting the sequence with the highest cumulative reward. The first action of this sequence is then executed in the environment to obtain a new state, the resulting data are added to the demonstration dataset &#x1D49F;, and the same method is repeated to obtain the next action value. The model is trained using Eq. 2, and the dataset is continuously optimized, repeating the process until both the model and the demonstration dataset achieve good performance. In this way, model errors and external disturbances can be effectively suppressed, and robustness can be improved (<xref ref-type="bibr" rid="B16">Nagabandi et al., 2018</xref>). Based on this idea, the MPC-PPO algorithm is proposed: the model is trained by the MPC method and then used to pre-train the PPO network, improving pre-training efficiency.</p>
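<p>The action selection of Eq. 1 with random candidate sequences (&#x201C;random shooting&#x201D;) can be sketched as follows. This is a hedged toy example, not the paper&#x2019;s implementation: <monospace>f_hat</monospace> and <monospace>reward_fn</monospace> stand in for the learned model and the designed reward, and all names are ours.</p>

```python
import numpy as np

def mpc_plan(f_hat, reward_fn, s_t, horizon=5, n_candidates=200, seed=0):
    """Sample candidate action sequences, roll each through the learned
    model, score by summed reward, and return only the FIRST action of
    the best sequence (the MPC receding-horizon principle)."""
    rng = np.random.default_rng(seed)
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)  # a_t ... a_{t+H-1}
        s, total = s_t, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = f_hat(s, a)                              # model rollout, Eq. 1 constraint
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

# Toy 1-D system: the state drifts by the action; reward prefers s near 0,
# so from s_t = 2.0 the planner should pick a negative first action.
f_hat = lambda s, a: s + a
reward_fn = lambda s, a: -(s ** 2)
a0 = mpc_plan(f_hat, reward_fn, s_t=2.0)
```

<p>Executing only the first action and replanning at every step is what suppresses model error: errors accumulated over the horizon never reach the environment.</p>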
</sec>
</sec>
<sec id="S3">
<title>Problem modeling</title>
<sec id="S3.SS1">
<title>Problem formulation</title>
<p>Modern large-scale air defense missions are no longer one-to-one confrontations of one interceptor against one incoming target but rather one-to-many and many-to-one confrontations accomplished through efficient organizational synergy in the form of tactical coordination. This is a response to saturated long-range attacks by cruise missiles and multi-directional, multi-dimensional suppression attacks by mixtures of human-crewed and uncrewed aircraft. However, this one-to-many and many-to-one confrontation assignment is not fixed; during air defense confrontations, the air attack offensive posture changes in real time, and the confrontation assignment must be highly dynamic to respond to changes in the posture of the air attack threat (<xref ref-type="bibr" rid="B17">Rosier, 2009</xref>). The critical issue in this paper is the effective integration of combat resources according to the characteristics of different weapon systems, and the ability to dynamically change strategy according to the situation, so that they achieve a &#x201C;1 + 1 &#x003E; 2&#x201D; combat effectiveness.</p>
<p>To reduce complexity while satisfying dynamism, this paper divides the air defense operations process into two parts, resource scheduling and mission execution, based on the idea of HRL. The complexity of the high-dimensional state-action space is reduced by decomposing the entire process into multiple smaller problems and then integrating their solutions into a solution to the overall task assignment problem.</p>
</sec>
<sec id="S3.SS2">
<title>Markov Decision Process modeling of executive agents</title>
<p>In this paper, we study the air defense task assignment problem in a red-blue confrontation scenario, where the red side is the ground defender, and the blue side is the air attacker. We define a sensor and several interceptors around it as an interception unit. We use an independent learning framework to build the same MDP model for each interception unit.</p>
<p>State space: (1) states information of the defender&#x2019;s defended objects; (2) resource assignment of the unit, sensor and interceptor states; (3) states information of the attacker&#x2019;s targets within its own tracking and interception range; (4) states information of the attacker&#x2019;s incoming targets that are assigned to it.</p>
<p>Action space: (1) what timing to choose to track the target; (2) which interceptor to choose to intercept the target; (3) how many resources to choose to intercept the target; and (4) what timing to choose to intercept.</p>
<p>Reward function: to balance the efficiency of the agent&#x2019;s exploration and learning and to guide the agent progressively toward victory, this paper uses the principle of least resources to design the reward function.</p>
<disp-formula id="S3.E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi mathvariant="normal">m</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi mathvariant="normal">n</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mn>5</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi mathvariant="normal">i</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">j</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where <italic>m</italic> is the number of human-crewed aircraft intercepted, <italic>n</italic> is the number of high-threat targets intercepted, <italic>j</italic> is the number of missiles intercepted, and <italic>i</italic> is the number of times our unit has been attacked as a result of a failed interception. Five points are added for intercepting a human-crewed aircraft, two points for intercepting a high-threat target, and one point for intercepting a missile, while five points are deducted each time our unit is attacked due to a failed interception.</p>
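<p>Eq. 3 can be written directly in code; the scoring follows the paper, while the function itself is our illustration.</p>

```python
def execution_reward(m, n, i, j):
    """Execution-agent reward of Eq. 3: R = 5m + 2n - 5i + j.
    m: human-crewed aircraft intercepted, n: high-threat targets
    intercepted, i: failed interceptions that let our unit be
    attacked, j: missiles intercepted."""
    return 5 * m + 2 * n - 5 * i + j

# e.g. 2 crewed aircraft, 1 high-threat target, 1 failed interception,
# 3 missiles: R = 10 + 2 - 5 + 3 = 10
r = execution_reward(m=2, n=1, i=1, j=3)
```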
</sec>
<sec id="S3.SS3">
<title>Markov Decision Process modeling of scheduling agents</title>
<p>The task of the scheduling agent is to coordinate the tracking and interception tasks to interception units based on the global situation, with a state space, action space, and reward function designed as follows:</p>
<p>State space: (1) states information of the defender&#x2019;s defended objects; (2) states information of the defender&#x2019;s interception units, including resource assignment, sensor and interceptor states, and states information of the attacker&#x2019;s targets within the unit&#x2019;s interception range; (3) states information of the attacker&#x2019;s incoming targets; and (4) states information of the attacker&#x2019;s units that can be attacked.</p>
<p>Action space: (1) select the target to be tracked; (2) select the target to be intercepted; (3) select the interception unit.</p>
<p>Reward function: the merit of the task assignment strategy depends on the final result of task execution, so the reward of the scheduling agent is the sum of the rewards of all the underlying execution agents plus a base reward, as shown in Eq. 4.</p>
<disp-formula id="S3.E4">
<label>(4)</label>
<mml:math id="M4">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="false">
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
</mml:mstyle>
<mml:msub>
<mml:mtext>r</mml:mtext>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo mathvariant="italic" separator="true">&#x2003;&#x2003;&#x2002;</mml:mo>
<mml:mi>Fail</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mn>50</mml:mn>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mstyle displaystyle="false">
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
</mml:mstyle>
<mml:mrow>
<mml:mpadded width="+8.3pt">
<mml:msub>
<mml:mtext>r</mml:mtext>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mpadded>
<mml:mo>&#x2062;</mml:mo>
<mml:mpadded width="+8.3pt">
<mml:mi>Win</mml:mi>
</mml:mpadded>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi/>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where <italic>r</italic><sub><italic>i</italic></sub> is the reward value earned by each execution agent, with a base reward of 50 points for a win and 0 points for a failure; the failure and victory conditions are described in Section &#x201C;Experimental environment setting&#x201D; based on the specific scenario. This paper gives reward values stage by stage to guide the agent toward a winning strategy: for example, a corresponding reward value is given when a blue-side high-value unit is lost, and the winning reward value is given after the red side wins. This approach increases the effect of maximizing global revenue on each agent&#x2019;s revenue and reduces the agent&#x2019;s self-interest as much as possible, enhancing robustness while ensuring the reliability of the strategy.</p>
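<p>Eq. 4 is equally direct in code; as with Eq. 3, the function name and signature are our illustration of the paper&#x2019;s definition.</p>

```python
def scheduling_reward(rewards, win):
    """Scheduling-agent reward of Eq. 4: the sum of all execution
    agents' rewards r_i, plus a base reward of 50 on a win and 0 on
    a failure."""
    return sum(rewards) + (50 if win else 0)
```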
</sec>
</sec>
<sec id="S4">
<title>Hierarchical architecture design for agents</title>
<sec id="S4.SS1">
<title>General structure</title>
<p>Reinforcement learning methods applied to task assignment can be broadly classified into two categories, centralized and distributed. The centralized idea is to extend the single-agent algorithm to directly learn the output of a joint action, but it is difficult to define how each individual agent should make decisions (<xref ref-type="bibr" rid="B15">Moradi, 2016</xref>). In the distributed approach, each agent learns with its own reward function independently, and for each agent, the other agents are part of the environment (<xref ref-type="bibr" rid="B19">Suttle et al., 2020</xref>). In large-scale air defense task assignment problems, centralized methods can achieve globally optimal results but are often impractical for large-scale complex problems because of the time they consume. Distributed algorithms, on the other hand, can negotiate a better result more quickly without needing information about the specific parameters of individual weapons and the state of the surrounding environment. However, they also face a significant problem: the assignment results are locally optimal and less globally coordinated in response to unexpected events (<xref ref-type="bibr" rid="B23">Wu et al., 2019</xref>).</p>
<p>To combine global coordination capability with high-speed computing capability, this paper follows the idea of the OGMN architecture and proposes the HRL-GC architecture. This architecture layers the agents into scheduling agents and execution agents, strengthening the autonomy of the underlying execution agents and making the assignment policy more reasonable, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>HRL-GC architecture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1072887-g001.tif"/>
</fig>
<p>The agent interacts with the environment to generate simulation data, which the environment&#x2019;s output port converts into state information for the scheduling agent. The high-level scheduling agent outputs the task assignment result and assigns tasks to the underlying agents; each task execution agent outputs its final action information according to the assignment result and its own state; finally, the action information is transformed into combat instructions in the required data structure and fed back to the simulation environment. This constitutes one complete interaction cycle of the HRL-GC architecture. We decompose the whole process into scheduling and execution, with different agents making decisions, to reduce the complexity of the high-dimensional state-action space. The framework retains the global coordination capability of the centralized approach and adds the efficiency advantage of multiple agents. It preserves the scheduling agent&#x2019;s coordination ability, avoiding missed key targets, duplicate shots, and wasted resources, while reducing the scheduling agent&#x2019;s computational load and improving assignment efficiency.</p>
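The interaction cycle just described can be sketched as a plain control-loop step; the function names and callable stand-ins for the agents are illustrative assumptions, not the paper's implementation.

```python
def hrl_gc_step(global_state, local_states, schedule_fn, act_fns):
    """One HRL-GC interaction cycle.

    `schedule_fn` maps the global situation to a task assignment
    (one entry per executing agent); each function in `act_fns`
    turns its assignment and local state into an action, which the
    environment would then convert into combat instructions.
    All names here are illustrative.
    """
    assignment = schedule_fn(global_state)          # high-level scheduling decision
    return [act(assignment[i], local_states[i])     # low-level execution decisions
            for i, act in enumerate(act_fns)]
```

Splitting the decision this way is what shrinks the high-dimensional joint state-action space: the scheduler never reasons about concrete actions, and each executor only sees its own assignment.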
</sec>
<sec id="S4.SS2">
<title>Design of a hierarchical training framework for agents</title>
<p>Based on the idea of HRL, we train the scheduling and execution agents separately offline and then combine them for online inference. The reward function of the scheduling agent requires the reward values of all executing agents, which in turn depend to some extent on the scheduling agent&#x2019;s assignment results. Therefore, the executing agents are first trained to a certain level using expert assignment knowledge. Their network parameters are then fixed while the scheduling agent is introduced and trained, and the trained scheduling agent&#x2019;s parameters are fixed in turn to continue training the executing agents. The training framework is shown in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Hierarchical training framework.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1072887-g002.tif"/>
</fig>
<p>The task assignment scheme of the knowledge rule base (<xref ref-type="bibr" rid="B7">Fu et al., 2020</xref>) is first used to train the underlying executing agents. Once the executing agents reach a certain level, their parameters are fixed to train the scheduling agent. Based on the Actor-Critic architecture (<xref ref-type="bibr" rid="B6">Fernandez-Gauna et al., 2022</xref>), this paper adopts centralized learning with decentralized execution to train the multiple executing agents, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Centralized learning and decentralized execution.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1072887-g003.tif"/>
</fig>
<p>During training, <italic>n</italic> value networks (critics) collect the actions, state observations, and rewards of the executing agents and evaluate the decisions of the <italic>n</italic> policy networks (actors). At the end of training, the critics are no longer used. The algorithm for training the executing agents is one of the key issues studied in this paper and is described in detail in Section &#x201C;Model-based model predictive control with proximal policy optimization algorithm.&#x201D; The training method for the scheduling agent follows the proximal policy optimization for task assignment of general and narrow agents (PPO-TAGNA) algorithm from the literature (<xref ref-type="bibr" rid="B11">Liu J. Y. et al., 2022</xref>) to ensure the training effect and to show more clearly the changes brought by the executing agents.</p>
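Under centralized learning with decentralized execution, each critic scores its actor's decision using information available only at training time. A minimal sketch of the per-agent advantage computation follows; the simple reward-minus-baseline form is an illustrative choice, not the paper's exact estimator.

```python
def ctde_advantages(rewards, value_estimates):
    """Per-agent advantages for n actor-critic pairs.

    During centralized training, critic i supplies a value estimate
    for executing agent i; the advantage (here simply reward minus
    value) drives that actor's policy update. At execution time the
    critics are discarded and only the actors produce actions.
    """
    return [r - v for r, v in zip(rewards, value_estimates)]
```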
</sec>
<sec id="S4.SS3">
<title>Network architecture design for hierarchical reinforcement learning</title>
<p>For DRL, the network structure of the agents is key to the research. For the HRL-GC architecture, we decouple the general agent network in OGMN into two parts and improve them according to the MDP models in Sections &#x201C;Markov Decision Process modeling of executive agents&#x201D; and &#x201C;Markov Decision Process modeling of scheduling agents&#x201D; to serve as the training networks for the scheduling and executing agents, respectively, as shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Network structure.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1072887-g004.tif"/>
</fig>
<p>The input to the scheduling agent is global situational data; global features are obtained through feature extraction, vector concatenation, and other operations. The global features then pass through two FC-ReLU layers and an attention mechanism to output the value estimate and the task assignment result, respectively. Notably, the actor&#x2019;s output is changed from a concrete action to the task assignment matrix, i.e., the subject of the action, which significantly reduces the dimensionality of the action space and improves computational speed.</p>
<p>The actor network structure of the executing agent is the focus of this paper. Its input mainly consists of the agent&#x2019;s own state and the state of the assigned incoming target. After local features are generated through feature extraction, vector concatenation, and a Gated Recurrent Unit (GRU), the network again combines the assigned target&#x2019;s state to perform attention operations and finally outputs the timing of task execution. We partially reuse the scheduling agent&#x2019;s network and combine it vertically with a rule base that selects the resources to be used. In this way, instead of relying exclusively on the rule base for decision making, we enhance the autonomy of the task execution agent, relieve more of the scheduling agent&#x2019;s computational load, and make the assignment results more reasonable.</p>
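The executing agent's forward pass described above can be sketched numerically. All layer sizes, parameter names, and the mean-pooling of target features are assumptions made for the sketch; only the overall composition (feature extraction, concatenation, GRU, attention over the assigned target, timing output) follows the text.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gru_step(x, h, p):
    """One GRU step; p packs update (z), reset (r), candidate (n) weights."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(x @ p["Wz"] + h @ p["Uz"])
    r = sig(x @ p["Wr"] + h @ p["Ur"])
    n = np.tanh(x @ p["Wn"] + (r * h) @ p["Un"])
    return (1.0 - z) * h + z * n

def executor_actor(own_state, target_states, p):
    """Sketch of the executing agent's actor forward pass:
    feature extraction and concatenation, a GRU step, then attention
    that recombines the assigned target's features before the
    execution-timing output. Shapes and names are assumptions."""
    own_f = relu(own_state @ p["W_own"])                 # own-state features
    tgt_f = relu(target_states @ p["W_tgt"])             # (n_targets, d) target features
    local = np.concatenate([own_f, tgt_f.mean(axis=0)])  # local feature vector
    h = gru_step(local, p["h0"], p["gru"])               # temporal feature, size d
    scores = tgt_f @ h                                   # attention scores per target
    attn = np.exp(scores - scores.max())
    attn = attn / attn.sum()
    context = attn @ tgt_f                               # attended target context
    return context @ p["W_out"]                          # logits over execution timings
```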
</sec>
</sec>
<sec id="S5">
<title>Model-based model predictive control with proximal policy optimization algorithm</title>
<p>Sampling for large-scale adversarial tasks is a major contributor to excessive training time. The model-based RL approach can effectively mitigate this problem by building a virtual model for the agents to interact with. We use the MPC approach together with the virtual model, allowing the agent to interact with the model to obtain demonstration data. To reduce the impact of model error, we use the demonstration data set only to pre-train the network for the PPO algorithm, thus accelerating the exploration process in the initial training phase.</p>
<sec id="S5.SS1">
<title>Model predictive control approach for multi-agent task assignment</title>
<p>Based on the idea of MPC, this paper defines the entire task process time domain as [0,<italic>nT</italic>], and the system makes a decision every <italic>T</italic> moments. The time domain [0,<italic>T</italic>] is the period in which the task is executed. Before reaching the moment <italic>T</italic>, the agent needs to optimize the prediction of the strategy in the time domain [<italic>T</italic>,2<italic>T</italic>] based on the available situational information and resources. After reaching moment <italic>T</italic>, the agent executes the first action of the optimal action sequence while predicting and optimizing the decision solution for [2<italic>T</italic>,3<italic>T</italic>] in moment [<italic>T</italic>,2<italic>T</italic>], and so on until the end of the task.</p>
<p>Define the system as a cooperation of <italic>m</italic> agents. Let <italic>s</italic>(<italic>k</italic>) denote the state at moment <italic>k</italic>, &#x03BC;(<italic>k</italic>) the command input over the period [<italic>k</italic>,<italic>k</italic> + <italic>T</italic>], <italic>f</italic> the resource selection model, and <italic>k</italic> = 0,<italic>T</italic>,2<italic>T</italic>,&#x2026;,<italic>nT</italic> the decision time points. The discrete-time state equation of the system is then</p>
<disp-formula id="S5.E5">
<label>(5)</label>
<mml:math id="M5">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>&#x03BC;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>&#x03BC;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where the system state <italic>s</italic>(<italic>k</italic>) and the input decision &#x03BC;(<italic>k</italic>) can be expressed as</p>
<disp-formula id="S5.E6">
<label>(6)</label>
<mml:math id="M6">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x22EF;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mtext>T</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mi>&#x03BC;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x22EF;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi/>
</mml:mrow>
</mml:math>
</disp-formula>
<p>With <italic>s</italic><sub><italic>k+iT</italic></sub> denoting the predicted state of the controlled resources <italic>iT</italic> moments ahead, taking moment <italic>k</italic> as the current moment, the above equation shows that the expected state within [<italic>k</italic>,<italic>k</italic> + <italic>T</italic>] can be obtained from the state <italic>s</italic>(<italic>k</italic>) and the input decision &#x03BC;(<italic>k</italic>) at moment <italic>k</italic>, which yields Eq. 7.</p>
<disp-formula id="S5.E7">
<label>(7)</label>
<mml:math id="M7">
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
<p>where <italic>H</italic> is the number of predicted steps.</p>
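The prediction recursion of Eq. 7 amounts to rolling the learned one-step model forward over the horizon; a minimal sketch, with illustrative function names, follows.

```python
def predict_states(s_k, mu_sequence, f_hat):
    """Roll out Eq. 7: s_{k+(i+1)T} = f(s_{k+iT}, mu_{k+iT}),
    i = 0, 1, ..., H, where `mu_sequence` holds the H+1 planned
    inputs and `f_hat` is the (learned) one-step state model."""
    states = [s_k]
    for mu in mu_sequence:
        states.append(f_hat(states[-1], mu))
    return states
```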
<p>Defining the reward value at moment <italic>k</italic> as r(<italic>s</italic>(<italic>k</italic>)), the global reward of the <italic>m</italic>-agent system in the [<italic>k</italic>,<italic>k</italic> + <italic>T</italic>] time domain is</p>
<disp-formula id="S5.E8">
<label>(8)</label>
<mml:math id="M8">
<mml:mrow>
<mml:mrow>
<mml:mtext>r</mml:mtext>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:munderover>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>m</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:mtext>r</mml:mtext>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo lspace="2.5pt" rspace="2.5pt" stretchy="false">|</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where r(<italic>s</italic>(<italic>j</italic>|<italic>k</italic>)) is the reward of the <italic>j</italic>th agent at moment <italic>k</italic>. This leads to the optimal task assignment model of the global system.</p>
<disp-formula id="S5.E9">
<label>(9)</label>
<mml:math id="M9">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>&#x03C8;</mml:mi>
<mml:mo>&#x002A;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mi>arg</mml:mi>
<mml:mo movablelimits="false">&#x2061;</mml:mo>
<mml:mi>max</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x03C8;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:munderover>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>H</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:mtext>r</mml:mtext>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mspace width="1em"/>
<mml:mrow><mml:mi>s</mml:mi><mml:mo>.</mml:mo><mml:mi>t</mml:mi><mml:mo>.</mml:mo></mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S5.Ex1">
<mml:math id="M10">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>H</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mi>&#x03C8;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mpadded width="-1.7pt">
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mpadded>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>&#x03C8;</mml:mi>
<mml:mo>&#x002A;</mml:mo>
</mml:msup>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>&#x03BC;</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo>&#x002A;</mml:mo>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>&#x03BC;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x002A;</mml:mo>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>&#x03BC;</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x002A;</mml:mo>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mi>Y</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>&#x03C8;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi/>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where <italic>Y</italic>(<italic>s</italic>(<italic>k</italic>),&#x03C8;(<italic>k</italic>))&#x2264;0 is the system constraint, which will be described in detail in Section &#x201C;Constraints on the system.&#x201D;</p>
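The constrained maximization of Eq. 9 can be approximated numerically. The sketch below uses random shooting as the sequence optimizer; this is our illustrative choice (the paper does not commit to a particular solver), and all function and parameter names are assumptions.

```python
import random

def mpc_plan(s_k, f_hat, reward_fn, constraint_fn, action_space, H,
             n_samples=200, seed=0):
    """Approximate psi*(k) of Eq. 9 by random shooting: sample action
    sequences of length H+1, roll each out with the learned model
    f_hat (Eq. 7), discard sequences violating Y(s, psi) <= 0, and
    keep the feasible sequence with the highest summed reward. Its
    first action mu_k* is the one applied to the environment."""
    rng = random.Random(seed)
    best_seq, best_return = None, float("-inf")
    for _ in range(n_samples):
        seq = [rng.choice(action_space) for _ in range(H + 1)]
        s, total, feasible = s_k, 0.0, True
        for mu in seq:
            if constraint_fn(s, mu) > 0:   # system constraint Y <= 0 violated
                feasible = False
                break
            total += reward_fn(s, mu)
            s = f_hat(s, mu)               # one-step model prediction
        if feasible and total > best_return:
            best_seq, best_return = seq, total
    return best_seq[0], best_seq, best_return
```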
<p>&#x03C8;&#x002A;(<italic>k</italic>) is the action input sequence of the executing agents; the first action of this sequence, i.e., <italic>&#x03BC;</italic><sub><italic>k</italic></sub>&#x002A;, is applied to the environment to obtain a new state. This round of data is added to the demonstration data set, and the MPC procedure is repeated for the next round, and so on until the end of the task. We then use the demonstration dataset &#x1D49F; to train the model <inline-formula><mml:math id="INEQ25"><mml:msub><mml:mover accent="true"><mml:mi>f</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:math></inline-formula> <italic>via</italic> Eq. 10, continuously improving the quality of the demonstration dataset for the subsequent network pre-training.</p>
<disp-formula id="S5.E10">
<label>(10)</label>
<mml:math id="M11">
<mml:mrow>
<mml:mrow>
<mml:mpadded lspace="-5pt" width="-8pt">
<mml:mi>&#x03B5;</mml:mi>
</mml:mpadded>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>&#x03B8;</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi class="ltx_font_mathcaligraphic">&#x1D49F;</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi class="ltx_font_mathcaligraphic">&#x1D49F;</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>2</mml:mn>
</mml:mfrac>
<mml:mo>&#x2062;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo fence="true">||</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">^</mml:mo>
</mml:mover>
<mml:mi>&#x03B8;</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>&#x03BC;</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo fence="true">||</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
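The loss in Eq. 10 can be transcribed directly; note that the model is trained to predict the state change, not the next state itself. Function and variable names are illustrative.

```python
import numpy as np

def model_loss(f_hat_theta, dataset):
    """Eq. 10: epsilon(theta) = (1/|D|) * sum over (s_k, mu_k, s_{k+1})
    in D of 0.5 * || (s_{k+1} - s_k) - f_hat_theta(s_k, mu_k) ||^2.
    The regression target is the state *delta* s_{k+1} - s_k."""
    total = 0.0
    for s_k, mu_k, s_next in dataset:
        err = (s_next - s_k) - f_hat_theta(s_k, mu_k)
        total += 0.5 * float(err @ err)
    return total / len(dataset)
```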
</sec>
<sec id="S5.SS2">
<title>Model predictive control with proximal policy optimization algorithm</title>
<p>After training the model and obtaining samples with the MPC method, and following the principles of the PPO algorithm, this method uses a demonstration experience replay pool <italic>R</italic><sub><italic>d</italic></sub> to store the demonstration data and additionally constructs an exploration experience replay pool <italic>R</italic><sub><italic>e</italic></sub> to store the agent&#x2019;s exploration data. Data are drawn from these two replay pools in a given proportion. Considering the cumulative error of the model, the proportion of demonstration data in the extracted batch decreases as the time step increases, and after 1,000 steps only exploration data is used. The specific procedure is shown in <xref ref-type="table" rid="A1">Algorithm 1</xref>.</p>
<table-wrap position="float" id="A1">
<label>Algorithm 1</label>
<caption><p>Model predictive control with proximal policy optimization (MPC-PPO) algorithm.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<tbody>
<tr>
<td valign="top" align="left">
<monospace>Initialize the demonstration dataset <italic>R</italic><sub><italic>d</italic></sub> and the model <inline-formula><mml:math id="INEQ29"><mml:msub><mml:mover accent="true"><mml:mi>f</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:math></inline-formula></monospace><break/>
<monospace>Repeat for <italic>N</italic> rounds</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;Train <inline-formula><mml:math id="INEQ30"><mml:msub><mml:mover accent="true"><mml:mi>f</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mi>&#x03B8;</mml:mi></mml:msub></mml:math></inline-formula> using data from <italic>R</italic><sub><italic>d</italic></sub></monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;Repeat for <italic>T</italic> steps</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Estimate the optimal action sequence A using the MPC algorithm</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Apply the first action <italic>a</italic><sub><italic>t</italic></sub> of A to the environment to obtain the state <italic>s</italic><sub><italic>t</italic> + 1</sub></monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Add (<italic>s</italic><sub><italic>t</italic></sub>,<italic>a</italic><sub><italic>t</italic></sub>,<italic>r</italic><sub><italic>t</italic></sub>,<italic>s</italic><sub><italic>t</italic> + 1</sub>) to the dataset <italic>R</italic><sub><italic>d</italic></sub></monospace><break/>
<monospace>Initialize the policy parameters &#x03B8; and <italic>&#x03B8;</italic><sub><italic>old</italic></sub>, and the exploration data pool <italic>R</italic><sub><italic>e</italic></sub></monospace><break/>
<monospace>Repeat for each round of updates</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;Repeat for &#x03B5;<italic>N</italic> Actors</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Repeat for <italic>t</italic> steps</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Generate decisions with the old policy parameters <italic>&#x03B8;</italic><sub><italic>old</italic></sub></monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Calculate the advantage estimate A at each step</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Store the sample data in <italic>R</italic><sub><italic>e</italic></sub></monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;Iterate for <italic>K</italic> steps</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Solve the policy gradient of the cumulative expected reward function</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;using small batches of data drawn proportionally from <italic>R</italic><sub><italic>e</italic></sub> and <italic>R</italic><sub><italic>d</italic></sub></monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Update the policy parameter &#x03B8; with the policy gradient</monospace><break/>
<monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;Copy the new policy parameters to <italic>&#x03B8;</italic><sub><italic>old</italic></sub></monospace></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Where <italic>&#x03B8;</italic><sub><italic>old</italic></sub> and &#x03B8; are the old and new policy parameters, respectively. In each iteration, the algorithm runs &#x03B5;<italic>N</italic> Actors in parallel, with &#x03B5; being the proportion of exploration data in the total data. Each Actor runs <italic>T</italic> steps, collecting &#x03B5;<italic>NT</italic> samples in total. The advantage estimates <italic>A</italic><sub>1</sub>&#x2026;<italic>A</italic><sub><italic>T</italic></sub> are calculated at each step, and the remaining data are extracted from <italic>R</italic><sub><italic>d</italic></sub>. The acquired data are then used to update the policy parameters, iterating through each round with small batches of data. Since the PPO algorithm empties the data in the buffer after x updates, a certain number of demonstration samples must be added after each emptying of the buffer. The proportion of demonstration samples decreases as the number of updates increases, which reduces the impact of the model&#x2019;s cumulative error to a certain extent.</p>
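The decaying mix of demonstration and exploration data can be sketched as follows. The linear decay schedule is an illustrative assumption: the text states only that the proportion decreases and that exploration data is used exclusively after 1,000 steps.

```python
import random

def mixed_batch(step, batch_size, demo_pool, explore_pool,
                cutoff=1000, seed=0):
    """Sample a training batch from the two replay pools R_d and R_e.

    The demonstration share decays (here linearly, an assumed
    schedule) from 1 at step 0 to 0 at `cutoff` steps (1,000 in the
    text); beyond the cutoff only exploration data is drawn, which
    limits the influence of the model's cumulative error.
    """
    rng = random.Random(seed)
    demo_share = max(0.0, 1.0 - step / cutoff)
    n_demo = int(round(batch_size * demo_share))
    batch = [rng.choice(demo_pool) for _ in range(n_demo)]
    batch += [rng.choice(explore_pool) for _ in range(batch_size - n_demo)]
    return batch
```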
</sec>
<sec id="S5.SS3">
<title>Constraints on the system</title>
<p>Air defense tasks demand the highest level of policy safety and maximum avoidance of unsafe maneuvers during training. Therefore, to suppress uncertainty in the model-learning process and reduce model error, we add constraints to the system that keep the model realistic and safe. The specific rules are as follows:</p>
<list list-type="simple">
<list-item>
<label>(1)</label>
<p>Cooperative guidance constraints</p>
</list-item>
</list>
<p>For multi-platform cooperative systems, the constraints on guidance accuracy and guidance distance must be satisfied during cooperative guidance, as shown in Eq. 11.</p>
<disp-formula id="S5.E11">
<label>(11)</label>
<mml:math id="M12">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mi>&#x03B8;</mml:mi>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C3;</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
<mml:mo>&#x2265;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03C3;</mml:mi>
<mml:mi>min</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mstyle displaystyle="false">
<mml:msub>
<mml:mo largeop="true" mathsize="160%" stretchy="false" symmetric="true">&#x22C3;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mstyle>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2265;</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi/>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where <italic>n</italic> denotes the number of missiles to be guided, <italic>&#x03B8;</italic><sub><italic>T</italic></sub> represents the set of flight airspace angles of the target missile, <italic>&#x03B8;</italic><sub><italic>guide</italic></sub> denotes the operating range of the sensor, &#x03C3;<sub><italic>T</italic></sub> denotes the guidance accuracy, &#x03C3;<sub><italic>min</italic></sub> denotes the minimum guidance accuracy requirement, <italic>S</italic><sub><italic>i</italic></sub> denotes the guidance distance of sensor <italic>i</italic>, and <italic>S</italic><sub><italic>T</italic></sub> represents the distance to the missile. That is, the constraints of minimum guidance accuracy and maximum guidance distance must be satisfied during cooperative guidance.</p>
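<p>As an illustration, the three conditions of Eq. 11 can be checked per target as below. Representing the angular set <italic>&#x03B8;</italic><sub><italic>guide</italic></sub> as a (min, max) interval and pooling the sensor guidance distances by their maximum are simplifying assumptions; the function and variable names are ours.</p>

```python
def guidance_feasible(theta_T, theta_guide, sigma_T, sigma_min, S_list, S_T):
    """Eq. 11 for one target T: angle coverage, accuracy floor, distance reach."""
    lo, hi = theta_guide                  # sensor operating range as an interval
    angle_ok = lo <= theta_T <= hi        # theta_T in theta_guide
    accuracy_ok = sigma_T >= sigma_min    # sigma_T >= sigma_min
    distance_ok = max(S_list) >= S_T      # pooled sensor reach covers S_T
    return angle_ok and accuracy_ok and distance_ok
```

<p>A target is only assignable for cooperative guidance when all three checks pass.</p>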
<list list-type="simple">
<list-item>
<label>(2)</label>
<p>Time constraints</p>
</list-item>
</list>
<p>Because air defense tasks are highly real-time, task assignment is tightly time-constrained; for the executing agent, the factors associated with the time constraint are mainly reflected in the following:</p>
<list list-type="simple">
<list-item>
<label>1.</label>
<p>Timing of interceptions</p>
</list-item>
</list>
<p>Longest interception distance:</p>
<disp-formula id="S5.E12">
<label>(12)</label>
<mml:math id="M13">
<mml:mrow>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>L</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>L</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>H</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Nearest interception distance:</p>
<disp-formula id="S5.E13">
<label>(13)</label>
<mml:math id="M14">
<mml:mrow>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>H</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Where <italic>v</italic><sub><italic>m</italic></sub> is the speed of the target, <italic>H</italic> is the altitude of the target, <italic>P</italic> is the shortcut (lateral offset) of the target&#x2019;s flight path, <italic>D</italic><sub><italic>LS</italic></sub> and <italic>D</italic><sub><italic>NS</italic></sub> are the far and near boundaries of the kill zone against an oncoming target, and <italic>t</italic><sub><italic>L</italic></sub> and <italic>t</italic><sub><italic>N</italic></sub> are the times for the target to fly to the far and near boundaries of the oncoming kill zone, respectively.</p>
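<p>Eqs. 12 and 13 share one form, differing only in the kill-zone boundary (<italic>D</italic><sub><italic>LS</italic></sub> vs. <italic>D</italic><sub><italic>NS</italic></sub>) and the flight time (<italic>t</italic><sub><italic>L</italic></sub> vs. <italic>t</italic><sub><italic>N</italic></sub>), so a single direct transcription covers both (the function and parameter names are illustrative):</p>

```python
import math

def intercept_distance(D_boundary, v_m, t, H, P):
    """Eqs. 12-13: D_boundary is the far (D_LS) or near (D_NS) boundary of the
    oncoming kill zone, v_m the target speed, t the flight time (t_L or t_N),
    H the target altitude, P the flight-path shortcut."""
    inner = math.sqrt(D_boundary**2 - (H**2 + P**2))  # horizontal reach term
    return math.sqrt(D_boundary**2 + (v_m * t)**2 + 2 * v_m * t * inner)
```

<p>Eq. 14 has the same structure with <italic>D</italic><sub><italic>NS</italic></sub> as the boundary and <italic>t</italic><sub><italic>L</italic></sub> + <italic>t</italic><sub><italic>P</italic></sub> as the time, so under these assumptions <italic>D</italic><sub><italic>S</italic></sub> = intercept_distance(<italic>D</italic><sub><italic>NS</italic></sub>, <italic>v</italic><sub><italic>m</italic></sub>, <italic>t</italic><sub><italic>L</italic></sub> + <italic>t</italic><sub><italic>P</italic></sub>, <italic>H</italic>, <italic>P</italic>).</p>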
<list list-type="simple">
<list-item>
<label>2.</label>
<p>Timing of sensor switch-on</p>
</list-item>
</list>
<p>Sensor detection of the target is a prerequisite for intercepting it. In combat, a certain amount of time, called the pre-interception preparation time <italic>t</italic><sub><italic>P</italic></sub>, elapses between the sensor detecting the target and the interceptor intercepting it.</p>
<p>The required distance <italic>D</italic><sub><italic>S</italic></sub> at which sensors must find a target is determined by the target&#x2019;s distance at the farthest encounter point.</p>
<disp-formula id="S5.E14">
<label>(14)</label>
<mml:math id="M15">
<mml:mrow>
<mml:msub>
<mml:mi>D</mml:mi>
<mml:mi>S</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>L</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>v</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>L</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mi>P</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:msubsup>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>H</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:math>
</disp-formula>
<p>We define the state that satisfies the security constraint as <italic>S</italic>, which gives us Eq. 15.</p>
<disp-formula id="S5.E15">
<label>(15)</label>
<mml:math id="M16">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>Y</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>&#x03C8;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x003E;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mo mathvariant="italic" separator="true">&#x2003;&#x2003;&#x2002;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2284;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>Y</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo rspace="0.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>&#x03C8;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mo mathvariant="italic" separator="true">&#x2003;&#x2003;&#x2002;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2282;</mml:mo>
<mml:mpadded width="+5.6pt">
<mml:mi>S</mml:mi>
</mml:mpadded>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi/>
</mml:mrow>
</mml:math>
</disp-formula>
<p>In this multi-platform collaborative system, assignments can be made only when <italic>Y</italic>(<italic>s</italic>(<italic>k</italic>),&#x03C8;(<italic>k</italic>))&#x2264;0; otherwise, assignments against this batch of targets are invalid.</p>
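<p>The gating rule of Eq. 15 amounts to a guard around the assignment step. A minimal sketch, in which the safety function <italic>Y</italic> and the assignment routine are placeholders for the system&#x2019;s own implementations:</p>

```python
def try_assign(Y, s_k, psi_k, assign_fn, targets):
    """Perform assignment only when the state lies in the safety set S,
    i.e., Y(s(k), psi(k)) <= 0 (Eq. 15); otherwise the batch is invalid."""
    if Y(s_k, psi_k) <= 0:
        return assign_fn(targets)
    return None  # assignment against this batch of targets is invalid
```

<p>Any concrete <italic>Y</italic> encoding the guidance and time constraints above can be plugged in without changing the assignment logic itself.</p>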
</sec>
</sec>
<sec id="S6">
<title>Experiments and results</title>
<sec id="S6.SS1">
<title>Experimental environment setting</title>
<p>As an example, consider a large-scale air defense mission. The red side is the defender, with seven long-range interception units and five short-range interception units defending a command post and an airfield. Each long-range interception unit consists of one long-range sensor and eight long-range interceptors, and each short-range interception unit consists of one short-range sensor and three short-range interceptors. Blue is the attacker, deploying 18 cruise missiles, 20 UAVs, 12 fighters, and 2 jammers to attack Red in batches. Red loses when its command post is attacked three times, when the distance between Blue bombers and its command post falls below 10 km, or when its sensor losses exceed 60%; Red wins when Blue loses more than 30% of its fighters.</p>
<p>A schematic diagram of the experimental scenario is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Schematic diagram of an experimental scenario.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1072887-g005.tif"/>
</fig>
<p>In the experiments in this paper, the agent interacts with the environment on the digital battlefield, a DRL-oriented air defense combat simulation framework responsible for presenting the battlefield environment and simulating the interaction process, including the behavioral logic of each unit and the damage settlement of mutual attacks. It supports operations such as combat scenario editing and configuration of weapon and equipment capability indicators, allowing agents to be trained in different random scenarios; physical constraints such as earth curvature and obscuration can also be varied randomly within a certain range.</p>
</sec>
<sec id="S6.SS2">
<title>Experimental hardware configuration</title>
<p>The simulation environment runs on an Intel Xeon E5-2678 v3 CPU (88 cores, 256 GB memory) and two Nvidia GeForce 2080Ti GPUs (72 cores, 11 GB video memory each). In PPO, the clipping hyperparameter is &#x03B5; = 0.2, the learning rate is 10<sup>&#x2013;4</sup>, the batch size is 5,120, and the numbers of hidden-layer units in the neural network are 128 and 256.</p>
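<p>For reproducibility, the reported hyperparameters can be collected in a single configuration object; the values are from the text above, while the key names are ours:</p>

```python
# PPO hyperparameters as reported in the paper; only the key names are ours.
PPO_CONFIG = {
    "clip_epsilon": 0.2,         # PPO clipping parameter
    "learning_rate": 1e-4,
    "batch_size": 5120,
    "hidden_units": (128, 256),  # units in the two hidden layers
}
```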
</sec>
<sec id="S6.SS3">
<title>Agent architecture comparison</title>
<p>Alpha C2 (<xref ref-type="bibr" rid="B7">Fu et al., 2020</xref>) uses a commander structure, to which OGMN (<xref ref-type="bibr" rid="B11">Liu J. Y. et al., 2022</xref>) adds a rule-driven narrow agent; the HRL-GC architecture proposed in this paper replaces it with a data-driven narrow agent on top of OGMN. Therefore, in this experiment, we first trained the execution agents for 50,000 iterations using a rule base and fixed parameters, and then used three different algorithms to compare the training efficiency of the three agent architectures. We iterated the three architectures 100,000 times on the digital battlefield using the PPO, A3C, and DDPG algorithms, respectively, collecting data from each game of the confrontation and recording the reward values and win rates obtained by the red-side agents. The results are shown in <xref ref-type="fig" rid="F6">Figure 6</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>Comparison of agent architecture training effect.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1072887-g006.tif"/>
</fig>
<p>Comparing the mean-reward curves shows that the HRL-GC architecture significantly improves training efficiency: the final mean reward is higher and the reward curve stabilizes more readily. In terms of win rate, the proposed agent architecture also reaches higher win rates faster and stabilizes more readily than Alpha C2 and OGMN. The experiments demonstrate that the HRL-GC architecture further improves training efficiency and the agents&#x2019; decision-making while retaining the ability to coordinate.</p>
</sec>
<sec id="S6.SS4">
<title>Algorithm performance comparison</title>
<sec id="S6.SS4.SSS1">
<title>Comparison of training data</title>
<p>To verify that the MPC-PPO algorithm proposed in this paper improves efficiency in the early training period, we first trained the scheduling agent for 50,000 iterations with the PPO-TAGNA algorithm under the HRL-GC architecture in the same scenario setting. Then the execution agent was trained centrally for 50,000 iterations with each of the MPC-PPO, PPO-TAGNA, and PPO (<xref ref-type="bibr" rid="B7">Fu et al., 2020</xref>) algorithms. The training results are shown in <xref ref-type="fig" rid="F7">Figure 7</xref>.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption><p>Algorithm performance comparison.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1072887-g007.tif"/>
</fig>
<p>The comparison results show that the proposed MPC-PPO algorithm achieves a higher initial reward and a significantly faster rise in win rate in the early stages. Both the reward and win-rate curves show some decline and instability after the initial rise, owing to model errors and other factors; nevertheless, over the first 50,000 training steps, MPC-PPO is more efficient than PPO-TAGNA and PPO and raises the agent&#x2019;s reward and win rate faster.</p>
</sec>
<sec id="S6.SS4.SSS2">
<title>Behavioral analysis</title>
<p>The model-based RL approach aims to let the agent reduce ineffective exploration in the initial training stages and quickly reach a certain level. We therefore trained the PPO agent, the PPO-TAGNA agent, and the proposed MPC-PPO agent for only 50,000 iterations each in a complex scenario, then analyzed their behavior separately and compared them with an untrained agent. The results are shown in <xref ref-type="fig" rid="F8">Figure 8</xref>.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption><p>Comparison of behavioral details.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1072887-g008.tif"/>
</fig>
<p>The behavioral analysis shows that the untrained agent (top left) adopts a random policy, wastes too much ammunition defending against the first attacks, and eventually fails because it runs out of resources; the PPO agent (top right) has not yet explored a mature policy at this stage and only attacks high-value targets without dealing with incoming high-threat targets; the PPO-TAGNA agent (bottom left) has learned a coordinated-interception policy at this stage, but its response timing is inaccurate and its scope of coordination is small; the MPC-PPO agent (bottom right) can effectively coordinate the interception of high-threat targets while attacking high-value targets. Therefore, in large-scale complex scenarios, the MPC-PPO algorithm enables agents to reduce ineffective exploration in the initial stages of training and learn practical policies more quickly.</p>
</sec>
</sec>
</sec>
<sec id="S7" sec-type="conclusion">
<title>Conclusion</title>
<p>To address the difficulty of balancing effectiveness and dynamism in modern air defense task assignment, this paper proposes the HRL-GC architecture, which layers the agents into a scheduling agent and execution agents: the scheduling agent coordinates the global situation to ensure effectiveness, while the execution agents distribute execution to improve efficiency and thereby ensure dynamism. To improve training efficiency in the initial stage, this paper proposes the model-based MPC-PPO algorithm to train the execution agents. Finally, experiments in a large-scale air defense scenario compare the agent architectures and the algorithms&#x2019; performance. The results show that the HRL-GC architecture and the MPC-PPO algorithm further improve the agents&#x2019; decision-making and train them more efficiently, and that the resulting assignment scheme better meets the needs of large-scale air defense, effectively balancing the effectiveness and dynamism of air defense task assignment.</p>
</sec>
<sec id="S8" sec-type="data-availability">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="S9">
<title>Author contributions</title>
<p>J-yL: conceptualization, software, and writing&#x2014;original draft preparation. J-yL and GW: methodology. J-yL, QF, and X-kG: validation. S-yW: formal analysis. X-kG: investigation. QF: resources, project administration, and funding acquisition. J-yL and S-yW: data curation and visualization. GW and QF: writing&#x2014;review and editing. GW: supervision. All authors read and agreed to the published version of the manuscript.</p>
</sec>
</body>
<back>
<sec id="S10" sec-type="funding-information">
<title>Funding</title>
<p>The authors would like to acknowledge the National Natural Science Foundation of China (Grant Nos. 62106283 and 72001214) for funding the experiments.</p>
</sec>
<sec id="S11" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="S12" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abel</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). &#x201C;<article-title>A theory of state abstraction for reinforcement learning</article-title>,&#x201D; in <source><italic>Proceedings of the 33rd AAAI conference on artificial intelligence</italic></source>, <volume>Vol. 33</volume>, (<publisher-loc>Palo Alto, CA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>). <pub-id pub-id-type="doi">10.1609/aaai.v33i01.33019876</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abouheaf</surname> <given-names>M. I.</given-names></name> <name><surname>Lewis</surname> <given-names>F. L.</given-names></name> <name><surname>Mahmoud</surname> <given-names>M. S.</given-names></name> <name><surname>Mikulski</surname> <given-names>D. G.</given-names></name></person-group> (<year>2015</year>). <article-title>Discrete-time dynamic graphical games: Model-free reinforcement learning solution.</article-title> <source><italic>Control Theory Technol.</italic></source> <volume>13</volume> <fpage>55</fpage>&#x2013;<lpage>69</lpage>. <pub-id pub-id-type="doi">10.1007/s11768-015-3203-x</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ascione</surname> <given-names>G.</given-names></name> <name><surname>Cuomo</surname> <given-names>S. A.</given-names></name></person-group> (<year>2022</year>). <article-title>Sojourn-based approach to semi-markov reinforcement learning.</article-title> <source><italic>J. Sci. Comput.</italic></source> <volume>92</volume>:<issue>36</issue>. <pub-id pub-id-type="doi">10.1007/S10915-022-01876-X</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bacon</surname> <given-names>P. L.</given-names></name> <name><surname>Precup</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>Constructing temporal abstractions autonomously in reinforcement learning.</article-title> <source><italic>AI Mag.</italic></source> <volume>39</volume> <fpage>39</fpage>&#x2013;<lpage>50</lpage>.</citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Mabu</surname> <given-names>S.</given-names></name> <name><surname>Shimada</surname> <given-names>K.</given-names></name> <name><surname>Hirasawa</surname> <given-names>K.</given-names></name></person-group> (<year>2008</year>). <article-title>Trading rules on stock markets using genetic network programming with sarsa learning.</article-title> <source><italic>J. Adv. Comput. Intell. Intell. Inform.</italic></source> <volume>12</volume> <fpage>383</fpage>&#x2013;<lpage>392</lpage>. <pub-id pub-id-type="doi">10.20965/jaciii.2008.p0383</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fernandez-Gauna</surname> <given-names>B.</given-names></name> <name><surname>Graa</surname> <given-names>M.</given-names></name> <name><surname>Osa-Amilibia</surname> <given-names>J. L.</given-names></name> <name><surname>Larrucea</surname> <given-names>X.</given-names></name></person-group> (<year>2022</year>). <article-title>Actor-critic continuous state reinforcement learning for wind-turbine control robust optimization.</article-title> <source><italic>Inf. Sci.</italic></source> <volume>591</volume> <fpage>365</fpage>&#x2013;<lpage>380</lpage>. <pub-id pub-id-type="doi">10.1016/j.ins.2022.01.047</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fu</surname> <given-names>Q.</given-names></name> <name><surname>Fan</surname> <given-names>C. L.</given-names></name> <name><surname>Song</surname> <given-names>Y. F.</given-names></name></person-group> (<year>2020</year>). <article-title>Alpha C2&#x2013;an intelligent air defense commander independent of human decision-making.</article-title> <source><italic>IEEE Access</italic></source> <volume>8</volume> <fpage>87504</fpage>&#x2013;<lpage>87516</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2020.2993459</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gu</surname> <given-names>S.</given-names></name> <name><surname>Lillicrap</surname> <given-names>T.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>Continuous deep Q-learning with model-based acceleration</article-title>,&#x201D; in <source><italic>Proceedings of the 33rd international conference on international conference on machine learning</italic></source>, (<publisher-loc>Cambridge MA</publisher-loc>: <publisher-name>JMLR.org</publisher-name>).</citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>C. H.</given-names></name> <name><surname>Moon</surname> <given-names>G. H.</given-names></name> <name><surname>Yoo</surname> <given-names>D. W.</given-names></name> <name><surname>Tahk</surname> <given-names>M. J.</given-names></name> <name><surname>Lee</surname> <given-names>I. S.</given-names></name></person-group> (<year>2012</year>). <article-title>Distributed task assignment algorithm for SEAD mission of heterogeneous UAVs based on CBBA algorithm.</article-title> <source><italic>J. Korean Soc. Aeronaut. Space Sci.</italic></source> <volume>40</volume> <fpage>988</fpage>&#x2013;<lpage>996</lpage>.</citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Dong</surname> <given-names>M.</given-names></name> <name><surname>Ming</surname> <given-names>L.</given-names></name> <name><surname>Luo</surname> <given-names>C.</given-names></name> <name><surname>Yu</surname> <given-names>H.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name><etal/></person-group> (<year>2022</year>). <article-title>Deep reinforcement learning based ensemble model for rumor tracking.</article-title> <source><italic>Inf. Syst.</italic></source> <volume>103</volume>:<issue>101772</issue>. <pub-id pub-id-type="doi">10.1016/J.IS.2021.101772</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>J. Y.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Fu</surname> <given-names>Q.</given-names></name></person-group> (<year>2022</year>). <article-title>Task assignment in ground-to-air confrontation based on multiagent deep reinforcement learning.</article-title> <source><italic>Def. Technol.</italic></source> <pub-id pub-id-type="doi">10.1016/j.dt.2022.04.001</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S.</given-names></name> <name><surname>Sadowska</surname> <given-names>A.</given-names></name> <name><surname>De Schutter</surname> <given-names>B.</given-names></name></person-group> (<year>2022</year>). <article-title>A scenario-based distributed model predictive control approach for freeway networks.</article-title> <source><italic>Transp. Res. C</italic></source> <volume>136</volume>:<issue>103261</issue>. <pub-id pub-id-type="doi">10.1016/J.TRC.2021.103261</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Minsky</surname> <given-names>M. L.</given-names></name></person-group> (<year>1954</year>). <source><italic>Theory of neural-analog reinforcement systems and its application to the brain-model problem.</italic></source> <publisher-loc>Princeton, NJ</publisher-loc>: <publisher-name>Princeton University</publisher-name>.</citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moos</surname> <given-names>J.</given-names></name> <name><surname>Hansel</surname> <given-names>K.</given-names></name> <name><surname>Abdulsamad</surname> <given-names>H.</given-names></name> <name><surname>Stark</surname> <given-names>S.</given-names></name> <name><surname>Clever</surname> <given-names>D.</given-names></name> <name><surname>Peters</surname> <given-names>J.</given-names></name></person-group> (<year>2022</year>). <article-title>Robust reinforcement learning: A review of foundations and recent advances.</article-title> <source><italic>Mach. Learn. Knowl. Extr.</italic></source> <volume>4</volume> <fpage>276</fpage>&#x2013;<lpage>315</lpage>. <pub-id pub-id-type="doi">10.3390/MAKE4010013</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moradi</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>A centralized reinforcement learning method for multi-agent job scheduling in grid</article-title>,&#x201D; in <source><italic>Proceedings of the 6th international conference on computer and knowledge engineering (ICCKE 2016)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>).</citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nagabandi</surname> <given-names>A.</given-names></name> <name><surname>Kahn</surname> <given-names>G.</given-names></name> <name><surname>Fearing</surname> <given-names>R. S.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). &#x201C;<article-title>Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning</article-title>,&#x201D; in <source><italic>Proceedings of the 2018 IEEE international conference on robotics and automation (ICRA)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE Press</publisher-name>), <fpage>7559</fpage>&#x2013;<lpage>7566</lpage>.</citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rosier</surname> <given-names>F.</given-names></name></person-group> (<year>2009</year>). <article-title>Modern air defence: A lecture given at the RUSI on 14th December 1966.</article-title> <source><italic>R. U. Serv. Inst. J.</italic></source> <volume>112</volume> <fpage>229</fpage>&#x2013;<lpage>236</lpage>. <pub-id pub-id-type="doi">10.1080/03071846709429752</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shen</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>T.</given-names></name> <name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Xu</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Deep reinforcement learning for stock recommendation.</article-title> <source><italic>J. Phys.</italic></source> <volume>2050</volume>:<issue>012012</issue>. <pub-id pub-id-type="doi">10.1088/1742-6596/2050/1/012012</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suttle</surname> <given-names>W.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>K.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Ba&#x015F;ar</surname> <given-names>T.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning.</article-title> <source><italic>IFAC PapersOnLine</italic></source> <volume>53</volume> <fpage>1549</fpage>&#x2013;<lpage>1554</lpage>.</citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Takahashi</surname> <given-names>Y.</given-names></name></person-group> (<year>2001</year>). &#x201C;<article-title>Multi-controller fusion in multi-layered reinforcement learning</article-title>,&#x201D; in <source><italic>Proceedings of the conference documentation international conference on multisensor fusion and integration for intelligent systems. MFI 2001 (Cat. No.01TH8590)</italic></source>, (<publisher-loc>Baden-Baden</publisher-loc>: <publisher-name>IEEE</publisher-name>).</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Cheng</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Zhu</surname> <given-names>D.</given-names></name> <name><surname>Zhao</surname> <given-names>R.</given-names></name></person-group> (<year>2019</year>). &#x201C;<article-title>Research on mission assignment assurance of remote rocket barrage based on stackelberg game</article-title>,&#x201D; in <source><italic>Proceedings of the 2nd international conference on frontiers of materials synthesis and processing IOP conference series: Materials science and engineering</italic></source>, <volume>Vol. 493</volume> <publisher-loc>Sanya</publisher-loc>. <pub-id pub-id-type="doi">10.1088/1757-899X/493/1/012056</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Watkins</surname> <given-names>C. J. C. H.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name></person-group> (<year>1992</year>). <article-title>Q-learning.</article-title> <source><italic>Mach. Learn.</italic></source> <volume>8</volume> <fpage>279</fpage>&#x2013;<lpage>292</lpage>. <pub-id pub-id-type="doi">10.1007/BF00992698</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>C.</given-names></name> <name><surname>Xu</surname> <given-names>G.</given-names></name> <name><surname>Ding</surname> <given-names>Y.</given-names></name> <name><surname>Zhao</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Explore deep neural network and reinforcement learning to large-scale tasks processing in big data.</article-title> <source><italic>Int. J. Pattern Recognit. Artif. Intell.</italic></source> <volume>33</volume>:<issue>1951010</issue>.</citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>H.</given-names></name> <name><surname>Feng</surname> <given-names>J.</given-names></name> <name><surname>Dai</surname> <given-names>Q.</given-names></name></person-group> (<year>2022</year>). <article-title>Research on multi-UAV task assignment method based on reinforcement learning.</article-title> <source><italic>World Sci. Res. J.</italic></source> <volume>8</volume> <fpage>104</fpage>&#x2013;<lpage>109</lpage>. <pub-id pub-id-type="doi">10.6911/WSRJ.202201_8(1).0017</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>H. Y.</given-names></name> <name><surname>Zhang</surname> <given-names>S. W.</given-names></name> <name><surname>Li</surname> <given-names>X. Y.</given-names></name></person-group> (<year>2019</year>). <article-title>Modeling of situation assessment in regional air defense combat.</article-title> <source><italic>J. Def. Model. Simul. Appl. Methodol. Technol.</italic></source> <volume>16</volume> <fpage>91</fpage>&#x2013;<lpage>101</lpage>. <pub-id pub-id-type="doi">10.1177/1548512918809514</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Lucia</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Multi-step greedy reinforcement learning based on model predictive control.</article-title> <source><italic>IFAC PapersOnLine</italic></source> <volume>54</volume> <fpage>699</fpage>&#x2013;<lpage>705</lpage>. <pub-id pub-id-type="doi">10.1016/J.IFACOL.2021.08.323</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yaqi</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <source><italic>Tensegrity robot locomotion control via reinforcement learning.</italic></source> <publisher-loc>Dalian</publisher-loc>: <publisher-name>Dalian University of Technology</publisher-name>. <pub-id pub-id-type="doi">10.26991/d.cnki.gdllu.2021.000998</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Mao</surname> <given-names>S.</given-names></name> <name><surname>Periaswamy</surname> <given-names>S. C. G.</given-names></name> <name><surname>Patton</surname> <given-names>J.</given-names></name> <name><surname>Xia</surname> <given-names>X.</given-names></name></person-group> (<year>2020</year>). <article-title>IADRL: Imitation augmented deep reinforcement learning enabled UGV-UAV coalition for tasking in complex environments.</article-title> <source><italic>IEEE Access</italic></source> <volume>8</volume> <fpage>102335</fpage>&#x2013;<lpage>102347</lpage>. <pub-id pub-id-type="doi">10.1109/access.2020.2997304</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Dong</surname> <given-names>Q.</given-names></name></person-group> (<year>2022</year>). <article-title>Autonomous navigation of UAV in multi-obstacle environments based on a deep reinforcement learning approach.</article-title> <source><italic>Appl. Soft Comput. J.</italic></source> <volume>115</volume>:<issue>108194</issue>. <pub-id pub-id-type="doi">10.1016/J.ASOC.2021.108194</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>J.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name> <name><surname>Cai</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>End-to-end deep reinforcement learning for image-based UAV autonomous control.</article-title> <source><italic>Appl. Sci.</italic></source> <volume>11</volume>:<issue>8419</issue>. <pub-id pub-id-type="doi">10.3390/APP11188419</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>T.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Kong</surname> <given-names>L.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>A model-based reinforcement learning method based on conditional generative adversarial networks.</article-title> <source><italic>Pattern Recognit. Lett.</italic></source> <volume>152</volume> <fpage>18</fpage>&#x2013;<lpage>25</lpage>. <pub-id pub-id-type="doi">10.1016/J.PATREC.2021.08.019</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Zong</surname> <given-names>Q.</given-names></name> <name><surname>Tian</surname> <given-names>B.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>You</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>Fast task allocation for heterogeneous unmanned aerial vehicles through reinforcement learning.</article-title> <source><italic>Aerosp. Sci. Technol.</italic></source> <volume>92</volume> <fpage>588</fpage>&#x2013;<lpage>594</lpage>. <pub-id pub-id-type="doi">10.1016/j.ast.2019.06.024</pub-id></citation></ref>
</ref-list>
</back>
</article>