<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2023.1243174</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>MW-MADDPG: a meta-learning based decision-making method for collaborative UAV swarm</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhao</surname> <given-names>Minrui</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2345886/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Gang</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Fu</surname> <given-names>Qiang</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2057869/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Guo</surname> <given-names>Xiangke</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Chen</surname> <given-names>Yu</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Tengda</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2261638/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>XiangYu</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>College of Air and Missile Defense, Air Force Engineering University</institution>, <addr-line>Xi&#x00027;an</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Graduate School, Academy of Military Science</institution>, <addr-line>Beijing</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>Unit 95866 of PLA</institution>, <addr-line>Baoding</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Ming-Feng Ge, China University of Geosciences Wuhan, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Mu Hua, University of Lincoln, United Kingdom; Yan Fang, Kennesaw State University, United States; Pengyu Yuan, Google, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Qiang Fu <email>fuqiang_66688&#x00040;163.com</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>21</day>
<month>09</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>17</volume>
<elocation-id>1243174</elocation-id>
<history>
<date date-type="received">
<day>20</day>
<month>06</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>09</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Zhao, Wang, Fu, Guo, Chen, Li and Liu.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Zhao, Wang, Fu, Guo, Chen, Li and Liu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Unmanned Aerial Vehicles (UAVs) have gained popularity due to their low lifecycle cost and minimal human risk, resulting in their widespread use in recent years. In the UAV swarm cooperative decision domain, multi-agent deep reinforcement learning has significant potential. However, current approaches are challenged by variable mission environments and mission time constraints. To address this problem, the present study proposes a meta-learning based multi-agent deep reinforcement learning approach. Specifically, it presents an improved MAML-based multi-agent deep deterministic policy gradient (MADDPG) algorithm that achieves an unbiased initialization network by automatically assigning weights to meta-learning trajectories. In addition, a Reward-TD prioritized experience replay technique is introduced, which takes into account immediate reward and TD-error to improve the resilience and sample utilization of the algorithm. Experimental results show that the proposed approach effectively accomplishes tasks in new scenarios, with significantly improved task success rate, average reward, and robustness compared to existing methods.</p></abstract>
<kwd-group>
<kwd>UAV</kwd>
<kwd>meta learning</kwd>
<kwd>multi-agent reinforcement learning (MARL)</kwd>
<kwd>Model Agnostic Meta Learning (MAML)</kwd>
<kwd>MADDPG</kwd>
</kwd-group>
<counts>
<fig-count count="9"/>
<table-count count="6"/>
<equation-count count="27"/>
<ref-count count="37"/>
<page-count count="16"/>
<word-count count="9256"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Unmanned Aerial Vehicles (UAVs) are reusable aircraft that carry no onboard pilot; instead, they accomplish their assigned tasks under remote or autonomous control (Silveira et al., <xref ref-type="bibr" rid="B27">2020</xref>; Yao et al., <xref ref-type="bibr" rid="B35">2021</xref>), and they have received much attention from industry in recent years. UAVs offer several advantages, including low life-cycle cost (Lei et al., <xref ref-type="bibr" rid="B13">2021</xref>), low personnel risk (Rodriguez-Fernandez et al., <xref ref-type="bibr" rid="B26">2017</xref>), long flight duration (Ge et al., <xref ref-type="bibr" rid="B6">2022</xref>; Pasha et al., <xref ref-type="bibr" rid="B22">2022</xref>), and favorable maneuverability, size, and speed (Poudel and Moh, <xref ref-type="bibr" rid="B24">2022</xref>). They are increasingly used in fields such as target tracking (Hu et al., <xref ref-type="bibr" rid="B10">2023</xref>), agriculture (Liu et al., <xref ref-type="bibr" rid="B18">2022b</xref>), rescue (Jin et al., <xref ref-type="bibr" rid="B12">2023</xref>), and transportation (Li et al., <xref ref-type="bibr" rid="B15">2021</xref>) for &#x0201C;Dull, Dirty, Dangerous, and Deep&#x0201D; (4D) missions (Aleksander, <xref ref-type="bibr" rid="B1">2018</xref>; Chamola et al., <xref ref-type="bibr" rid="B3">2021</xref>). The applications of UAVs are illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>. During a mission, UAVs typically operate in swarms to accomplish their objectives. Consequently, the cooperative control and decision-making methods used by UAV swarms have become increasingly critical. Effective collaborative decision-making techniques can enhance the efficiency and effectiveness of mission accomplishment. 
However, current cooperative decision-making methods for UAVs, including non-learning methods and traditional heuristics, have limited capacity to manage conflicts between multiple aircraft and to balance adaptation to variable mission environments against mission time constraints. This area has therefore received significant attention from researchers seeking more robust and versatile methods for UAV cooperative decision-making.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Application scope of unmanned aerial vehicles (UAVs). UAVs have been widely utilized across various fields due to their numerous advantages.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0001.tif"/>
</fig>
<p>At present, methods for cooperative control and decision-making of UAV swarms are typically classified into two main categories: top-down and bottom-up (Giles and Giammarco, <xref ref-type="bibr" rid="B7">2019</xref>). Top-down approaches are primarily utilized for centralized collaborative control and decision-making, while bottom-up approaches are mainly applied to distributed collaborative decision-making and control (Wang et al., <xref ref-type="bibr" rid="B29">2022</xref>).</p>
<p>The main advantage of the top-down approach is its ability to decompose complex tasks into smaller, more manageable components. In the context of UAV swarm collaborative decision-making, this approach can be used to break down the task into a task assignment problem, a trajectory planning problem, and a swarm control problem (Tang et al., <xref ref-type="bibr" rid="B28">2023</xref>). For example, Zhang et al. (<xref ref-type="bibr" rid="B36">2022</xref>) proposed a method for assigning search and rescue tasks to a combination of helicopters and UAVs. They analyzed the search and rescue level of each point and the hovering endurance of the UAV using principal component analysis and cluster analysis. They then constructed a multi-objective optimization model and solved it using the non-dominated sorting genetic algorithm-II to assign tasks to the UAVs. Liu et al. (<xref ref-type="bibr" rid="B16">2021</xref>) utilized the &#x0201C;Divide and Conquer&#x0201D; approach to create a hierarchical task scheduling framework that decomposed the UAV scheduling problem into several subproblems. They proposed a tabu-list-based simulated annealing (SATL) algorithm for task assignment and a variable neighborhood descent (VND) algorithm for generating the scheduling scheme. In another study, Liu et al. (<xref ref-type="bibr" rid="B17">2022a</xref>) proposed a particle swarm optimization algorithm for cluster scheduling of UAVs performing remote sensing tasks in emergency scenarios. While centralized decision-making methods have better global reach and simpler structures, their communication and computational costs increase significantly with an increase in the number of UAVs in the swarm. Therefore, there is a need to develop a distributed cooperative decision-making method for UAV swarms.</p>
<p>The bottom-up approach facilitates cooperative decision-making of UAV swarms through the observation, judgment, decision-making, and distributed negotiation of individual UAVs. This approach aligns well with the observe-orient-decide-act (OODA) theory and is particularly suited for distributed decision-making scenarios (Puente-Castro et al., <xref ref-type="bibr" rid="B25">2022</xref>), which are increasingly becoming the future trend (Ouyang et al., <xref ref-type="bibr" rid="B20">2023</xref>).</p>
<p>Wang and Zhang (<xref ref-type="bibr" rid="B30">2022</xref>) proposed a UAV cluster task allocation method based on the bionic wolf pack approach, which decomposes task allocation into three processes: task assignment, path planning, and coverage search. The UAV swarm is modeled according to the characteristics of a wolf pack, and distributed collaborative decision-making is achieved through information sharing within the UAV swarm. Yang et al. (<xref ref-type="bibr" rid="B34">2022</xref>) presented a distributed task reallocation method for the dynamic environment where tasks need to be reassigned among a UAV swarm. They proposed a distributed decision framework based on time-type processing policies and used a partial reassignment algorithm (PRA) to generate conflict-free solutions with less data communication and faster execution. Wei et al. (<xref ref-type="bibr" rid="B31">2021</xref>) introduced a distributed UAV cluster computational offloading method that leverages distributed Q-learning and proposes a cooperative exploration-based, prioritized experience replay method using distributed deep reinforcement learning techniques. This approach achieves distributed computational offloading and outperforms traditional methods in terms of average processing time, energy-task efficiency, and convergence rate (Ouyang et al., <xref ref-type="bibr" rid="B20">2023</xref>).</p>
<p>In recent years, deep reinforcement learning has shown promising results in various fields, such as training championship-level racers in Gran Turismo (Wurman et al., <xref ref-type="bibr" rid="B32">2022</xref>), achieving all-time top-three Stratego game ranking (Perolat et al., <xref ref-type="bibr" rid="B23">2022</xref>), and optimizing matrix multiplication operations (Fawzi et al., <xref ref-type="bibr" rid="B5">2022</xref>). However, when addressing the challenge of cooperative decision-making in UAV swarms, reinforcement learning suffers from weak generalization ability, low sample utilization, and slow learning speed (Beck et al., <xref ref-type="bibr" rid="B2">2023</xref>). To address these challenges, researchers have turned to meta-reinforcement learning, which is currently a hot topic in machine learning.</p>
<p>Meta-learning, also referred to as learning to learn, is a technique that involves training on related tasks to acquire meta-knowledge, which can then be applied to a new environment. This approach reduces the number of samples required and increases the training speed in the new environment (Hospedales et al., <xref ref-type="bibr" rid="B8">2022</xref>). Researchers have proposed meta-reinforcement learning methods by combining meta-learning with reinforcement learning techniques. Meta-reinforcement learning enhances generalization ability and learning efficiency by using the acquired meta-knowledge to guide the subsequent training process and achieve cross-task learning with limited samples (Beck et al., <xref ref-type="bibr" rid="B2">2023</xref>). Despite its successful implementation in various fields (Chen et al., <xref ref-type="bibr" rid="B4">2022</xref>; Jiang et al., <xref ref-type="bibr" rid="B11">2022</xref>; Zhao et al., <xref ref-type="bibr" rid="B37">2023</xref>), meta-reinforcement learning has not yet been widely adopted in the field of cooperative decision-making for heterogeneous UAV swarms.</p>
<p>The experience replay mechanism is a critical technique in deep reinforcement learning, popularized by the deep Q-network (DQN) model (Mnih et al., <xref ref-type="bibr" rid="B19">2015</xref>). It improves data utilization, increases policy stability, and breaks correlations between states in the training data. To measure the priority of experience, Hou et al. (<xref ref-type="bibr" rid="B9">2017</xref>) proposed a method that uses the Temporal-Difference (TD) error, which improves the convergence speed of the algorithm. Pan et al. (<xref ref-type="bibr" rid="B21">2022</xref>) proposed a TD-Error and Time-based experience sampling method to reduce the influence of outdated experience. Li et al. (<xref ref-type="bibr" rid="B14">2022</xref>) introduced a Clustering experience replay (CER) method that clusters and replays transitions using a divide-and-conquer framework based on time division, effectively exploiting the experience hidden in all explored transitions in the current training. However, prioritized experience replay algorithms that consider only the TD-error tend to ignore the role of immediate rewards and of experiences with small TD-errors, and their learning performance is susceptible to TD-error outliers.</p>
<p>In this paper, we propose an improved MAML-based MADDPG algorithm to enhance the generalization capability, learning speed, and robustness of deep reinforcement learning methods for collaborative decision-making in heterogeneous UAV swarms. The proposed algorithm incorporates a Reward-TD prioritized experience replay mechanism and a buffer experience forgetting mechanism to improve overall performance. First, the paper describes the problem of cooperative attack on ground targets by UAV swarms, establishes the UAV motion model, and formulates the cooperative decision-making problem as a POMDP. Next, inspired by the Meta Weight Learning algorithm (Xu et al., <xref ref-type="bibr" rid="B33">2021</xref>), the paper proposes an improved meta-weight multi-agent deep deterministic policy gradient (MW-MADDPG) algorithm, which obtains an unbiased initialization model by assigning weights to trajectories and updates these meta-weights using gradients and momentum. To increase the effectiveness of the experience replay mechanism, the paper proposes a Reward-TD prioritized experience replay method with a forgetting mechanism. Finally, experiments are conducted to verify the generalization, robustness, and learning speed of the proposed approach. The main contributions of this paper are:</p>
<list list-type="order">
<list-item><p>Proposing the meta-weight multi-agent deep deterministic policy gradient (MW-MADDPG) algorithm for UAV swarm collaborative decision-making, which achieves end-to-end learning across tasks and can be applied to new scenarios quickly and stably after training.</p></list-item>
<list-item><p>Introducing the Reward-TD prioritized experience replay method to improve the convergence speed and utilization of experiences in the MW-MADDPG algorithm. The proposed method determines the priority of experience replay based on immediate reward and TD-error, thereby enhancing the quality of experience replay.</p></list-item>
<list-item><p>Employing a forgetting mechanism in the proposed MW-MADDPG algorithm to improve algorithm robustness and reduce overfitting. A threshold of sampling times is set to reduce the repetition of a small number of experiences during the experience replay process.</p></list-item>
</list>
</sec>
<sec id="s2">
<title>2. Background</title>
<sec>
<title>2.1. Reinforcement learning</title>
<p>Reinforcement learning is a trial-and-error technique for continuous learning, where an agent interacts with its external environment. The objective of the agent is to obtain the maximum cumulative reward from the external environment. Typically, reinforcement learning models the problem as a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), which allows the agent to make decisions based on current states and future rewards, without requiring knowledge of the full environment model. Through repeated interactions with the environment, the agent learns to select actions that lead to higher cumulative rewards, thereby improving its performance over time. A Markov decision process is usually represented by the tuple <italic>M</italic> &#x0003D; &#x0003C; <italic>S, A, T, R</italic>, &#x003B3; &#x0003E;, where <italic>S</italic> &#x0003D; {<italic>s</italic><sub>1</sub>, <italic>s</italic><sub>2</sub>, &#x022EF;, <italic>s</italic><sub>n</sub>} is the set of all possible states in the MDP; <italic>A</italic> &#x0003D; {<italic>a</italic><sub>1</sub>, <italic>a</italic><sub>2</sub>, &#x022EF;, <italic>a</italic><sub>m</sub>} is the set of all possible actions in the MDP; <italic>T</italic> is the state transition function; <italic>R</italic> is the reward function; and &#x003B3; &#x02208; [0, 1] is the discount factor, which indicates the degree of influence of future rewards on the current behavior of the agent. &#x003B3; &#x0003D; 1 indicates that future rewards count as much as the current reward, while &#x003B3; &#x0003D; 0 indicates that future rewards do not affect the agent&#x00027;s current action. 
In the reinforcement learning process, at each time step <italic>t</italic>, the agent is in state <italic>s</italic><sub><italic>t</italic></sub>, observes the environment, takes action <italic>a</italic><sub><italic>t</italic></sub>, receives a reward <italic>r</italic><sub><italic>t</italic>&#x0002B;1</sub> from the environment, and moves to the next state <italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub>. In an MDP, a state is called a Markov state when it satisfies the following condition:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The property that the next state depends only on the current state, and not on the states that preceded it, is known as the Markov property. In a Markov decision process (MDP), the state transition matrix <italic>P</italic> (also known as the state transition probability matrix) specifies the probability of transitioning from the current state <italic>s</italic> to the subsequent state <italic>s</italic>&#x02032;. Specifically, each element <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub></mml:math></inline-formula> represents the probability of transitioning from state <italic>s</italic> to state <italic>s</italic>&#x02032; under a given action.</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
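<p>As an illustration of Equations (1) and (2), the following minimal Python sketch stores a transition matrix as a list of rows and samples a state sequence from it. The three-state matrix, the random seed, and the function names are hypothetical and chosen only for this example; the action is assumed to be fixed, so each row is a single probability distribution over next states.</p>
<preformat><![CDATA[
```python
import random

def is_row_stochastic(P, tol=1e-9):
    """Return True if every row of P is a probability distribution."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in P
    )

def sample_next_state(P, s, rng):
    """Draw s' with probability P[s][s']; only the current state s is used."""
    return rng.choices(range(len(P)), weights=P[s], k=1)[0]

def rollout(P, s0, steps, rng):
    """Generate the state sequence s_0, ..., s_steps."""
    states = [s0]
    for _ in range(steps):
        states.append(sample_next_state(P, states[-1], rng))
    return states

# Hypothetical 3-state chain; state 2 is absorbing.
P = [
    [0.8, 0.2, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.0, 1.0],
]
rng = random.Random(0)
trajectory = rollout(P, 0, 20, rng)
```
]]></preformat>
<p>Because <monospace>sample_next_state</monospace> reads only the row of the current state, the sampled sequence satisfies the Markov property by construction.</p>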
<p>The return <italic>R</italic><sub><italic>t</italic></sub>, also called the cumulative reward, is the discounted sum of all rewards received from time step <italic>t</italic> until the end of the episode:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x0221E;</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
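<p>Equation (3) can be evaluated for a finite episode with a single backward pass over the rewards. The sketch below is a minimal Python illustration; it assumes the episode ends after the last listed reward, so the infinite sum is truncated, and the function names and sample rewards are hypothetical.</p>
<preformat><![CDATA[
```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma**k * r_{t+k+1}, with rewards = [r_{t+1}, r_{t+2}, ...]."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    # Backward recursion: R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def returns_per_step(rewards, gamma):
    """Return [R_0, R_1, ...] for one finite episode."""
    out = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    out.reverse()
    return out

rewards = [1.0, 0.0, 2.0]  # hypothetical rewards r_1, r_2, r_3
```
]]></preformat>
<p>With a discount factor of 1 the result is the plain sum of rewards, and with a discount factor of 0 only the first reward counts, which matches the two limiting cases of the discount factor described above.</p>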
<p>The reward function gives the expected immediate reward received after the agent takes action <italic>a</italic> in state <italic>s</italic>:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo>&#x1D53C;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02223;</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
</sec>
<sec>
<title>2.2. Multi-agent reinforcement learning</title>
<p>In a multi-agent system, each agent has a limited observation range and can only obtain local information, making it challenging to observe the global environment. This problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) defined by the tuple <italic>M</italic> &#x0003D; &#x0003C; <italic>N, S, A, P, R, O</italic>, &#x003B3; &#x0003E;. Here, <italic>N</italic> represents the set of agents, <italic>S</italic> represents the set of agent states, <italic>A</italic> &#x0003D; <italic>A</italic><sub>1</sub> &#x000D7; <italic>A</italic><sub>2</sub> &#x000D7; &#x022EF; &#x000D7; <italic>A</italic><sub><italic>N</italic></sub> represents the joint action set of the agents, where the action set of agent <italic>i</italic> is <italic>A</italic><sub><italic>i</italic></sub>, with <italic>i</italic> &#x02208; [1, <italic>N</italic>]. The state transition function <italic>P</italic>:<italic>S</italic> &#x000D7; <italic>A</italic> &#x000D7; <italic>S</italic> &#x02192; [0, 1] gives the probability of each state transition. <italic>R</italic> is the reward function for all agents, and <italic>O</italic> &#x0003D; <italic>O</italic><sub>1</sub> &#x000D7; <italic>O</italic><sub>2</sub> &#x000D7; &#x022EF; &#x000D7; <italic>O</italic><sub><italic>N</italic></sub> represents the joint observation of the agents, where <italic>O</italic><sub><italic>i</italic></sub> denotes the observation of agent <italic>i</italic>. Finally, &#x003B3; &#x02208; [0, 1] is the discount factor.</p>
<p>In a Dec-POMDP, each agent <italic>i</italic> selects an action based on its own observation <italic>O</italic><sub><italic>i</italic></sub> of the state <italic>s</italic><sub><italic>t</italic></sub>; the joint action leads to a transition to the next state <italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub>, and each agent receives an environmental reward <italic>r</italic><sub><italic>i</italic></sub>. The goal of each agent is to maximize the cumulative reward <inline-formula><mml:math id="M6"><mml:mi>G</mml:mi><mml:mo>=</mml:mo><mml:mstyle displaystyle="false"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msup><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. This paper employs the classical MARL algorithm MADDPG, with further details provided in Section 4.1.</p>
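<p>The interaction loop of a Dec-POMDP can be sketched in a few lines of Python. The toy environment below is hypothetical and is not the UAV scenario studied in this paper: agents move on a line toward a shared target, each agent observes only its own offset from the target, and each receives its own reward. The class name, the greedy policy, and all numeric settings are assumptions made for illustration.</p>
<preformat><![CDATA[
```python
from dataclasses import dataclass

@dataclass
class LineWorld:
    """Hypothetical toy Dec-POMDP: agents move on a line toward a shared target."""
    positions: list
    target: int = 0

    def observe(self, i):
        # Local observation O_i: agent i sees only its own offset from the target.
        return self.positions[i] - self.target

    def step(self, joint_action):
        # The joint action (a_1, ..., a_N) drives one transition s_t -> s_{t+1}.
        self.positions = [p + a for p, a in zip(self.positions, joint_action)]
        # One reward r_i per agent: the negative distance to the target.
        return [-abs(p - self.target) for p in self.positions]

def greedy_policy(obs):
    """Decentralized policy: act on the local observation only."""
    if obs > 0:
        return -1
    if obs < 0:
        return 1
    return 0

def run_episode(env, horizon, gamma):
    """Return G_i = sum_{t=0}^{T} gamma**t * r_i^t for every agent i."""
    n = len(env.positions)
    returns = [0.0] * n
    for t in range(horizon + 1):
        joint_action = [greedy_policy(env.observe(i)) for i in range(n)]
        rewards = env.step(joint_action)
        for i in range(n):
            returns[i] += (gamma ** t) * rewards[i]
    return returns

env = LineWorld(positions=[2, -1])
G = run_episode(env, horizon=3, gamma=0.5)
```
]]></preformat>
<p>Each policy call receives a single local observation, while the environment transition consumes the joint action, which mirrors the split between decentralized decisions and shared dynamics in the tuple above.</p>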
</sec>
<sec>
<title>2.3. Meta-learning</title>
<p>Meta-learning, also known as learning to learn, is a research direction aimed at training an initial model that can quickly adapt to new tasks with little data. Meta-learning comprises three phases: meta-training, meta-validation, and meta-testing. In the meta-training phase, a neural network is trained on support set data from a set of tasks to learn general knowledge shared by these tasks. In the meta-validation phase, query set data are used to verify the generalization of the model and to adjust the hyperparameters used in meta-learning. Finally, in the meta-testing phase, the model is tested on new tasks to evaluate the effect of training. The meta-learning paradigm is depicted in <xref ref-type="fig" rid="F2">Figure 2</xref>. In meta-reinforcement learning, a reinforcement learning task is formally defined as:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>T</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>H</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, <italic>L</italic><sub><italic>T</italic></sub> represents the loss function that maps a given trajectory &#x003C4; &#x0003D; (<italic>s</italic><sub>0</sub>, <italic>a</italic><sub>1</sub>, <italic>s</italic><sub>1</sub>, <italic>r</italic><sub>1</sub>, &#x02026;, <italic>a</italic><sub><italic>H</italic></sub>, <italic>s</italic><sub><italic>H</italic></sub>, <italic>r</italic><sub><italic>H</italic></sub>) to a loss value. <italic>P</italic><sub><italic>T</italic></sub>(<italic>s</italic>) denotes the initial state distribution, while <italic>P</italic><sub><italic>T</italic></sub>(<italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub>|<italic>s</italic><sub><italic>t</italic></sub>, <italic>a</italic><sub><italic>t</italic></sub>) refers to the state transition probability distribution. <italic>H</italic> corresponds to the trajectory length.</p>
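<p>One way to picture the task tuple of Equation (5) is as a small record with four fields. The Python sketch below is only an illustration under stated assumptions: the field names, the toy one-dimensional task, and the constant policy are hypothetical; the transition sampler is assumed to return the reward together with the next state, since the tuple lists no separate reward function; and the loss is assumed to be the negative sum of rewards along the trajectory.</p>
<preformat><![CDATA[
```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One task T = {L_T, P_T(s), P_T(s_{t+1} | s_t, a_t), H}."""
    loss_fn: Callable               # L_T: trajectory -> loss value
    sample_initial_state: Callable  # P_T(s): rng -> s_0
    sample_transition: Callable     # P_T(s_{t+1} | s_t, a_t): (s, a, rng) -> (s', r)
    horizon: int                    # H: trajectory length

def collect_trajectory(task, policy, rng):
    """Roll out one trajectory (s_0, a_1, s_1, r_1, ..., a_H, s_H, r_H)."""
    state = task.sample_initial_state(rng)
    trajectory = [state]
    for _ in range(task.horizon):
        action = policy(state)
        state, reward = task.sample_transition(state, action, rng)
        trajectory.extend([action, state, reward])
    return trajectory

GOAL = 3  # hypothetical goal position on a line

def negative_return(trajectory):
    # Rewards sit at indices 3, 6, 9, ... of the flat trajectory.
    return -sum(trajectory[3::3])

def move(state, action, rng):
    # Deterministic toy dynamics; rng is unused but kept for the general signature.
    next_state = state + action
    return next_state, -abs(next_state - GOAL)

task = Task(
    loss_fn=negative_return,
    sample_initial_state=lambda rng: 0,
    sample_transition=move,
    horizon=3,
)
trajectory = collect_trajectory(task, policy=lambda s: 1, rng=random.Random(0))
loss = task.loss_fn(trajectory)
```
]]></preformat>
<p>A family of such records that differ in their dynamics or loss would then play the role of a task distribution from which meta-training tasks are drawn.</p>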
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Schematic diagram of the meta-learning process. Meta-learning facilitates rapid adaptation to new tasks by leveraging knowledge acquired from previous tasks.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0002.tif"/>
</fig>
<p>This paper considers Model-Agnostic Meta-Learning (MAML), a general meta-learning algorithm that is independent of the model and can be applied to any model trained with gradient descent. MAML adapts deep neural network models through meta-gradient updates and works with a variety of architectures, including convolutional, fully connected, and recurrent neural networks. It can also be applied to different types of machine-learning problems, such as regression, classification, clustering, and reinforcement learning.</p>
<p>The main idea of MAML is to obtain an initial model that can be applied to a range of tasks and requires only a small amount of task-specific training to achieve good performance. Specifically, the initial strategy &#x003C0;<sub>&#x003B8;</sub> interacts with the environment of a new task drawn from the task distribution <italic>D</italic>(<italic>T</italic>) and collects <italic>K</italic> trajectories <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>:</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>; these trajectories are then used to minimize the loss on that task, which yields the adapted strategy &#x003C0;<sub>&#x003D5;</sub>.</p>
<p>MAML updates the parameters &#x003D5; of the strategy &#x003C0;<sub>&#x003D5;</sub> by computing the gradient of the loss function <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>:</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> w.r.t. the parameter &#x003B8;, and updating &#x003D5; as:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003D5;</mml:mi><mml:mo>=</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>-</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>:</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, <inline-formula><mml:math id="M11"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>:</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is the average loss over <italic>K</italic> trajectories, where <inline-formula><mml:math id="M12"><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0007E;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>|</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. The loss function <italic>L</italic><sub><italic>T</italic></sub>(&#x003C4;<sub>&#x003B8;</sub>) for each trajectory &#x003C4;<sub>&#x003B8;</sub> is defined as:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In Equation (6), &#x003B2; is the meta-learning rate.</p>
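<p>As a minimal illustration of the adaptation step in Equation (6), the sketch below applies one gradient step to a toy one-dimensional problem. The quadratic loss, the task optimum <italic>c</italic>, and all function names are assumptions made purely for illustration; in the setting described above, the loss would instead be the negative return of Equation (7), estimated from the <italic>K</italic> collected trajectories.</p>
<preformat>
```python
def task_loss(theta: float, c: float) -> float:
    """Hypothetical stand-in for L_T: squared distance to a task optimum c."""
    return (theta - c) ** 2


def task_loss_grad(theta: float, c: float) -> float:
    """Analytic gradient of task_loss with respect to theta."""
    return 2.0 * (theta - c)


def adapt(theta: float, c: float, beta: float) -> float:
    """One adaptation step: phi = theta - beta * grad_theta L_T (Equation 6)."""
    return theta - beta * task_loss_grad(theta, c)


theta = 0.0          # shared initial parameter (assumed value)
beta = 0.25          # step size, called the meta-learning rate in the text
task_optimum = 2.0   # hypothetical task drawn from D(T)

phi = adapt(theta, task_optimum, beta)
# The adapted parameter should give a lower loss on this task than theta did.
print(phi, task_loss(theta, task_optimum), task_loss(phi, task_optimum))
```
</preformat>
<p>With these assumed values, a single step moves the parameter toward the task optimum and reduces the task loss, which is the behavior that the initial model is trained to support.</p>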
</sec>
</sec>
<sec id="s3">
<title>3. Problem formulation</title>
<sec>
<title>3.1. Task description</title>
<p>The objective of the red-side UAVs in this paper is to destroy the strategic key location of the opponent (the blue side) while ensuring the survival of our side as far as possible. The blue side&#x00027;s strategic location is protected by surface-to-air missiles (SAMs), which have longer detection and attack ranges than our UAVs. Thus, the red UAVs must cooperate to achieve the mission objective, which may involve the strategic &#x0201C;sacrifice&#x0201D; of detect UAVs to locate SAM positions when necessary while minimizing the loss of attack UAVs. The strategy that the neural network learns depends on the adversary&#x00027;s strategy during training. Typically, the opponent&#x00027;s strategies are formulated by humans, so the training samples cannot cover every situation. To circumvent this issue, this work incorporates a large number of random variables into the SAM strategy modeling, such as the randomization of firing timing, firing number, and firing units. These variations make the battlefield environment different in each confrontation, which poses a challenge for the neural network. Although the location of the blue side&#x00027;s strategic key location is known beforehand, the locations of its SAMs are not, and they can vary from mission to mission. Therefore, the red-side UAV algorithm needs to adapt quickly. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the experimental environment.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Experimental environment diagram. The objective of the red UAV swarm is to eliminate the blue airports and command posts, while the blue SAM is tasked with defending these targets.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0003.tif"/>
</fig>
<sec>
<title>3.1.1. Force setting</title>
<p>Red side:</p>
<list list-type="bullet">
<list-item><p>Attack UAV: 3, each with a detection range of 35 km and an attack range of 30 km, carrying four anti-radiation missiles (ARM) and four air-to-ground missiles (ATG);</p></list-item>
<list-item><p>Detect UAV: 4, detection range 10 km.</p></list-item>
</list>
<p>Blue side:</p>
<list list-type="bullet">
<list-item><p>Strategic key location: command post, airport;</p></list-item>
<list-item><p>SAM: three sets, each called a fire unit, with an attack range of 35 km and a guidance radar detection range of 40 km.</p></list-item>
</list>
</sec>
<sec>
<title>3.1.2. Winning rules</title>
<p>Red side:</p>
<list list-type="bullet">
<list-item><p>Victory condition: command post is destroyed;</p></list-item>
<list-item><p>Failure condition: command post is not destroyed at the endgame.</p></list-item>
</list>
<p>Blue side:</p>
<list list-type="bullet">
<list-item><p>Victory condition: command post is not destroyed at the endgame;</p></list-item>
<list-item><p>Failure condition: command post is destroyed.</p></list-item>
</list>
</sec>
<sec>
<title>3.1.3. Battlefield environment settings</title>
<list list-type="bullet">
<list-item><p>The red side is unable to detect the position of the blue side&#x00027;s SAMs until the guidance radar of the blue side&#x00027;s fire unit is activated;</p></list-item>
<list-item><p>The information collected by the Red Detect UAV regarding fire units is automatically synchronized and shared with other Red UAVs;</p></list-item>
<list-item><p>In each game, the position of each fire unit remains unchanged;</p></list-item>
<list-item><p>The guidance radar of the fire units must be activated before they are able to launch their missiles;</p></list-item>
<list-item><p>Once the guidance radar of the fire units is turned on, it cannot be turned off again;</p></list-item>
<list-item><p>If the guidance radar of the fire unit is destroyed, the fire unit becomes inoperable and unable to launch missiles;</p></list-item>
<list-item><p>The guidance radar must be activated during the guidance procedure;</p></list-item>
<list-item><p>If the guidance radar of a fire unit is destroyed, any missiles launched by that unit will immediately self-destruct;</p></list-item>
<list-item><p>The ARM and ATG have a shooting range of 30 km and an 80% hit rate;</p></list-item>
<list-item><p>In the kill zone, the ARM and ATG have a high kill probability of 75% and a low kill probability of 55%.</p></list-item>
</list>
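<p>The radar-related rules above can be read as a small state model of a fire unit. The sketch below is one possible encoding, not the simulator used in this work; the class name, field names, and method names are hypothetical.</p>
<preformat>
```python
from dataclasses import dataclass


@dataclass
class FireUnit:
    """Toy model of one blue fire unit (hypothetical names, illustration only)."""
    radar_activated: bool = False   # one-way flag: never reset once set
    radar_destroyed: bool = False
    missiles_in_flight: int = 0

    def located_by_red(self) -> bool:
        # Red cannot locate the unit before its guidance radar is activated.
        return self.radar_activated

    def activate_radar(self) -> None:
        # There is deliberately no "deactivate" method: the radar cannot be turned off.
        self.radar_activated = True

    def can_launch(self) -> bool:
        # Launching needs a radar that has been activated and is still intact.
        return self.radar_activated and not self.radar_destroyed

    def launch(self) -> bool:
        """Try to launch one missile; return True if the launch happened."""
        if not self.can_launch():
            return False
        self.missiles_in_flight += 1
        return True

    def destroy_radar(self) -> None:
        # The unit becomes inoperable and its in-flight missiles self-destruct.
        self.radar_destroyed = True
        self.missiles_in_flight = 0


unit = FireUnit()
print(unit.launch())          # False: radar not yet activated
unit.activate_radar()
print(unit.launch())          # True: radar on and intact
unit.destroy_radar()
print(unit.missiles_in_flight, unit.can_launch())   # 0 False
```
</preformat>
<p>Hit rates and kill probabilities are left out of this sketch; they would enter as random draws when a missile reaches its target.</p>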
</sec>
</sec>
<sec>
<title>3.2. UAV kinematic model</title>
<p>Typically, the flight control of UAVs involves considering their six degrees of freedom, such as heading, pitch, and roll. However, in this paper, we focus on studying the application of deep reinforcement learning methods in multi-UAV cooperative mission planning while taking into account the maneuvering performance of UAVs, which generally do not perform large-angle maneuvers or drastic changes in acceleration. Therefore, we establish a simplified UAV motion model as follows:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M14"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x01E8B;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x01E8F;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo 
class="qopname">cos</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo class="qopname">sin</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003D6;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x0016B;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where (<italic>x</italic><sub><italic>i</italic></sub>, <italic>y</italic><sub><italic>i</italic></sub>) denotes the position of UAV <italic>i</italic>, &#x003C6;<sub><italic>i</italic></sub> and <italic>v</italic><sub><italic>i</italic></sub> denote the heading angle and velocity of UAV <italic>i</italic>, and &#x003D6;<sub><italic>i</italic></sub> and &#x0016B;<sub><italic>i</italic></sub> denote the angular velocity and acceleration of UAV <italic>i</italic>.</p>
<p>The UAV motion model has the following motion constraints:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M15"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
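<p>To make Equations (8) and (9) concrete, the sketch below advances the state of one UAV by a single time step. Forward-Euler integration and the clamping of out-of-range values are assumptions of this sketch, since the integration scheme and the way constraints are enforced are not specified here; the numeric limits are placeholders, not values from this work.</p>
<preformat>
```python
import math


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def euler_step(state, action, dt, limits):
    """One forward-Euler step of Equation (8), then clamping to Equation (9).

    state  : (x, y, phi, v)  position, heading angle, speed
    action : (omega, u)      angular velocity, acceleration
    limits : dict with keys x_max, y_max, v_min, v_max, phi_min, phi_max
    """
    x, y, phi, v = state
    omega, u = action
    x_next = x + v * math.cos(phi) * dt
    y_next = y + v * math.sin(phi) * dt
    phi_next = phi + omega * dt
    v_next = v + u * dt
    return (
        clamp(x_next, 0.0, limits["x_max"]),
        clamp(y_next, 0.0, limits["y_max"]),
        clamp(phi_next, limits["phi_min"], limits["phi_max"]),
        clamp(v_next, limits["v_min"], limits["v_max"]),
    )


# Placeholder limits chosen only so that the example runs.
LIMITS = {
    "x_max": 100.0, "y_max": 100.0,
    "v_min": 0.1, "v_max": 0.3,
    "phi_min": -math.pi, "phi_max": math.pi,
}

state = (10.0, 10.0, 0.0, 0.2)
state = euler_step(state, action=(0.0, 0.5), dt=1.0, limits=LIMITS)
print(state)   # the requested speed exceeds v_max, so it is clipped to v_max
```
</preformat>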
</sec>
<sec>
<title>3.3. POMDP model</title>
<p>This section models the decision problem for the UAVs as a POMDP and defines the observation space, action space, and reward function.</p>
<sec>
<title>3.3.1. Observation space</title>
<p>In this paper, the observation available to each UAV contains the information it needs for decision-making. For UAV <italic>i</italic>, the observation space is defined as <italic>O</italic><sub><italic>i</italic></sub> &#x0003D; (<italic>x</italic><sub><italic>i</italic></sub>, <italic>y</italic><sub><italic>i</italic></sub>, &#x003C6;<sub><italic>i</italic></sub>, <italic>v</italic><sub><italic>i</italic></sub>, <italic>c</italic><sub><italic>ij</italic></sub>, <italic>o</italic><sub><italic>ik</italic></sub>). Here, <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> represents the information obtained by UAV <italic>i</italic> from UAV <italic>j</italic> within its observation range. 
The action of UAV <italic>j</italic> at the previous moment is denoted by <inline-formula><mml:math id="M17"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003D6;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x0016B;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, where <italic>M</italic><sub><italic>j</italic></sub>(<italic>t</italic> &#x02212; 1) represents the action taken by UAV <italic>j</italic> in firing a missile. 
Additionally, <inline-formula><mml:math id="M18"><mml:msub><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> represents the information of fire unit <italic>k</italic> within UAV <italic>i</italic>&#x00027;s observation range. Here, <inline-formula><mml:math id="M19"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the state of the radar of fire unit <italic>k</italic> at the previous moment, while <inline-formula><mml:math id="M20"><mml:msubsup><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the missile-firing action taken by fire unit <italic>k</italic> at the previous moment.</p>
<p>Let the set of all UAVs be defined as <italic>D</italic> &#x0003D; {<italic>UAV</italic><sub>1</sub>, &#x02026;, <italic>UAV</italic><sub><italic>i</italic></sub>&#x02026;, <italic>UAV</italic><sub><italic>n</italic></sub>}. Here, <italic>UAV</italic><sub><italic>i</italic></sub> represents the UAV numbered <italic>i</italic> and <italic>n</italic> is the total number of UAVs. Similarly, let the set of all fire units be defined as <italic>F</italic> &#x0003D; {<italic>F</italic><sub>1</sub>, &#x02026;, <italic>F</italic><sub><italic>k</italic></sub>&#x02026;, <italic>F</italic><sub><italic>h</italic></sub>}, where <italic>F</italic><sub><italic>k</italic></sub> denotes the fire unit numbered <italic>k</italic>, and <italic>h</italic> is the total number of fire units.</p>
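<p>The structure of the observation <italic>O</italic><sub><italic>i</italic></sub> can be sketched as follows. The dictionary layout, the function names, and the single circular observation radius are assumptions of this sketch; it also ignores the rule that a fire unit is visible only after its guidance radar has been activated.</p>
<preformat>
```python
import math


def in_range(p, q, radius):
    """True if 2-D points p and q are at most radius apart."""
    return radius >= math.hypot(p[0] - q[0], p[1] - q[1])


def build_observation(i, uavs, fire_units, obs_range):
    """Assemble O_i = (x_i, y_i, phi_i, v_i, c_ij, o_ik) as nested tuples.

    uavs       : dict id -> dict(x, y, phi, v, prev_action)
    fire_units : dict id -> dict(x, y, prev_radar, prev_missile)
    """
    me = uavs[i]
    pos = (me["x"], me["y"])
    # c_ij: state and previous action of every other UAV j inside the range.
    c = tuple(
        (u["x"], u["y"], u["phi"], u["v"], u["prev_action"])
        for j, u in sorted(uavs.items())
        if j != i and in_range(pos, (u["x"], u["y"]), obs_range)
    )
    # o_ik: position, previous radar state and previous firing action of fire unit k.
    o = tuple(
        (f["x"], f["y"], f["prev_radar"], f["prev_missile"])
        for k, f in sorted(fire_units.items())
        if in_range(pos, (f["x"], f["y"]), obs_range)
    )
    return (me["x"], me["y"], me["phi"], me["v"], c, o)


uavs = {
    1: {"x": 0.0, "y": 0.0, "phi": 0.0, "v": 0.2, "prev_action": (0.0, 0.0, 0)},
    2: {"x": 3.0, "y": 4.0, "phi": 1.0, "v": 0.2, "prev_action": (0.1, 0.0, 0)},
    3: {"x": 50.0, "y": 0.0, "phi": 0.0, "v": 0.2, "prev_action": (0.0, 0.0, 0)},
}
fire_units = {1: {"x": 6.0, "y": 8.0, "prev_radar": 1, "prev_missile": 0}}

obs = build_observation(1, uavs, fire_units, obs_range=10.0)
print(len(obs[4]), len(obs[5]))   # 1 teammate and 1 fire unit in range
```
</preformat>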
</sec>
<sec>
<title>3.3.2. Action space</title>
<p>The action space in this paper includes angular velocity, acceleration, missile launch, and radar state. The specific action space is defined in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Actions definition.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Action variable</bold></th>
<th valign="top" align="left"><bold>Description</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003D6;<sub><italic>i</italic></sub>(<italic>t</italic>)</td>
<td valign="top" align="left">Angular velocity of UAV <italic>i</italic> at moment <italic>t</italic></td>
</tr> <tr>
<td valign="top" align="left">&#x0016B;<sub><italic>i</italic></sub>(<italic>t</italic>)</td>
<td valign="top" align="left">Acceleration of UAV <italic>i</italic> at moment <italic>t</italic></td>
</tr> <tr>
<td valign="top" align="left"><italic>M</italic><sub><italic>i</italic></sub>(<italic>t</italic>)</td>
<td valign="top" align="left">Index of the target attacked by the missile that UAV/fire unit <italic>i</italic> fires at time <italic>t</italic>; the initial value is 0</td>
</tr>
<tr>
<td valign="top" align="left"><inline-formula><mml:math id="M21"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></td>
<td valign="top" align="left">Fire unit <italic>k</italic> radar state at moment <italic>t</italic> (0 for off, 1 for on)</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>3.3.3. Reward function</title>
<p>Both the red and blue sides field a large number of units, so the state and action spaces are large. Providing a single reward value at the end of each battle round would result in sparse rewards and make it difficult for agents to find winning states through exploration alone. Therefore, a well-designed reward function is essential to guide the agent&#x00027;s learning effectively.</p>
<p>The approach is to assign a reward value to each type of unit on both the red and blue sides, so that the loss of a unit during the battle triggers the corresponding reward (negative for losses suffered by our side, positive for those suffered by the opposing side). Additionally, to encourage the UAV to approach the fire unit, a reward is provided when the UAV moves closer to the target.</p>
<p>Because each round lasts a long time, rewards based solely on wins and losses would also lead to long training times. To speed up training and improve the quality of the feedback, three reward types are combined: episodic rewards, key event-driven rewards, and distance-based rewards. The detailed reward design is presented in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>UAV reward definition.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Categories</bold></th>
<th valign="top" align="left"><bold>Event name</bold></th>
<th valign="top" align="center"><bold>Weights</bold></th>
<th valign="top" align="left"><bold>Description</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="2">Episodic reward</td>
<td valign="top" align="left">Win</td>
<td valign="top" align="center">10</td>
<td valign="top" align="left">Red side wins</td>
</tr> <tr>
<td valign="top" align="left">Loss</td>
<td valign="top" align="center">0</td>
<td valign="top" align="left">Red side loses</td>
</tr> <tr>
<td valign="top" align="left" rowspan="5">Event-based reward</td>
<td valign="top" align="left">Destroy command post</td>
<td valign="top" align="center">5</td>
<td valign="top" align="left">A UAV destroys the opponent&#x00027;s command post</td>
</tr> <tr>
<td valign="top" align="left">Destroy airport</td>
<td valign="top" align="center">3</td>
<td valign="top" align="left">A UAV destroys the opponent&#x00027;s airport</td>
</tr> <tr>
<td valign="top" align="left">Destroy fire unit radar</td>
<td valign="top" align="center">2</td>
<td valign="top" align="left">A UAV destroys the radar of a fire unit</td>
</tr> <tr>
<td valign="top" align="left">Detect UAV destroyed</td>
<td valign="top" align="center">&#x02212;0.5</td>
<td valign="top" align="left">One of the detect UAVs is destroyed</td>
</tr> <tr>
<td valign="top" align="left">Attack UAV destroyed</td>
<td valign="top" align="center">&#x02212;1</td>
<td valign="top" align="left">One of the attack UAVs is destroyed</td>
</tr>
<tr>
<td valign="top" align="left" colspan="2">Distance based reward</td>
<td valign="top" align="center">&#x003BB;&#x000B7;<italic>d</italic><sub><italic>i</italic></sub></td>
<td valign="top" align="left"><inline-formula><mml:math id="M22"><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo class="qopname">min</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msqrt><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msqrt></mml:mrow></mml:mrow></mml:math></inline-formula>, <italic>i</italic> &#x02208; <italic>D</italic>, <italic>k</italic> &#x02208; <italic>F</italic>). The weighting factor &#x003BB; determines the magnitude of the distance-based reward</td>
</tr>
</tbody>
</table>
</table-wrap>
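<p>The way the entries of Table 2 combine into a single reward can be sketched as a weighted sum. The event weights below are taken from Table 2; the function names, the event keys, and the value of &#x003BB; are assumptions of this sketch. In particular, the sign of &#x003BB; is not stated in this section, and a negative value is used here so that a smaller distance gives a larger reward.</p>
<preformat>
```python
import math

# Event weights copied from Table 2; the key names are hypothetical.
EVENT_WEIGHTS = {
    "win": 10.0,
    "loss": 0.0,
    "destroy_command_post": 5.0,
    "destroy_airport": 3.0,
    "destroy_fire_unit_radar": 2.0,
    "detect_uav_destroyed": -0.5,
    "attack_uav_destroyed": -1.0,
}


def min_distance(uav_xy, fire_unit_xys):
    """d_i: distance from UAV i to the nearest fire unit."""
    return min(math.hypot(uav_xy[0] - fx, uav_xy[1] - fy) for fx, fy in fire_unit_xys)


def step_reward(events, uav_xy, fire_unit_xys, lam):
    """Weighted event rewards plus the distance term lam * d_i.

    events: dict mapping an event name to how often it occurred in this step.
    """
    event_term = sum(EVENT_WEIGHTS[name] * count for name, count in events.items())
    return event_term + lam * min_distance(uav_xy, fire_unit_xys)


reward = step_reward(
    events={"destroy_fire_unit_radar": 1, "detect_uav_destroyed": 2},
    uav_xy=(0.0, 0.0),
    fire_unit_xys=[(3.0, 4.0), (6.0, 8.0)],
    lam=-0.1,   # assumed value and sign
)
print(reward)
```
</preformat>
<p>In this example, destroying one radar (2) and losing two detect UAVs (&#x02212;1 in total) gives an event term of 1, and the nearest fire unit is 5 units away, so the distance term is &#x02212;0.5.</p>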
</sec>
</sec>
</sec>
<sec id="s4">
<title>4. Method</title>
<sec>
<title>4.1. MADDPG-based collaborative decision-making method</title>
<p>Traditional single-agent reinforcement learning algorithms face challenges when dealing with collaborative multi-UAV tasks, such as large action spaces and non-stationary environments. In a multi-agent system, the state and action spaces grow as the number of agents increases. In addition, each agent&#x00027;s actions change the environment experienced by the other agents, so from the perspective of any single agent the environment is no longer static. For these reasons, traditional single-agent reinforcement learning algorithms are ineffective in a multi-agent environment. To address this problem, this paper employs the MADDPG algorithm in the framework of centralized training and decentralized execution. This approach alleviates the difficulties associated with fully centralized or fully decentralized algorithms by striking a balance between the two.</p>
<p>In contrast to traditional DRL algorithms, the MADDPG algorithm can leverage global information during training while utilizing only local information for decision-making. The following method is employed:</p>
<p>Suppose there are <italic>M</italic> agents in the multi-agent system, with a set of strategy networks denoted as &#x003BC; &#x0003D; (&#x003BC;<sub>1</sub>, &#x003BC;<sub>2</sub>, &#x022EF;, &#x003BC;<sub><italic>M</italic></sub>), where &#x003BC;<sub><italic>i</italic></sub> represents the strategy network of the <italic>i</italic>-th agent. Additionally, there is a set of value networks denoted as <italic>q</italic> &#x0003D; (<italic>q</italic><sub>1</sub>, <italic>q</italic><sub>2</sub>, &#x022EF;, <italic>q</italic><sub><italic>M</italic></sub>), where <italic>q</italic><sub><italic>i</italic></sub> represents the value network of the <italic>i</italic>-th agent. The parameter set for the strategy networks is denoted as &#x003B8; &#x0003D; (&#x003B8;<sub>1</sub>, &#x003B8;<sub>2</sub>, &#x022EF;, &#x003B8;<sub><italic>M</italic></sub>), where &#x003B8;<sub><italic>i</italic></sub> represents the strategy parameters of the <italic>i</italic>-th agent. Similarly, the parameter set for the value networks is denoted as &#x003C9; &#x0003D; (&#x003C9;<sub>1</sub>, &#x003C9;<sub>2</sub>, &#x022EF;, &#x003C9;<sub><italic>M</italic></sub>), where &#x003C9;<sub><italic>i</italic></sub> represents the value network parameters of the <italic>i</italic>-th agent. The objective function for the <italic>i</italic>-th agent is expressed as follows:</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>For the deterministic policy &#x003BC;<sub><italic>i</italic></sub>, the policy gradient can be expressed as:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>O</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, &#x02207; represents the gradient operator.</p>
<p>A state is sampled from the experience pool D as follows: <inline-formula><mml:math id="M26"><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, which serves as a realization of the random variable <italic>S</italic>. Each agent&#x00027;s action is then obtained from its policy network as:</p>
<disp-formula id="E13"><label>(12)</label><mml:math id="M27"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003BC;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003BC;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The gradient of the objective function is:</p>
<disp-formula id="E14"><label>(13)</label><mml:math id="M28"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003BC;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" 
class="thinspace"/><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The update rule for the policy network parameters is:</p>
<disp-formula id="E15"><label>(14)</label><mml:math id="M29"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, &#x003B1;<sub>1</sub> represents the actor learning rate.</p>
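<p>The policy update in Equations 12 to 14 can be illustrated with a minimal sketch. The scalar observations, the linear policies, the quadratic critic, and all numerical values below are assumptions chosen only to keep the example self-contained; they are not the networks used in this paper.</p>
<preformat>
```python
# Toy illustration of Eqs. 12-14 with scalar observations and actions.
# ASSUMPTIONS: linear policies and a quadratic centralized critic (hypothetical).


def policy(obs, theta):
    """Deterministic policy mu(o; theta) = theta * o (hypothetical linear form)."""
    return theta * obs


def critic(actions, goals):
    """Hypothetical centralized critic: larger when each action is near its goal."""
    return -sum((a - g) ** 2 for a, g in zip(actions, goals))


def policy_gradient(i, observations, thetas, goals):
    """g_i = d(mu_i)/d(theta_i) * d(q)/d(a_i), evaluated at the current joint action."""
    actions = [policy(o, th) for o, th in zip(observations, thetas)]
    dmu_dtheta = observations[i]             # derivative of theta * o w.r.t. theta
    dq_dai = -2.0 * (actions[i] - goals[i])  # derivative of the quadratic critic
    return dmu_dtheta * dq_dai


observations = [1.0, 2.0, 0.5]  # one local observation per agent (M = 3)
thetas = [0.0, 0.0, 0.0]        # policy parameters theta_1 .. theta_M
goals = [1.0, -1.0, 0.5]        # constants used only by the toy critic
learning_rate = 0.1             # actor learning rate (illustrative value)

q_before = critic([policy(o, th) for o, th in zip(observations, thetas)], goals)
for i in range(len(thetas)):
    # Gradient ascent on the critic value, as in the policy update rule.
    thetas[i] += learning_rate * policy_gradient(i, observations, thetas, goals)
q_after = critic([policy(o, th) for o, th in zip(observations, thetas)], goals)
```
</preformat>
<p>In this toy setting a single ascent step per agent raises the critic value, which is the behavior the update rule is intended to produce. The critic sees the joint action of all agents, whereas each policy only sees its own observation.</p>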
<p>The value network is updated through the TD algorithm as follows:</p>
<p>For the value network <italic>q</italic><sub><italic>i</italic></sub>(<italic>s, a</italic>; &#x003C9;<sub><italic>i</italic></sub>) of agent <italic>i</italic>, given the tuple (<italic>s</italic><sub><italic>t</italic></sub>, <italic>a</italic><sub><italic>t</italic></sub>, <italic>r</italic><sub><italic>t</italic></sub>, <italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub>), the actions at the next time step are computed from the policy networks as:</p>
<disp-formula id="E16"><label>(15)</label><mml:math id="M30"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003BC;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003BC;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Let <inline-formula><mml:math id="M31"><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>. The TD target is computed as:</p>
<disp-formula id="E17"><label>(16)</label><mml:math id="M32"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mi>q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The TD-error is calculated as:</p>
<disp-formula id="E18"><label>(17)</label><mml:math id="M33"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The value network parameters are then updated using gradient descent w.r.t. &#x003C9;<sub><italic>i</italic></sub>.</p>
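<p>The TD target, the TD error, and the subsequent gradient-descent step can be sketched as follows. The linear critic, the feature vectors, the squared TD-error loss, and the numerical values are assumptions made for illustration; the text above only states that gradient descent with respect to &#x003C9;<sub><italic>i</italic></sub> is used.</p>
<preformat>
```python
# Sketch of Eqs. 16-17 followed by one critic update.
# ASSUMPTIONS: linear critic q = omega . features, loss 0.5 * delta**2,
# and the TD target treated as a constant during the update.


def q_value(omega, features):
    """Hypothetical linear critic q(s, a; omega) = omega . phi(s, a)."""
    return sum(w * f for w, f in zip(omega, features))


def td_target(reward, gamma, q_next):
    """y = r + gamma * q(s_next, a_next; omega)."""
    return reward + gamma * q_next


def td_error(q_now, target):
    """delta = q(s, a; omega) - y."""
    return q_now - target


def critic_step(omega, features, delta, learning_rate):
    """One gradient-descent step on 0.5 * delta**2 with the target held fixed."""
    return [w - learning_rate * delta * f for w, f in zip(omega, features)]


omega = [0.5, -0.2]          # critic parameters of agent i
features_t = [1.0, 2.0]      # phi(s_t, a_t), illustrative values
features_next = [0.5, 1.0]   # phi(s_next, a_next), illustrative values
reward, gamma, learning_rate = 1.0, 0.9, 0.1

y = td_target(reward, gamma, q_value(omega, features_next))
delta = td_error(q_value(omega, features_t), y)
omega_new = critic_step(omega, features_t, delta, learning_rate)
delta_new = td_error(q_value(omega_new, features_t), y)
```
</preformat>
<p>After one step the magnitude of the TD error on this sample shrinks, which is the purpose of the update.</p>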
<p>The target network parameters of each agent <italic>i</italic> are then updated as:</p>
<disp-formula id="E19"><label>(18)</label><mml:math id="M34"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, &#x003C4;<sub>1</sub> is the soft update parameter.</p>
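<p>A minimal sketch of the soft update in Equation 18 is given below. Parameters are represented as plain lists of floats and the value of the soft update parameter is illustrative; both are assumptions made to keep the example short.</p>
<preformat>
```python
# Sketch of the soft target update: theta_target = tau * theta + (1 - tau) * theta_target.
# ASSUMPTION: parameters are flat lists of floats; tau = 0.01 is an illustrative value.


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]


online = [1.0, 2.0]
target = [0.0, 0.0]
target = soft_update(target, online, tau=0.01)
```
</preformat>
<p>A small soft update parameter makes the target network track the online network slowly, which keeps the TD target from changing abruptly between updates.</p>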
</sec>
<sec>
<title>4.2. Improved algorithm for MAML</title>
<p>This paper presents an improvement to the traditional MAML algorithm. The original MAML algorithm averages the gradients over the tasks in the task distribution, giving each task the same weight. However, this can lead to biased models that perform better on one task than on others. To overcome this issue, we propose an improved MAML method that introduces weights during the gradient update of different trajectories and incorporates an automatic weight calculation method. This approach aims to obtain an unbiased initial network model.</p>
<p>The traditional MAML method treats the gradients of all trajectories identically during the trajectory update process. This paper proposes a trajectory weighting method that borrows the idea of the Adam algorithm and uses gradient and momentum values to set the weights. This approach avoids subjective, manual weight assignment and accelerates the convergence of the objective function to its minimum value.</p>
<p>The objective function for meta-learning in this paper is expressed as:</p>
<disp-formula id="E20"><label>(19)</label><mml:math id="M35"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>:</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>:</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0007E;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C4;</mml:mi><mml:mo>|</mml:mo><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, to satisfy the normalization condition, let <inline-formula><mml:math id="M36"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:math></inline-formula> be the weight of the k-th trajectory, where K is the total number of trajectories.</p>
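<p>The weighted objective in Equation 19 and the normalization of the weights can be sketched as follows. The per-trajectory loss values and raw weights are made-up numbers used only for illustration.</p>
<preformat>
```python
# Sketch of Eq. 19: L = sum_k W_k * L_k with W_k = w_k / (w_1 + ... + w_K).
# ASSUMPTION: the losses and raw weights below are illustrative numbers.


def normalize(raw_weights):
    """Return W_k = w_k / sum(w) so that the weights sum to one."""
    total = sum(raw_weights)
    return [w / total for w in raw_weights]


def weighted_loss(raw_weights, losses):
    """Weighted meta-objective over K trajectories."""
    return sum(W * loss for W, loss in zip(normalize(raw_weights), losses))


losses = [0.4, 1.0, 0.1]                               # L_T(tau^k) for K = 3
weighted = weighted_loss([1.0, 1.0, 2.0], losses)      # unequal raw weights
uniform = weighted_loss([1.0, 1.0, 1.0], losses)       # equal weights
```
</preformat>
<p>With equal raw weights the objective reduces to the plain average of the trajectory losses, which corresponds to the unweighted update of the original MAML.</p>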
<p>To obtain the optimal weights <inline-formula><mml:math id="M37"><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> that minimize the objective function, we update the weights <italic>w</italic><sub><italic>k</italic></sub> by computing their gradient. The gradient of the objective function w.r.t. the weights <italic>w</italic><sub><italic>k</italic></sub> is given as:</p>
<disp-formula id="E21"><label>(20)</label><mml:math id="M38"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>:</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02026;</mml:mo><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02026;</mml:mo><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02026;</mml:mo><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02026;</mml:mo><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
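<p>The closed-form gradient in the last line of Equation 20 can be checked numerically. The sketch below compares it with a central finite difference of the weighted objective; the loss values, raw weights, and step size are illustrative assumptions.</p>
<preformat>
```python
# Numerical check of Eq. 20:
# dL/dw_k = sum_i w_i * (L_k - L_i) / (w_1 + ... + w_K)**2
# ASSUMPTION: losses, raw weights, and the finite-difference step are illustrative.


def weighted_loss(raw_weights, losses):
    """L = sum_k (w_k / sum(w)) * L_k."""
    total = sum(raw_weights)
    return sum(w / total * loss for w, loss in zip(raw_weights, losses))


def weight_gradient(k, raw_weights, losses):
    """Closed-form gradient of the weighted loss with respect to w_k."""
    total = sum(raw_weights)
    numerator = sum(w_i * (losses[k] - loss_i)
                    for w_i, loss_i in zip(raw_weights, losses))
    return numerator / total ** 2


def numerical_gradient(k, raw_weights, losses, step=1e-6):
    """Central finite difference of the weighted loss with respect to w_k."""
    up = list(raw_weights)
    down = list(raw_weights)
    up[k] += step
    down[k] -= step
    return (weighted_loss(up, losses) - weighted_loss(down, losses)) / (2.0 * step)


raw_weights = [1.0, 1.0, 2.0]
losses = [0.4, 1.0, 0.1]
analytic = [weight_gradient(k, raw_weights, losses) for k in range(3)]
numeric = [numerical_gradient(k, raw_weights, losses) for k in range(3)]
```
</preformat>
<p>The closed form is equivalent to the difference between the loss of trajectory <italic>k</italic> and the weighted mean loss, divided by the sum of the raw weights. A trajectory whose loss is above the weighted mean therefore has a positive gradient, and a descent step lowers its weight.</p>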
<p>Drawing inspiration from the Adam optimization algorithm, we define the following quantities for each weight:</p>
<p>First-order momentum: <inline-formula><mml:math id="M39"><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></p>
<p>Second-order momentum: <inline-formula><mml:math id="M40"><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></p>
<p>Bias-corrected first moment estimate: <inline-formula><mml:math id="M41"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></p>
<p>Bias-corrected second moment estimate: <inline-formula><mml:math id="M42"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></p>
<p>The updated weight for the next time: <inline-formula><mml:math id="M43"><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x02190;</mml:mo><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msqrt><mml:mrow><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msqrt><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></p>
<p>where &#x003B2;<sub>1</sub> and &#x003B2;<sub>2</sub> are the exponential decay rates for the moment estimates, &#x003B5; &#x0003D; 10<sup>&#x02212;8</sup> is a small constant (fuzz factor) that prevents division by zero, and &#x003B1; is the weight learning rate.</p>
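<p>As an illustration of the weight update, the following minimal Python sketch performs one Adam-style step on a list of trajectory weights. It is not the implementation used in this paper: the function name <monospace>adam_weight_step</monospace> and the default values of &#x003B1;, &#x003B2;<sub>1</sub>, and &#x003B2;<sub>2</sub> are assumptions (common Adam defaults), and the sketch assumes the standard Adam form, in which the bias correction divides by 1 &#x02212; &#x003B2;<sup><italic>t</italic></sup> and the bias-corrected first moment appears in the numerator of the update.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def adam_weight_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step for K trajectory weights.

    w, g, m, v are lists of K floats (weights, gradients, first and second
    moments); t is the step counter, starting at 1.
    alpha, beta1, beta2 defaults are assumed values, not taken from the paper.
    """
    new_w, new_m, new_v = [], [], []
    for w_k, g_k, m_k, v_k in zip(w, g, m, v):
        m_k = beta1 * m_k + (1.0 - beta1) * g_k        # first-order momentum
        v_k = beta2 * v_k + (1.0 - beta2) * g_k * g_k  # second-order momentum
        m_hat = m_k / (1.0 - beta1 ** t)               # bias correction (standard Adam form)
        v_hat = v_k / (1.0 - beta2 ** t)
        new_w.append(w_k - alpha * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_k)
        new_v.append(v_k)
    return new_w, new_m, new_v
```
]]></preformat>
<p>On the first step, starting from zero moments, the bias-corrected ratio has magnitude close to one, so each weight moves by roughly &#x003B1; in the direction opposite to its gradient.</p>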
<p>Meta update: <inline-formula><mml:math id="M44"><mml:mi>&#x003BE;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003BE;</mml:mi><mml:mo>-</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow></mml:msub><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></p>
<p>where &#x003B2; is the meta-learning rate.</p>
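<p>The meta update can be sketched as follows, under the assumption that the normalized weights <italic>W</italic><sub><italic>k</italic></sub> are treated as constants when differentiating, so that the gradient of the weighted sum equals the weighted sum of the per-trajectory gradients. The helper names <monospace>normalize_weights</monospace> and <monospace>meta_update</monospace>, the list-based parameter representation, and the default meta-learning rate are hypothetical and serve only as an illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
def normalize_weights(w):
    """W_k = w_k / sum_k w_k, as in the normalization step of Algorithm 1."""
    total = sum(w)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [w_k / total for w_k in w]


def meta_update(xi, grads, w, beta=0.01):
    """xi <- xi - beta * sum_k W_k * grad_k.

    xi:    list of D model parameters
    grads: K lists of D floats, the gradient of L_T(tau^k) w.r.t. xi
    w:     K raw trajectory weights
    beta:  meta-learning rate (assumed default)
    """
    W = normalize_weights(w)
    new_xi = list(xi)
    for W_k, g_k in zip(W, grads):
        for d, g in enumerate(g_k):
            new_xi[d] -= beta * W_k * g
    return new_xi
```
]]></preformat>
<p>For example, with raw weights (1, 3) the second trajectory contributes three quarters of the meta-gradient and the first one quarter.</p>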
<p>The proposed improved MAML algorithm is presented in <xref ref-type="table" rid="T5">Algorithm 1</xref>.</p>
<table-wrap position="float" id="T5">
<label>Algorithm 1</label>
<caption><p>MW-MADDPG algorithm.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left" valign="top"><monospace> <bold>Input</bold>: &#x000A0;Weight learning rate &#x003B1;, meta-learning rate &#x003B2;, and exponential decay rates &#x003B2;<sub>1</sub>, &#x003B2;<sub>2</sub>;</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> <bold>Input</bold>: &#x000A0;The distribution over tasks <italic>P</italic><sub><italic>T</italic></sub>(<italic>s</italic>);</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 1: &#x000A0;Initialize model parameters &#x003BE;</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 2: &#x000A0;<bold>for</bold> <italic>i</italic> &#x0003D; 1, &#x022EF;&#x000A0;&#x000A0;&#x000A0;, <italic>N</italic> <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 3: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Sample batch of tasks <italic>T</italic><sub><italic>i</italic></sub> &#x0007E; <italic>P</italic><sub><italic>T</italic></sub>(<italic>s</italic>)</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 4: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>for</bold> <italic>k</italic> &#x0003D; 1, &#x022EF;&#x000A0;&#x000A0;&#x000A0;, <italic>K</italic> <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 5: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Sample trajectory <inline-formula><mml:math id="M45"><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> from <italic>T</italic><sub><italic>i</italic></sub> using <xref ref-type="table" rid="T6">Algorithm 2</xref></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 6: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute the gradient of <inline-formula><mml:math id="M46"><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> w.r.t. &#x003BE;<sub><italic>k</italic></sub>: <inline-formula><mml:math id="M47"><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 7: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Optimize &#x003BE; with gradient descent: <inline-formula><mml:math id="M48"><mml:msubsup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 8: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Re-sample K trajectories <inline-formula><mml:math id="M49"><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>:</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 9: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>end for</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 10: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>for</bold> all <inline-formula><mml:math id="M50"><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>:</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 11: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute the gradient of the objective function w.r.t. the weights <italic>w</italic><sub><italic>k</italic></sub>: <inline-formula><mml:math id="M51"><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 12: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute the first-order and second-order momentum:</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> &#x000A0; <inline-formula><mml:math id="M52"><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> &#x000A0; <inline-formula><mml:math id="M53"><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 13: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute the bias-corrected first and second-moment estimates:</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> &#x000A0; <inline-formula><mml:math id="M54"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> &#x000A0; <inline-formula><mml:math id="M55"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 14: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Update the model weights: <inline-formula><mml:math id="M56"><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x02190;</mml:mo><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>/</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msqrt><mml:mrow><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msqrt><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 15: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Calculate <inline-formula><mml:math id="M57"><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:math></inline-formula> for each trajectory</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 16: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>end for</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 17: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Meta update: <inline-formula><mml:math id="M58"><mml:mi>&#x003BE;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003BE;</mml:mi><mml:mo>-</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow></mml:msub><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BE;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 18: &#x000A0;<bold>end for</bold></monospace></td></tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>4.3. Improved prioritized experience replay mechanism</title>
<sec>
<title>4.3.1. Prioritized experience replay method based on immediate rewards and TD-error</title>
<p>Prioritized experience replay methods typically rank experiences by the magnitude of their TD-error to speed up network convergence and improve experience utilization. In this approach, the sampling probability is proportional to the absolute value of the TD-error, regardless of how much the experience contributes to task performance. To address this limitation, this paper proposes an experience replay method that also includes the immediate reward of an action in the prioritization process. By considering the immediate reward together with the TD-error, the method can more accurately prioritize the experiences that contribute most to task completion.</p>
<p>The priority based on TD-error is defined as:</p>
<disp-formula id="E22"><label>(21)</label><mml:math id="M59"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>|</mml:mo><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B5;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B5; is a small constant that ensures the priority value is not zero.</p>
<p>The priority based on immediate rewards is given as:</p>
<disp-formula id="E23"><label>(22)</label><mml:math id="M60"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B5;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Ranking the experiences by each of these priorities yields <italic>rank</italic><sub><italic>r</italic></sub>(<italic>i</italic>) and <italic>rank</italic><sub><italic>T</italic></sub>(<italic>i</italic>). The combined ranking takes both priorities into account and is computed as:</p>
<disp-formula id="E24"><label>(23)</label><mml:math id="M61"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003C1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, &#x003C1; is the importance coefficient that regulates the relative weight of the two rankings. When &#x003C1; &#x0003D; 0, only the TD-error is considered; when &#x003C1; &#x0003D; 1, only the immediate reward is considered.</p>
<p>The combined priority of an experience is given as:</p>
<disp-formula id="E25"><label>(24)</label><mml:math id="M62"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, &#x003B7; is the priority importance parameter that determines the degree of consideration given to priority. When &#x003B7; &#x0003D; 0, we have uniform experience sampling.</p>
<p>The experience sampling probability of an experience is obtained by normalizing its combined priority w.r.t. all experiences in the replay buffer:</p>
<disp-formula id="E26"><label>(25)</label><mml:math id="M63"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This probability is used to sample experiences from the replay buffer during the learning process. Experiences with higher combined priorities are more likely to be sampled.</p>
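<p>Equations (21) to (25) can be sketched in a few lines of Python. The sketch below is illustrative only: it assumes that rank 1 is assigned to the largest priority and that ties are broken by buffer index, neither of which is specified above, and the function names and the default values of &#x003C1;, &#x003B7;, and &#x003B5; are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
def ranks_descending(values):
    """Assign rank 1 to the largest value (assumption); ties are broken by index."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def sampling_probabilities(td_errors, rewards, rho=0.5, eta=0.6, eps=1e-6):
    """Reward-TD combined sampling probabilities for a replay buffer."""
    p_td = [abs(d) + eps for d in td_errors]   # Eq. (21): TD-error priority
    p_r = [r + eps for r in rewards]           # Eq. (22): immediate-reward priority
    rank_td = ranks_descending(p_td)
    rank_r = ranks_descending(p_r)
    # Eq. (23): combined rank, weighted by the importance coefficient rho
    rank_c = [rho * rr + (1.0 - rho) * rt for rr, rt in zip(rank_r, rank_td)]
    # Eq. (24): combined priority; eta = 0 gives uniform sampling
    p_c = [(1.0 / rc) ** eta for rc in rank_c]
    total = sum(p_c)
    return [p / total for p in p_c]            # Eq. (25): normalization
```
]]></preformat>
<p>Because the probabilities depend only on ranks, they are insensitive to the scale of the rewards and TD-errors, and a single outlier cannot dominate the sampling distribution.</p>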
</sec>
<sec>
<title>4.3.2. Forgetting mechanism</title>
<p>The immediate reward and TD-error are used to evaluate the learning value of experiences in the replay buffer, but excessive sampling of high-priority experiences can lead to overfitting. To alleviate this issue, this paper introduces a forgetting mechanism.</p>
<p>The forgetting mechanism sets a sampling threshold &#x003C8;. When the number of times an experience has been sampled, denoted as <italic>m</italic><sub><italic>i</italic></sub>, exceeds this threshold, its sampling probability is set to zero. This helps prevent overfitting by reducing the impact of experiences that have been repeatedly sampled.</p>
<p>The updated sampling probability of experience <italic>i</italic> after being processed by the forgetting mechanism is denoted as <inline-formula><mml:math id="M64"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, and is given by:</p>
<disp-formula id="E27"><label>(26)</label><mml:math id="M65"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02264;</mml:mo><mml:mi>&#x003C8;</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0003E;</mml:mo><mml:mi>&#x003C8;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula> 
<p>Here, if <italic>m</italic><sub><italic>i</italic></sub> is less than or equal to the sampling threshold &#x003C8;, the sampling probability of experience <italic>i</italic> remains <italic>p</italic><sub><italic>i</italic></sub>; otherwise, it is set to zero. When the replay buffer reaches capacity, experiences are removed in ascending order of sampling priority as new experiences arrive. This ensures that new experiences can enter the experience pool and contribute to the learning process.</p>
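<p>A minimal Python sketch of the forgetting mechanism in Equation (26) is given below. The function names are hypothetical, and two details are assumptions rather than part of the method described above: the remaining probabilities are implicitly renormalized by the weighted sampler, and a minibatch is drawn with replacement, with the sampling counts updated only after the whole minibatch has been drawn.</p>
<preformat preformat-type="code"><![CDATA[
```python
import random


def apply_forgetting(probs, sample_counts, psi):
    """Eq. (26): keep p_i while m_i is at most psi, otherwise set it to zero."""
    return [p if m <= psi else 0.0 for p, m in zip(probs, sample_counts)]


def sample_minibatch(probs, sample_counts, psi, batch_size, rng=random):
    """Draw indices with the masked probabilities and update the counts m_i."""
    masked = apply_forgetting(probs, sample_counts, psi)
    if sum(masked) == 0:
        raise ValueError("no experience is eligible for sampling")
    # random.choices treats the weights as relative, so the masked
    # probabilities are renormalized implicitly (assumption).
    indices = rng.choices(range(len(masked)), weights=masked, k=batch_size)
    for i in indices:
        sample_counts[i] += 1
    return indices
```
]]></preformat>
<p>Since the mask is computed once per minibatch in this sketch, an experience may exceed &#x003C8; within a single draw; it is then excluded from the next minibatch onward.</p>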
<p>The MADDPG algorithm with an improved prioritized experience replay mechanism is shown in <xref ref-type="table" rid="T6">Algorithm 2</xref>.</p>
<table-wrap position="float" id="T6">
<label>Algorithm 2</label>
<caption><p>MADDPG with improved Prioritized Experience Replay.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left" valign="top"><monospace> <bold>Input</bold>: &#x000A0;Action noise <inline-formula><mml:math id="M66"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, discount factor &#x003B3;, constant &#x003B5;, coefficient of importance &#x003C1;, priority importance parameter &#x003B7;, sampling threshold &#x003C8;, actor learning rate &#x003B1;<sub>1</sub>, and soft update parameter &#x003C4;<sub>1</sub>;</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 2: &#x000A0;Initialize strategy networks &#x003BC; &#x0003D; (&#x003BC;<sub>1</sub>, &#x003BC;<sub>2</sub>, &#x022EF;&#x000A0;&#x000A0;&#x000A0;, &#x003BC;<sub><italic>M</italic></sub>), value networks <italic>q</italic> &#x0003D; (<italic>q</italic><sub>1</sub>, <italic>q</italic><sub>2</sub>, &#x022EF;&#x000A0;&#x000A0;&#x000A0;, <italic>q</italic><sub><italic>M</italic></sub>) and replay buffer <italic>D</italic></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 5: &#x000A0;<bold>for</bold> <italic>t</italic> &#x0003D; 1 to max-episode-length <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 6: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Observe initial state <italic>s</italic><sub>1</sub></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 7: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>for</bold> agent <italic>i</italic> &#x0003D; 1, &#x022EF;&#x000A0;&#x000A0;&#x000A0;, <italic>M</italic> <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 8: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Choose action <inline-formula><mml:math id="M67"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003BC;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>o</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">N</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> w.r.t. the current policy and exploration noise</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 10: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>end for</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 11: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Execute action <italic>a</italic><sub><italic>t</italic></sub> and observe reward <italic>r</italic><sub><italic>t</italic></sub> and next state <italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 13: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Add experience (<italic>s</italic><sub><italic>t</italic></sub>, <italic>a</italic><sub><italic>t</italic></sub>, <italic>r</italic><sub><italic>t</italic></sub>, <italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub>) to replay buffer <italic>D</italic></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 14: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Sample a minibatch of <italic>B</italic> experiences from <italic>D</italic> using the reward-TD prioritized experience replay method with forgetting mechanism</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 16: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>for</bold> <italic>i</italic> &#x0003D; 1, &#x022EF;&#x000A0;&#x000A0;&#x000A0;, <italic>M</italic> <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 17: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute target <inline-formula><mml:math id="M68"><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mi>q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 18: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute TD-error: <inline-formula><mml:math id="M69"><mml:msubsup><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 19: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute priority <inline-formula><mml:math id="M70"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>|</mml:mo><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003F5;</mml:mi></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 20: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute priority <inline-formula><mml:math id="M71"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003F5;</mml:mi></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 21: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute rank <italic>rank</italic><sub><italic>T</italic></sub>(<italic>i</italic>) and <italic>rank</italic><sub><italic>r</italic></sub>(<italic>i</italic>) based on <italic>P</italic><sub><italic>T</italic></sub>(<italic>i</italic>) and <italic>P</italic><sub><italic>r</italic></sub>(<italic>i</italic>), respectively</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 22: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute combined rank <italic>rank</italic><sub><italic>C</italic></sub>(<italic>i</italic>) &#x0003D; &#x003C1;<italic>rank</italic><sub><italic>r</italic></sub>(<italic>i</italic>)&#x0002B;</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 23: &#x000A0;(1 &#x02212; &#x003C1;)<italic>rank</italic><sub><italic>T</italic></sub>(<italic>i</italic>)</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 24: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute combined priority <inline-formula><mml:math id="M72"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 25: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute sampling probability <inline-formula><mml:math id="M73"><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 26: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>if</bold> <italic>m</italic><sub><italic>i</italic></sub> &#x0003E; &#x003C8; <bold>then</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 27: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<inline-formula><mml:math id="M74"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 28: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>else if</bold> <italic>m</italic><sub><italic>i</italic></sub> &#x02264; &#x003C8; <bold>then</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 29: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<inline-formula><mml:math id="M75"><mml:msubsup><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 30: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>end if</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 31: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>for</bold> agent <italic>i</italic> &#x0003D; 1, &#x022EF;&#x000A0;&#x000A0;&#x000A0;, <italic>M</italic> <bold>do</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 32: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Sample a minibatch of <italic>B</italic> samples from <italic>D</italic> using probabilities <italic>p</italic><sub><italic>i</italic></sub></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 33: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Compute the gradient <inline-formula><mml:math id="M76"><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> of the policy network of agent <italic>i</italic></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 34: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Update policy network parameters:</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 35: &#x000A0;<inline-formula><mml:math id="M77"><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>g</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 36: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Update the value network parameters by minimizing the loss w.r.t. TD-error:</monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> &#x000A0; <inline-formula><mml:math id="M78"><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C9;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:mfrac><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:munder><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 37: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>end for</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 38: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;Update target network parameters for each agent <italic>i</italic>: <inline-formula><mml:math id="M79"><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 39: &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>end for</bold></monospace></td>
</tr>
<tr>
<td align="left" valign="top"><monospace> 40: &#x000A0;<bold>end for</bold></monospace></td></tr>
</tbody>
</table>
</table-wrap>
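<p>The priority computation in steps 19 to 29 of the listing above can be illustrated with a minimal Python sketch. Several details are assumptions rather than statements from the paper: ranks are taken in descending order (the largest priority receives rank 1), the counter <italic>m</italic><sub><italic>i</italic></sub> is interpreted as the number of times experience <italic>i</italic> has already been sampled, the final renormalization is added so that the result can be passed to a sampler, and the function names and the values passed for &#x003C1; and &#x003B7; are hypothetical.</p>
<preformat>
```python
import numpy as np


def rank_descending(values):
    """Return 1-based ranks; the largest value gets rank 1 (assumed convention)."""
    order = np.argsort(-np.asarray(values, dtype=float), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks


def reward_td_probabilities(td_errors, rewards, counters, rho, eta, psi=10, eps=1e-4):
    """Sampling probabilities sketched after steps 19-29 of the listing."""
    p_td = np.abs(np.asarray(td_errors, dtype=float)) + eps   # step 19: P_T(i)
    p_r = np.asarray(rewards, dtype=float) + eps              # step 20: P_r(i)
    # steps 21-23: rank both priorities and mix the ranks with weight rho
    rank_c = rho * rank_descending(p_r) + (1.0 - rho) * rank_descending(p_td)
    p_c = (1.0 / rank_c) ** eta                               # step 24: P_C(i)
    probs = p_c / p_c.sum()                                   # step 25: p_i
    # steps 26-29: forgetting, experiences whose counter exceeds psi get probability 0
    probs = np.where(np.asarray(counters) > psi, 0.0, probs)
    total = probs.sum()
    # Renormalization is an added assumption, not a step of the listing
    return probs / total if total > 0 else probs
```
</preformat>
<p>A call such as <monospace>reward_td_probabilities(td, r, m, rho=0.5, eta=1.0)</monospace> returns a vector that can be passed as the <monospace>p</monospace> argument of <monospace>numpy.random.Generator.choice</monospace> to draw the minibatch indices.</p>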
</sec>
</sec>
</sec>
<sec id="s5">
<title>5. Experiment</title>
<sec>
<title>5.1. Experiment setup</title>
<p>To assess the efficacy of the proposed method, the algorithm was validated in two simulation scenarios (as depicted in <xref ref-type="fig" rid="F4">Figure 4</xref> for training scenarios and <xref ref-type="fig" rid="F5">Figure 5</xref> for test scenarios) and compared against the MADDPG algorithm. The simulation scenarios are designed based on the force settings and battlefield environment assumptions described in Section 3. The primary focus of the evaluation is on the improved MAML method and the Reward-TD prioritized experience replay method proposed in this paper.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Training scenario experiment setup diagram, <bold>(A)</bold> is training scenario 1, and <bold>(B)</bold> is training scenario 2. Various training scenarios were employed to improve the generalization capacity of the proposed algorithm.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0004.tif"/>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Test scenario experiment setup diagram, <bold>(A)</bold> is test scenario 1 and <bold>(B)</bold> is test scenario 2. Various test scenarios were employed to evaluate the generalization capacity of the proposed algorithm.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0005.tif"/>
</fig>
<p>The simulation scenario consists of four red reconnaissance UAVs and three attack UAVs, whose objective is to destroy the opponent&#x00027;s command post. During training, the position of the red UAVs is fixed at the beginning of each episode, while the positions of the opponent&#x00027;s command post and SAM differ between the two training scenarios to enable meta-training of the neural network. The training hardware comprises an Intel Xeon E5-4655V4 CPU with eight cores, 512 GB of RAM, and an RTX 3060 GPU with 12 GB of video memory. The proposed method is implemented using a standard fully connected multilayer perceptron (MLP) network with ReLU nonlinearities, consisting of three hidden layers. The size of the experimental environment is 240 km &#x000D7; 240 km, and the hyperparameters used in the experiments are shown in <xref ref-type="table" rid="T3">Table 3</xref>, with the settings following Xu et al. (<xref ref-type="bibr" rid="B33">2021</xref>). The meta-training process lasts for 5 &#x000D7; 10<sup>5</sup> episodes to allow for sufficient learning and optimization of the neural network.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Hyperparameter setting for training process.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Hyperparameter</bold></th>
<th valign="top" align="center"><bold>Value</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Replay buffer size</td>
<td valign="top" align="center">10<sup>5</sup></td>
</tr> <tr>
<td valign="top" align="left">Batch size</td>
<td valign="top" align="center">1,024</td>
</tr> <tr>
<td valign="top" align="left">Minibatch size</td>
<td valign="top" align="center">32</td>
</tr> <tr>
<td valign="top" align="left">Discount factor</td>
<td valign="top" align="center">0.95</td>
</tr> <tr>
<td valign="top" align="left">Actor learning rate</td>
<td valign="top" align="center">0.0001</td>
</tr> <tr>
<td valign="top" align="left">Critic learning rate</td>
<td valign="top" align="center">0.0005</td>
</tr> <tr>
<td valign="top" align="left">Prioritized experience replay parameter</td>
<td valign="top" align="center">0.6</td>
</tr> <tr>
<td valign="top" align="left">Exponential decay rate</td>
<td valign="top" align="center">0.9</td>
</tr> <tr>
<td valign="top" align="left">Exponential decay rate</td>
<td valign="top" align="center">0.999</td>
</tr> <tr>
<td valign="top" align="left">Small constant</td>
<td valign="top" align="center">10<sup>&#x02212;4</sup></td>
</tr> <tr>
<td valign="top" align="left">Action noise</td>
<td valign="top" align="center">Ornstein-Uhlenbeck (OU)</td>
</tr> <tr>
<td valign="top" align="left">Weight learning rate</td>
<td valign="top" align="center">0.001</td>
</tr> <tr>
<td valign="top" align="left">Meta-learning rate</td>
<td valign="top" align="center">0.001</td>
</tr> <tr>
<td valign="top" align="left">Coefficient of the importance of the experience</td>
<td valign="top" align="center">0.4</td>
</tr> <tr>
<td valign="top" align="left">Priority importance parameter</td>
<td valign="top" align="center">1</td>
</tr> <tr>
<td valign="top" align="left">Sampling threshold</td>
<td valign="top" align="center">10</td>
</tr> <tr>
<td valign="top" align="left">Soft update parameter</td>
<td valign="top" align="center">0.01</td>
</tr>
<tr>
<td valign="top" align="left">Activation function</td>
<td valign="top" align="center">ReLU</td>
</tr>
</tbody>
</table>
</table-wrap>
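<p>Two entries of Table 3, the soft update parameter and the Ornstein-Uhlenbeck action noise, can be illustrated with a short Python sketch. The soft update follows the target-network rule of the algorithm listing with the tabulated value 0.01. The noise parameters <monospace>theta</monospace>, <monospace>sigma</monospace>, and <monospace>dt</monospace> are not given in the paper; the defaults below are common choices and are assumptions, as are the function and class names.</p>
<preformat>
```python
import numpy as np


def soft_update(target_params, online_params, tau=0.01):
    """Return tau * theta + (1 - tau) * theta_target for each parameter array."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_params, target_params)]


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process used as exploration noise.

    theta, sigma and dt are illustrative assumptions, not values from the paper.
    """

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.state = np.full(size, mu, dtype=float)

    def sample(self):
        # Mean-reverting drift toward mu plus Gaussian diffusion
        drift = self.theta * (self.mu - self.state) * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.state.shape)
        self.state = self.state + drift + diffusion
        return self.state
```
</preformat>
<p>In this sketch the noise returned by <monospace>sample()</monospace> would be added to the deterministic policy output at each step, and <monospace>soft_update</monospace> would be applied to the target networks after each parameter update.</p>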
</sec>
<sec>
<title>5.2. Experiment result</title>
<p>This section aims to evaluate the meta-learning and cold-start capability of the proposed MW-MADDPG algorithm in new task environments, as well as its generalization, convergence speed, and robustness compared to existing algorithms. Additionally, the performance of the proposed Reward-TD prioritized experience replay method with the forgetting mechanism is evaluated and compared to conventional methods.</p>
<sec>
<title>5.2.1. Cross-task performance comparison</title>
<p>The performance of the three algorithms (MW-MADDPG, MAML-MADDPG, and MADDPG) is evaluated using the reward value as the evaluation index across five random seeds in the two scenarios, as shown in <xref ref-type="fig" rid="F6">Figure 6</xref>. The results demonstrate that the MW-MADDPG and MAML-MADDPG algorithms with meta-learning outperform the MADDPG algorithm without meta-learning in both scenarios from the initial episodes. Specifically, in scenario 1 the average reward is &#x02212;1.39 for the MW-MADDPG method, &#x02212;1.59 for the MAML-MADDPG method, and &#x02212;4.93 for the MADDPG method. In scenario 2, the average reward is &#x02212;2.25 for the MW-MADDPG method, &#x02212;2.18 for the MAML-MADDPG method, and &#x02212;4.05 for the MADDPG method.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Reward curves in the test scenarios, <bold>(A)</bold> is reward curve of test scenario 1 and <bold>(B)</bold> is reward curve of test scenario 2. Various test scenarios were employed to evaluate the generalization capacity of the proposed algorithm.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0006.tif"/>
</fig>
<p>Moreover, the initial performance of both methods employing meta-learning is significantly better than that of the MADDPG algorithm without meta-learning (<italic>p</italic> &#x0003C; 0.05). This indicates that the use of meta-learning methods can effectively improve the initial performance of the agent in this task.</p>
<p>In contrast, there is no significant difference between the initial performance of the MW-MADDPG method and the MAML-MADDPG method, indicating that the improvement in initial performance of the proposed method over the existing meta-learning method is not statistically significant.</p>
<p>However, in terms of expected performance, the reward achieved by the MW-MADDPG algorithm at convergence is significantly higher than that of the other two algorithms (<italic>p</italic> &#x0003C; 0.05). This suggests that the MW-MADDPG method proposed in this paper is capable of learning better strategies for the task at hand.</p>
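<p>The paper does not state which statistical test underlies the reported <italic>p</italic>-values. As one possible way to compare two algorithms evaluated on five random seeds each, the following Python sketch implements an exact two-sided permutation test on the difference of means. The choice of test and the function name are assumptions made for illustration only.</p>
<preformat>
```python
from itertools import combinations
from statistics import mean


def permutation_p_value(sample_a, sample_b):
    """Exact two-sided permutation test on the absolute difference of means."""
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    observed = abs(mean(sample_a) - mean(sample_b))
    hits = 0
    total = 0
    # Enumerate every way of assigning n_a of the pooled values to group A
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Count relabelings at least as extreme as the observed split
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total
```
</preformat>
<p>With five seeds per algorithm there are 252 possible relabelings, so the smallest two-sided <italic>p</italic>-value this test can return is 2/252, roughly 0.008, which is reached when the two sets of per-seed results do not overlap.</p>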
<p>Regarding convergence rate, the MW-MADDPG algorithm reaches convergence at around 6 &#x000D7; 10<sup>5</sup> episodes, while the MAML-MADDPG algorithm takes around 8.5 &#x000D7; 10<sup>5</sup> episodes, and the MADDPG algorithm takes around 9 &#x000D7; 10<sup>5</sup> episodes to converge for both scenarios. This indicates that the MW-MADDPG method proposed in this paper can converge quickly in a new task environment and alleviate the cold-start problem, showcasing an advantage over existing methods.</p>
<p><xref ref-type="fig" rid="F7">Figure 7</xref> depicts the task execution success rate of the red side. The MW-MADDPG method achieves a success rate of 77.71% and 72.21% in the two scenarios, respectively, which is significantly higher than the success rate of the other two methods (<italic>p</italic> &#x0003C; 0.05). These results indicate that the proposed method can effectively improve the performance of the agent under new tasks. Additionally, the variance of the MW-MADDPG method is smaller than that of the MAML-MADDPG method, indicating that the stability of the proposed method is better than that of the traditional meta-learning method.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Red side task execution success rate. In various test scenarios, the proposed method exhibits a higher winning rate compared to both the traditional meta-learning method and the non-meta-learning method.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0007.tif"/>
</fig>
<p>Overall, the experiments demonstrate that the MW-MADDPG algorithm proposed in this paper can effectively learn the features of similar tasks, and learn from historical experience to obtain more effective strategies. In terms of reward and task execution success rate, the proposed method exhibits better initial performance than MADDPG, faster convergence, better expected performance, a higher task success rate, and improved strategy stability.</p>
</sec>
<sec>
<title>5.2.2. Reward-TD and FIFO performance</title>
<p>This section aims to verify the effectiveness of the proposed Reward-TD prioritized experience replay method and forgetting mechanism. Two sets of experiments are designed to apply the above experience replay mechanism to the MADDPG algorithm in training scenario 1 and training scenario 2, respectively. The reward curves obtained by the agent are analyzed across five random seeds to evaluate the performance of the proposed method.</p>
<p><xref ref-type="fig" rid="F8">Figure 8</xref> illustrates the reward curves of different experience replay methods in scenario 1 and scenario 2, with RPER representing the Reward-TD prioritized experience replay method, PER the TD-error prioritized experience replay method, and VER the random experience replay method. The final rewards obtained with the RPER mechanism are significantly better than those of the other two methods (<italic>p</italic> &#x0003C; 0.05), indicating that the RPER mechanism can effectively improve the final reward level. In contrast, the difference between the final rewards of the PER and VER methods is not significant, suggesting that the TD-error-based prioritized experience replay method has little effect on the final reward.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Reward curves for different experience replay methods. <bold>(A)</bold> Training scenario 1. <bold>(B)</bold> Training scenario 2. The RPER method outperforms the other two experience replay methods in terms of final reward, robustness, and algorithm convergence speed.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0008.tif"/>
</fig>
<p>Regarding robustness, the RPER mechanism outperforms the PER mechanism, while the PER mechanism outperforms the VER mechanism. This indicates that the prioritized experience replay mechanism is better than the random uniform experience replay mechanism, and the Reward-TD based experience prioritization is better than the TD-error based experience prioritization.</p>
<p>In terms of convergence speed, the RPER algorithm achieves convergence significantly faster than the PER and VER algorithms. Specifically, RPER reaches convergence at around 8 &#x000D7; 10<sup>5</sup> episodes in both scenarios, while PER and VER reach convergence only after around 9 &#x000D7; 10<sup>5</sup> episodes. These results demonstrate that the RPER mechanism helps to improve the convergence speed of the algorithm, while PER and VER have no significant impact on the convergence speed.</p>
<p><xref ref-type="fig" rid="F9">Figure 9</xref> illustrates the reward curves for different experience retention methods, comparing the effects of the forgetting mechanism (FM) and the first-in-first-out mechanism (FIFO) while using RPER and the MADDPG algorithm. From <xref ref-type="fig" rid="F9">Figure 9</xref>, it can be observed that training with the forgetting mechanism converges significantly faster than training with the FIFO mechanism (<italic>p</italic> &#x0003C; 0.05). This suggests that the forgetting mechanism proposed in this paper can effectively retain experience fragments that are beneficial to the agent and improve the training speed.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Reward curves for different experience retention methods. <bold>(A)</bold> Training scenario 1. <bold>(B)</bold> Training scenario 2. The forgetting mechanism shows better convergence speed and robustness compared to the first-in-first-out mechanism in different training tasks, but the difference in the final reward is not significant.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1243174-g0009.tif"/>
</fig>
<p>In terms of robustness, FM exhibits fewer curve fluctuations and narrower error bands than FIFO, as seen in the figure. The data show that the variance is reduced by 27.35% using FM compared to FIFO, indicating that FM can improve the algorithm&#x00027;s robustness during training.</p>
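<p>The paper does not specify how the variance reduction is computed. One plausible reading, given here only as a sketch under that assumption, is the relative decrease of the across-seed reward variance averaged over episodes. The function name and the array layout (one row per random seed, one column per episode) are hypothetical.</p>
<preformat>
```python
import numpy as np


def variance_reduction_percent(baseline_runs, candidate_runs):
    """Relative reduction of mean across-seed variance, in percent.

    Both inputs have shape (seeds, episodes); this layout is an assumption.
    """
    baseline_var = np.var(np.asarray(baseline_runs, dtype=float), axis=0, ddof=1).mean()
    candidate_var = np.var(np.asarray(candidate_runs, dtype=float), axis=0, ddof=1).mean()
    return 100.0 * (baseline_var - candidate_var) / baseline_var
```
</preformat>
<p>Under this reading, the FIFO reward curves would be passed as <monospace>baseline_runs</monospace> and the FM reward curves as <monospace>candidate_runs</monospace>; a positive return value means the candidate has lower variance.</p>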
<p>Notably, there is no significant difference between the final rewards of the two experience retention mechanisms, suggesting that the use of different experience retention mechanisms has no significant effect on the final training effect.</p>
<p><xref ref-type="table" rid="T4">Table 4</xref> compares the proposed method with the original MADDPG method in terms of task success rate, strategic location ruin number, and other metrics to evaluate their advantages and disadvantages. The table shows that the proposed method outperforms the MADDPG method on both training and testing tasks. Specifically, the MW-MADDPG method exhibits significantly better attack UAV survival than detect UAV survival on testing tasks, indicating that it can learn an efficient strategy for attacking UAVs. These results suggest that the MW-MADDPG method proposed in this paper can effectively learn the common knowledge among tasks from training tasks and apply it to test scenarios, showcasing better cross-task capability.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Algorithm performance comparison.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center" colspan="4"><bold>MADDPG</bold></th>
<th valign="top" align="center" colspan="4"><bold>MW-MADDPG</bold></th>
</tr>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="center"><bold>Scenario</bold></th>
<th valign="top" align="center"><bold>Training scenario 1</bold></th>
<th valign="top" align="center"><bold>Training scenario 2</bold></th>
<th valign="top" align="center"><bold>Test scenario 1</bold></th>
<th valign="top" align="center"><bold>Test scenario 2</bold></th>
<th valign="top" align="center"><bold>Training scenario 1</bold></th>
<th valign="top" align="center"><bold>Training scenario 2</bold></th>
<th valign="top" align="center"><bold>Test scenario 1</bold></th>
<th valign="top" align="center"><bold>Test scenario 2</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Reward</td>
<td valign="top" align="center">11.31 &#x000B1; 1.31</td>
<td valign="top" align="center">11.52 &#x000B1; 1.29</td>
<td valign="top" align="center">5.03 &#x000B1; 1.88</td>
<td valign="top" align="center">5.12 &#x000B1; 2.13</td>
<td valign="top" align="center">11.62 &#x000B1; 1.14</td>
<td valign="top" align="center">11.58 &#x000B1; 1.26</td>
<td valign="top" align="center">10.83 &#x000B1; 1.53</td>
<td valign="top" align="center">10.48 &#x000B1; 1.45</td>
</tr> <tr>
<td valign="top" align="left">Mission success rate (%)</td>
<td valign="top" align="center">80.74 &#x000B1; 7.84</td>
<td valign="top" align="center">82.47 &#x000B1; 7.93</td>
<td valign="top" align="center">13.88 &#x000B1; 7.76</td>
<td valign="top" align="center">15.55 &#x000B1; 8.94</td>
<td valign="top" align="center">81.39 &#x000B1; 6.49</td>
<td valign="top" align="center">79.85 &#x000B1; 7.39</td>
<td valign="top" align="center">78.76 &#x000B1; 7.94</td>
<td valign="top" align="center">75.48 &#x000B1; 8.93</td>
</tr> <tr>
<td valign="top" align="left">Strategic location ruin number</td>
<td valign="top" align="center">0.93 &#x000B1; 0.44</td>
<td valign="top" align="center">0.91 &#x000B1; 0.36</td>
<td valign="top" align="center">0.12 &#x000B1; 0.09</td>
<td valign="top" align="center">0.13 &#x000B1; 0.08</td>
<td valign="top" align="center">0.85 &#x000B1; 0.37</td>
<td valign="top" align="center">0.87 &#x000B1; 0.31</td>
<td valign="top" align="center">0.81 &#x000B1; 0.54</td>
<td valign="top" align="center">0.79 &#x000B1; 0.66</td>
</tr> <tr>
<td valign="top" align="left">Detect UAV survival number</td>
<td valign="top" align="center">0.83 &#x000B1; 0.53</td>
<td valign="top" align="center">0.88 &#x000B1; 0.48</td>
<td valign="top" align="center">0.21 &#x000B1; 0.13</td>
<td valign="top" align="center">0.19 &#x000B1; 0.11</td>
<td valign="top" align="center">0.91 &#x000B1; 0.47</td>
<td valign="top" align="center">0.83 &#x000B1; 0.53</td>
<td valign="top" align="center">0.63 &#x000B1; 0.37</td>
<td valign="top" align="center">0.71 &#x000B1; 0.41</td>
</tr>
<tr>
<td valign="top" align="left">Attack UAV survival number</td>
<td valign="top" align="center">1.66 &#x000B1; 0.44</td>
<td valign="top" align="center">1.71 &#x000B1; 0.39</td>
<td valign="top" align="center">0.37 &#x000B1; 0.18</td>
<td valign="top" align="center">0.41 &#x000B1; 0.21</td>
<td valign="top" align="center">1.74 &#x000B1; 0.36</td>
<td valign="top" align="center">1.69 &#x000B1; 0.41</td>
<td valign="top" align="center">1.38 &#x000B1; 0.47</td>
<td valign="top" align="center">1.41 &#x000B1; 0.51</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Furthermore, the proposed Reward-TD prioritized experience replay method with the forgetting mechanism improves the robustness of the MW-MADDPG method, as reflected in its lower variance.</p>
</sec>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusion</title>
<p>In summary, this paper proposes the MW-MADDPG algorithm for the cross-task cooperative decision-making problem of heterogeneous UAV swarms. The algorithm combines an improved MAML meta-learning method with the Reward-TD prioritized experience replay method with a forgetting mechanism, enabling cross-task intelligent UAV decision-making based on the MADDPG algorithm. Experimental results demonstrate that the proposed method achieves higher task success rates, greater robustness, and higher rewards than traditional methods, while also generalizing better and overcoming the cold start problem of traditional methods. The proposed algorithm has the potential to be extended to larger-scale scenarios and to provide a solution to the cross-task heterogeneous UAV swarm surprise defense problem.</p>
<p>In future work, meta-learning methods could be introduced into intelligent decision-making for air defense systems to enable self-play between UAV penetration and air defense systems. Additionally, combining transfer learning with meta-learning may further improve generalization performance. We also plan to build a high-fidelity battlefield environment that simulates the battle process more accurately and enables more realistic testing of the proposed algorithms.</p>
</sec>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>Conceptualization: MZ and QF. Data curation: XG. Funding acquisition and writing&#x02014;review and editing: GW. Investigation: MZ, XG, YC, and XL. Methodology and writing&#x02014;original draft: MZ. Software: TL. Supervision: GW and QF. Validation: QF and YC. Visualization: MZ, TL, and XL. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>This research was funded by the National Natural Science Foundation of China (Grants: 62106283 and 52175282) and the Basic Natural Science Research Program of Shaanxi Province (Grant: 2021JM-226).</p>
</sec>
<ack><p>We would like to thank the reviewers, whose insightful comments greatly improved the quality of this paper.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aleksander</surname> <given-names>K. C.</given-names></name></person-group> (<year>2018</year>). <article-title>Military use of unmanned aerial vehicles-a historical study</article-title>. <source>Saf. Def</source>. <volume>4</volume>, <fpage>17</fpage>&#x02013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.37105/sd.4</pub-id><pub-id pub-id-type="pmid">24581931</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Beck</surname> <given-names>J.</given-names></name> <name><surname>Vuorio</surname> <given-names>R.</given-names></name> <name><surname>Liu</surname> <given-names>E. Z.</given-names></name> <name><surname>Xiong</surname> <given-names>Z.</given-names></name> <name><surname>Zintgraf</surname> <given-names>L.</given-names></name> <name><surname>Finn</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Survey of meta-reinforcement learning</article-title>. <source>arXiv</source>. [preprint]. <pub-id pub-id-type="doi">10.48550/arXiv.2301.08028</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chamola</surname> <given-names>V.</given-names></name> <name><surname>Kotesh</surname> <given-names>P.</given-names></name> <name><surname>Agarwal</surname> <given-names>A.</given-names></name> <name><surname>Naren</surname></name> <name><surname>Gupta</surname> <given-names>N.</given-names></name> <name><surname>Guizani</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>A comprehensive review of unmanned aerial vehicle attacks and neutralization techniques</article-title>. <source>Ad Hoc Netw</source>. <volume>111</volume>, <fpage>102324</fpage>. <pub-id pub-id-type="doi">10.1016/j.adhoc.2020.102324</pub-id><pub-id pub-id-type="pmid">33071687</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>L.</given-names></name> <name><surname>Hu</surname> <given-names>B.</given-names></name> <name><surname>Guan</surname> <given-names>Z. H.</given-names></name> <name><surname>Zhao</surname> <given-names>L.</given-names></name> <name><surname>Shen</surname> <given-names>X. M.</given-names></name></person-group> (<year>2022</year>). <article-title>Multiagent meta-reinforcement learning for adaptive multipath routing optimization</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>33</volume>, <fpage>5374</fpage>&#x02013;<lpage>5386</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2021.3070584</pub-id><pub-id pub-id-type="pmid">33881997</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fawzi</surname> <given-names>A.</given-names></name> <name><surname>Balog</surname> <given-names>M.</given-names></name> <name><surname>Huang</surname> <given-names>A.</given-names></name> <name><surname>Hubert</surname> <given-names>T.</given-names></name> <name><surname>Romera-Paredes</surname> <given-names>B.</given-names></name> <name><surname>Barekatain</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Discovering faster matrix multiplication algorithms with reinforcement learning</article-title>. <source>Nature</source> <volume>610</volume>, <fpage>47</fpage>. <pub-id pub-id-type="doi">10.1038/s41586-022-05172-4</pub-id><pub-id pub-id-type="pmid">36198780</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ge</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>L.</given-names></name></person-group> (<year>2023</year>). <article-title>Electromagnetic interference modeling and elimination for a solar/hydrogen hybrid powered small-scale UAV</article-title>. <source>Chin. J. Aeronaut</source>. <pub-id pub-id-type="doi">10.1016/j.cja.2023.03.044</pub-id>. [Epub ahead of print].</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Giles</surname> <given-names>K.</given-names></name> <name><surname>Giammarco</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>A mission-based architecture for swarm unmanned systems</article-title>. <source>Syst. Eng</source>. <volume>22</volume>, <fpage>271</fpage>&#x02013;<lpage>281</lpage>. <pub-id pub-id-type="doi">10.1002/sys.21477</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hospedales</surname> <given-names>T.</given-names></name> <name><surname>Antoniou</surname> <given-names>A.</given-names></name> <name><surname>Micaelli</surname> <given-names>P.</given-names></name> <name><surname>Storkey</surname> <given-names>A.</given-names></name></person-group> (<year>2022</year>). <article-title>Meta-learning in neural networks: a survey</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>44</volume>, <fpage>5149</fpage>&#x02013;<lpage>5169</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2021.3079209</pub-id><pub-id pub-id-type="pmid">33974543</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hou</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>L.</given-names></name> <name><surname>Wei</surname> <given-names>Q.</given-names></name> <name><surname>Xu</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>A novel DDPG method with prioritized experience replay</article-title>. <source>2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC)</source> (<publisher-loc>Banff, AB</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>316</fpage>&#x02013;<lpage>321</lpage>. <pub-id pub-id-type="doi">10.1109/SMC.2017.8122622</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>Z.</given-names></name> <name><surname>Gao</surname> <given-names>Z.</given-names></name> <name><surname>Wan</surname> <given-names>K.</given-names></name> <name><surname>Evgeny</surname> <given-names>N.</given-names></name> <name><surname>Li</surname> <given-names>K.</given-names></name></person-group> (<year>2023</year>). <article-title>Imaginary filtered hindsight experience replay for UAV tracking dynamic targets in large-scale unknown environments</article-title>. <source>Chin. J. Aeronaut</source>. <volume>36</volume>, <fpage>377</fpage>&#x02013;<lpage>391</lpage>. <pub-id pub-id-type="doi">10.1016/j.cja.2022.09.008</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>P.</given-names></name> <name><surname>Song</surname> <given-names>S. J.</given-names></name> <name><surname>Huang</surname> <given-names>G.</given-names></name></person-group> (<year>2022</year>). <article-title>Attention-based meta-reinforcement learning for tracking control of AUV with time-varying dynamics</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>33</volume>, <fpage>6388</fpage>&#x02013;<lpage>6401</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2021.3079148</pub-id><pub-id pub-id-type="pmid">34029197</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jin</surname> <given-names>N. S.</given-names></name> <name><surname>Gui</surname> <given-names>J. S.</given-names></name> <name><surname>Zhou</surname> <given-names>X. R.</given-names></name></person-group> (<year>2023</year>). <article-title>Equalizing service probability in UAV-assisted wireless powered mmWave networks for post-disaster rescue</article-title>. <source>Comput. Netw</source>. <volume>225</volume>, <fpage>109644</fpage>. <pub-id pub-id-type="doi">10.1016/j.comnet.2023.109644</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lei</surname> <given-names>L.</given-names></name> <name><surname>Shen</surname> <given-names>G. Q.</given-names></name> <name><surname>Zhang</surname> <given-names>L. J.</given-names></name> <name><surname>Li</surname> <given-names>Z. L.</given-names></name></person-group> (<year>2021</year>). <article-title>Toward intelligent cooperation of UAV swarms: when machine learning meets digital twin</article-title>. <source>IEEE Netw</source>. <volume>35</volume>, <fpage>386</fpage>&#x02013;<lpage>392</lpage>. <pub-id pub-id-type="doi">10.1109/MNET.011.2000388</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>M.</given-names></name> <name><surname>Huang</surname> <given-names>T.</given-names></name> <name><surname>Zhu</surname> <given-names>W.</given-names></name></person-group> (<year>2022</year>). <article-title>Clustering experience replay for the effective exploitation in reinforcement learning</article-title>. <source>Pattern Recognit</source>. <volume>131</volume>, <fpage>108875</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2022.108875</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Tan</surname> <given-names>J. W.</given-names></name> <name><surname>Liu</surname> <given-names>A. F.</given-names></name> <name><surname>Vijayakumar</surname> <given-names>P.</given-names></name> <name><surname>Kumar</surname> <given-names>N.</given-names></name> <name><surname>Alazab</surname> <given-names>M. A.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Novel UAV-enabled data collection scheme for intelligent transportation system through UAV speed control</article-title>. <source>IEEE Trans. Intell. Transp. Syst</source>. <volume>22</volume>, <fpage>2100</fpage>&#x02013;<lpage>2110</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2020.3040557</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>X. M.</given-names></name> <name><surname>Wu</surname> <given-names>G. H.</given-names></name> <name><surname>Fan</surname> <given-names>M. F.</given-names></name> <name><surname>Wang</surname> <given-names>R.</given-names></name> <name><surname>Gao</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>An iterative two-phase optimization method based on divide and conquer framework for integrated scheduling of multiple UAVs</article-title>. <source>IEEE Trans. Intell. Transp. Syst</source>. <volume>22</volume>, <fpage>5926</fpage>&#x02013;<lpage>5938</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2020.3042670</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>J. L.</given-names></name> <name><surname>Liao</surname> <given-names>X. H.</given-names></name> <name><surname>Ye</surname> <given-names>H. P.</given-names></name> <name><surname>Yue</surname> <given-names>H. Y.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Tan</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2022a</year>). <article-title>Swarm scheduling method for remote sensing observations during emergency scenarios</article-title>. <source>Remote Sens</source>. <volume>14</volume>, <fpage>1406</fpage>. <pub-id pub-id-type="doi">10.3390/rs14061406</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Quijano</surname> <given-names>K.</given-names></name> <name><surname>Crawford</surname> <given-names>M. M.</given-names></name></person-group> (<year>2022b</year>). <article-title>YOLOv5-tassel: detecting tassels in RGB UAV imagery with improved YOLOv5 based on transfer learning</article-title>. <source>IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens</source>. <volume>15</volume>, <fpage>8085</fpage>&#x02013;<lpage>8094</lpage>. <pub-id pub-id-type="doi">10.1109/JSTARS.2022.3206399</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mnih</surname> <given-names>V.</given-names></name> <name><surname>Kavukcuoglu</surname> <given-names>K.</given-names></name> <name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Rusu</surname> <given-names>A. A.</given-names></name> <name><surname>Veness</surname> <given-names>J.</given-names></name> <name><surname>Bellemare</surname> <given-names>M. G.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Human-level control through deep reinforcement learning</article-title>. <source>Nature</source> <volume>518</volume>, <fpage>529</fpage>&#x02013;<lpage>533</lpage>. <pub-id pub-id-type="doi">10.1038/nature14236</pub-id><pub-id pub-id-type="pmid">25719670</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ouyang</surname> <given-names>Q.</given-names></name> <name><surname>Wu</surname> <given-names>Z. X.</given-names></name> <name><surname>Cong</surname> <given-names>Y. H.</given-names></name> <name><surname>Wang</surname> <given-names>Z. S.</given-names></name></person-group> (<year>2023</year>). <article-title>Formation control of unmanned aerial vehicle swarms: a comprehensive review</article-title>. <source>Asian J. Control</source> <volume>25</volume>, <fpage>570</fpage>&#x02013;<lpage>593</lpage>. <pub-id pub-id-type="doi">10.1002/asjc.2806</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>W.</given-names></name> <name><surname>Wang</surname> <given-names>N.</given-names></name> <name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Hwang</surname> <given-names>K. S.</given-names></name></person-group> (<year>2022</year>). <article-title>A dynamically adaptive approach to reducing strategic interference for multiagent systems</article-title>. <source>IEEE Trans. Cogn. Develop. Syst</source>. <volume>14</volume>, <fpage>1486</fpage>&#x02013;<lpage>1495</lpage>. <pub-id pub-id-type="doi">10.1109/TCDS.2021.3110959</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pasha</surname> <given-names>J.</given-names></name> <name><surname>Elmi</surname> <given-names>Z.</given-names></name> <name><surname>Purkayastha</surname> <given-names>S.</given-names></name> <name><surname>Fathollahi-Fard</surname> <given-names>A. M.</given-names></name> <name><surname>Ge</surname> <given-names>Y. E.</given-names></name> <name><surname>Lau</surname> <given-names>Y. Y.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>The drone scheduling problem: a systematic state-of-the-art review</article-title>. <source>IEEE Trans. Intell. Transp. Syst</source>. <volume>23</volume>, <fpage>14224</fpage>&#x02013;<lpage>14247</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2022.3155072</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perolat</surname> <given-names>J.</given-names></name> <name><surname>De Vylder</surname> <given-names>B.</given-names></name> <name><surname>Hennes</surname> <given-names>D.</given-names></name> <name><surname>Tarassov</surname> <given-names>E.</given-names></name> <name><surname>Strub</surname> <given-names>F.</given-names></name> <name><surname>de Boer</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Mastering the game of Stratego with model-free multiagent reinforcement learning</article-title>. <source>Science</source> <volume>378</volume>, <fpage>990</fpage>. <pub-id pub-id-type="doi">10.1126/science.add4679</pub-id><pub-id pub-id-type="pmid">36454847</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Poudel</surname> <given-names>S.</given-names></name> <name><surname>Moh</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>Task assignment algorithms for unmanned aerial vehicle networks: a comprehensive survey</article-title>. <source>Veh. Commun</source>. <volume>35</volume>, <fpage>100469</fpage>. <pub-id pub-id-type="doi">10.1016/j.vehcom.2022.100469</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Puente-Castro</surname> <given-names>A.</given-names></name> <name><surname>Rivero</surname> <given-names>D.</given-names></name> <name><surname>Pazos</surname> <given-names>A.</given-names></name> <name><surname>Fernandez-Blanco</surname> <given-names>E.</given-names></name></person-group> (<year>2022</year>). <article-title>A review of artificial intelligence applied to path planning in UAV swarms</article-title>. <source>Neural Comput. Appl</source>. <volume>34</volume>, <fpage>153</fpage>&#x02013;<lpage>170</lpage>. <pub-id pub-id-type="doi">10.1007/s00521-021-06569-4</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rodriguez-Fernandez</surname> <given-names>V.</given-names></name> <name><surname>Menendez</surname> <given-names>H. D.</given-names></name> <name><surname>Camacho</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>Analysing temporal performance profiles of UAV operators using time series clustering</article-title>. <source>Expert Syst. Appl</source>. <volume>70</volume>, <fpage>103</fpage>&#x02013;<lpage>118</lpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2016.10.044</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Silveira</surname> <given-names>A.</given-names></name> <name><surname>Silva</surname> <given-names>A.</given-names></name> <name><surname>Coelho</surname> <given-names>A.</given-names></name> <name><surname>Real</surname> <given-names>J.</given-names></name> <name><surname>Silva</surname> <given-names>O.</given-names></name></person-group> (<year>2020</year>). <article-title>Design and real-time implementation of a wireless autopilot using multivariable predictive generalized minimum variance control in the state-space</article-title>. <source>Aerosp. Sci. Technol</source>. <volume>105</volume>, <fpage>106053</fpage>. <pub-id pub-id-type="doi">10.1016/j.ast.2020.106053</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Duan</surname> <given-names>H. B.</given-names></name> <name><surname>Lao</surname> <given-names>S. Y.</given-names></name></person-group> (<year>2023</year>). <article-title>Swarm intelligence algorithms for multiple unmanned aerial vehicles collaboration: a comprehensive review</article-title>. <source>Artif. Intell. Rev</source>. <volume>56</volume>, <fpage>4295</fpage>&#x02013;<lpage>4327</lpage>. <pub-id pub-id-type="doi">10.1007/s10462-022-10281-7</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X. W.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>H. Y.</given-names></name> <name><surname>Wang</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Cui</surname> <given-names>K. K.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>A mini review on UAV mission planning</article-title>. <source>J. Ind. Manag. Optim</source>. <volume>19</volume>, <fpage>3362</fpage>&#x02013;<lpage>3382</lpage>. <pub-id pub-id-type="doi">10.3934/jimo.2022089</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z. H.</given-names></name> <name><surname>Zhang</surname> <given-names>J. L.</given-names></name></person-group> (<year>2022</year>). <article-title>A task allocation algorithm for a swarm of unmanned aerial vehicles based on bionic wolf pack method</article-title>. <source>Knowl. Based Syst</source>. <volume>250</volume>, <fpage>109072</fpage>. <pub-id pub-id-type="doi">10.1016/j.knosys.2022.109072</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>D. W.</given-names></name> <name><surname>Ma</surname> <given-names>J. F.</given-names></name> <name><surname>Luo</surname> <given-names>L. B.</given-names></name> <name><surname>Wang</surname> <given-names>Y. B.</given-names></name> <name><surname>He</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>X. H.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Computation offloading over multi-UAV MEC network: a distributed deep reinforcement learning approach</article-title>. <source>Comput. Netw</source>. <volume>199</volume>, <fpage>108439</fpage>. <pub-id pub-id-type="doi">10.1016/j.comnet.2021.108439</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wurman</surname> <given-names>P. R.</given-names></name> <name><surname>Barrett</surname> <given-names>S.</given-names></name> <name><surname>Kawamoto</surname> <given-names>K.</given-names></name> <name><surname>MacGlashan</surname> <given-names>J.</given-names></name> <name><surname>Subramanian</surname> <given-names>K.</given-names></name> <name><surname>Walsh</surname> <given-names>T. J.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Outracing champion Gran Turismo drivers with deep reinforcement learning</article-title>. <source>Nature</source> <volume>602</volume>, <fpage>223</fpage>. <pub-id pub-id-type="doi">10.1038/s41586-021-04357-7</pub-id><pub-id pub-id-type="pmid">35140384</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>Z. X.</given-names></name> <name><surname>Chen</surname> <given-names>X. L.</given-names></name> <name><surname>Tang</surname> <given-names>W.</given-names></name> <name><surname>Lai</surname> <given-names>J.</given-names></name> <name><surname>Cao</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). <article-title>Meta weight learning via model-agnostic meta-learning</article-title>. <source>Neurocomputing</source> <volume>432</volume>, <fpage>124</fpage>&#x02013;<lpage>132</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2020.08.034</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>M.</given-names></name> <name><surname>Bi</surname> <given-names>W. H.</given-names></name> <name><surname>Zhang</surname> <given-names>A.</given-names></name> <name><surname>Gao</surname> <given-names>F.</given-names></name></person-group> (<year>2022</year>). <article-title>A distributed task reassignment method in dynamic environment for multi-UAV system</article-title>. <source>Appl. Intell</source>. <volume>52</volume>, <fpage>1582</fpage>&#x02013;<lpage>1601</lpage>. <pub-id pub-id-type="doi">10.1007/s10489-021-02502-3</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yao</surname> <given-names>C. H.</given-names></name> <name><surname>Tian</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Song</surname> <given-names>L. B.</given-names></name> <name><surname>Jing</surname> <given-names>J.</given-names></name> <name><surname>Ma</surname> <given-names>W. F.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Joint optimization of control and communication in autonomous UAV swarms: challenges, potentials, and framework</article-title>. <source>IEEE Wirel. Commun</source>. <volume>28</volume>, <fpage>28</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1109/MWC.011.2100036</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>W.</given-names></name> <name><surname>Wang</surname> <given-names>M. M.</given-names></name> <name><surname>Li</surname> <given-names>S. R.</given-names></name> <name><surname>Li</surname> <given-names>B. Q.</given-names></name></person-group> (<year>2022</year>). <article-title>Helicopter-UAVs search and rescue task allocation considering UAVs operating environment and performance</article-title>. <source>Comput. Ind. Eng</source>. <volume>167</volume>, <fpage>107994</fpage>. <pub-id pub-id-type="doi">10.1016/j.cie.2022.107994</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>T. T.</given-names></name> <name><surname>Li</surname> <given-names>G. X.</given-names></name> <name><surname>Song</surname> <given-names>Y. J.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>Y. R.</given-names></name> <name><surname>Yang</surname> <given-names>J. C.</given-names></name></person-group> (<year>2023</year>). <article-title>A multi-scenario text generation method based on meta reinforcement learning</article-title>. <source>Pattern Recognit. Lett</source>. <volume>165</volume>, <fpage>47</fpage>&#x02013;<lpage>54</lpage>. <pub-id pub-id-type="doi">10.1016/j.patrec.2022.11.031</pub-id></citation>
</ref>
</ref-list> 
</back>
</article>