<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2024.1375309</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A reinforcement learning enhanced pseudo-inverse approach to self-collision avoidance of redundant robots</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Hong</surname> <given-names>Tinghe</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/2636901/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Weibing</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/445691/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Huang</surname> <given-names>Kai</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/729626/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>School of Computer Science and Engineering, Sun Yat-sen University</institution>, <addr-line>Guangzhou</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Rui Li, Chinese Academy of Sciences (CAS), China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Lin Hong, Technical University of Munich, Germany</p>
<p>Hu Cao, Technical University of Munich, Germany</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Kai Huang <email>huangk36&#x00040;mail.sysu.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>28</day>
<month>03</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>18</volume>
<elocation-id>1375309</elocation-id>
<history>
<date date-type="received">
<day>23</day>
<month>01</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>03</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2024 Hong, Li and Huang.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Hong, Li and Huang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>Redundant robots offer greater flexibility compared to non-redundant ones but are susceptible to increased collision risks when the end-effector approaches the robot&#x00027;s own links. Redundant degrees of freedom (DoFs) present an opportunity for collision avoidance; however, selecting an appropriate inverse kinematics (IK) solution remains challenging due to the infinite possible solutions.</p></sec>
<sec>
<title>Methods</title>
<p>This study proposes a reinforcement learning (RL) enhanced pseudo-inverse approach to address self-collision avoidance in redundant robots. The RL agent is integrated into the redundancy resolution process of a pseudo-inverse method to determine a suitable IK solution for avoiding self-collisions during task execution. Additionally, an improved replay buffer is implemented to enhance the performance of the RL algorithm.</p></sec>
<sec>
<title>Results</title>
<p>Simulations and experiments validate the effectiveness of the proposed method in reducing the risk of self-collision in redundant robots.</p></sec>
<sec>
<title>Conclusion</title>
<p>The RL enhanced pseudo-inverse approach presented in this study demonstrates promising results in mitigating self-collision risks in redundant robots, highlighting its potential for enhancing safety and performance in robotic systems.</p></sec></abstract>
<kwd-group>
<kwd>reinforcement learning</kwd>
<kwd>inverse kinematics</kwd>
<kwd>redundant robots</kwd>
<kwd>self-collision avoidance</kwd>
<kwd>sim to real</kwd>
</kwd-group>
<counts>
<fig-count count="6"/>
<table-count count="8"/>
<equation-count count="16"/>
<ref-count count="36"/>
<page-count count="12"/>
<word-count count="8173"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Kinematically redundant robots possess more degrees of freedom (DoFs) than required to perform a user-specified task. Therefore, redundant robots deliver more advantages in human-robot interactions. However, greater flexibility originated from redundant DoFs increases the risk of self-collision, especially when the human operator forces the end-effector to move close to the robot&#x00027;s own links. Moreover, the robot links are constantly moving, which makes self-collision avoidance more difficult.</p>
<p>The crux in the self-collision problem of redundant robots lies in the difficulty of determining appropriate inverse kinematics (IK) (Paul and Shimano, <xref ref-type="bibr" rid="B30">1978</xref>) solution. For non-redundant robotic arms, providing a specific position and orientation for the end effector typically results in no more than 32 IK solutions (Tsai and Morgan, <xref ref-type="bibr" rid="B34">1985</xref>). However, in the case of redundant robotic arms, there often correspond an infinite number of solutions. This makes it challenging to determine an appropriate IK solution to avoid self-collision.</p>
<p>Firstly, it is difficult to obtain the closed form solution of redundant robots directly (Zaplana and Basanez, <xref ref-type="bibr" rid="B36">2018</xref>) and it is also burdensome to meet the constraints of self-collision avoidance. Secondly, the IK solution should be consistent with the mechanical properties of the robot arm, i.e., the joint variables need to be bounded to ensure the smoothness of the motion. Finally, the solution should have some generality to migrate on multiple robots.</p>
<p>There are some limitations in the existing IK solutions. Some researchers have proposed numerical methods for solving inverse kinematics problems, such as the Jacobian matrix inversion-based solution (Colom&#x000E9; and Torras, <xref ref-type="bibr" rid="B14">2015</xref>), the transposed Jacobian matrix-based solution (Wolovich and Elliott, <xref ref-type="bibr" rid="B35">1984</xref>), and the damped least squares (DLS) solution (Nakamura and Hanafusa, <xref ref-type="bibr" rid="B29">1986</xref>). Based on the current joints&#x00027; status and the Denavit-Hartenberg (D-H) parameters (Denavit and Hartenberg, <xref ref-type="bibr" rid="B15">1955</xref>) of a given robot, the end-effector can be continuously controlled under the guidance of one predetermined path. Unfortunately, numerical methods only provide a feasible set of IK solutions, which cannot be selected in complex cases. In contrast, Heuristic-based methods transform the IK problem into an optimization problem and select adaptive solutions. Momani et. al used a genetic algorithm to solve the IK problem and obtains a continuous smooth solution (Momani et al., <xref ref-type="bibr" rid="B28">2016</xref>). In Rokbani and Alimi (<xref ref-type="bibr" rid="B31">2013</xref>) and Dereli and K&#x000F6;ker (<xref ref-type="bibr" rid="B16">2018</xref>), different variants of the particle swarm optimization were presented to solve the IK problem for redundant robots. In Dereli and K&#x000F6;ker (<xref ref-type="bibr" rid="B17">2020</xref>), a quantum particle swarm was proposed to compute the IK solutions of a 7-DoF robot. Koeker and Cakar improve the control accuracy of a robotic arm by combining neural networks, simulated annealing, and genetic algorithms (K&#x000F6;ker and &#x000C7;akar, <xref ref-type="bibr" rid="B25">2016</xref>). Although heuristic methods can address IK problems under constraints, the dynamic nature of the robotic arm&#x00027;s motion creates a constantly changing environment, making it challenging for heuristic methods to converge. Furthermore, heuristic approaches often focus on solving for a specific state of the environment, overlooking the temporal and spatial continuity of the robotic arm&#x00027;s movement.</p>
<p>In contrast to traditional methods, machine learning approaches offer greater adaptability to complex tasks in controlling robots. In Cao et al. (<xref ref-type="bibr" rid="B11">2022</xref>, <xref ref-type="bibr" rid="B10">2023a</xref>,<xref ref-type="bibr" rid="B12">b</xref>), authors have employed deep learning techniques to control robotic arms for grasping tasks. In a variety of machine learning approaches, Reinforcement learning (RL) (Sutton and Barto, <xref ref-type="bibr" rid="B33">2018</xref>) involves a method that an agent explores the environment to achieve a maximum reward. Therefore, RL is suitable for finding an appropriate IK solution for a redundant robot under the constraint of collision avoidance. In Bing et al. (<xref ref-type="bibr" rid="B6">2023a</xref>,<xref ref-type="bibr" rid="B7">b</xref>), the authors employed meta-RL to control robots in simulated environments to achieve specific objectives. In Bing et al. (<xref ref-type="bibr" rid="B5">2022b</xref>), the authors utilized RL to control the locomotion of a snake-like robot. However, in self-collision avoidance situations, there are challenges in sample acquisition. Direct control of robot joints by reinforcement learning agents may cause difficulties in obtaining successful samples. In contrast, when controlling robot joints using the traditional IK method, collision samples are rare, which makes it difficult for the agent to learn how to avoid collisions.</p>
<p>To address the issues mentioned above an RL-enhanced pseudo-inverse approach is proposed in this paper. The RL solver does not directly control the robot, but imposes an interference to the pseudo-inverse solver to avoid self-collision of the robot. The main contributions of this paper are listed as follows:</p>
<list list-type="bullet">
<list-item><p>Firstly, an RL-enhanced pseudo-inverse solution method is proposed. In this approach, the RL agent outputs interference. The pseudo-inverse solver incorporates these interference into the computation to obtain an IK solution for robotic positioning with self-collision avoidance.</p></list-item>
<list-item><p>A novel replay buffer is designed to adjust sample proportions under self-collision avoidance scenarios. This enhances the learning efficiency of the agent by elevating the diversity of samples in the buffer.</p></list-item>
<list-item><p>Finally, a simulated training and testing environment was established in CoppeliaSim, and the effectiveness of the proposed method is validated through simulations and experiments using the Frank Emika Panda robotic arm.</p></list-item>
</list></sec>
<sec id="s2">
<title>2 Related work</title>
<p>Collision avoidance is always one of the critical issues to be solved in robotic arm control. The method in Guo and Hsia (<xref ref-type="bibr" rid="B20">1990</xref>) and Cheng et al. (<xref ref-type="bibr" rid="B13">1998</xref>) maximizes the distance between the robot arm and the obstacle to avoid collisions, but it is unnecessary to always maximize the distance when the robot is far away from the obstacle. In Duguleana et al. (<xref ref-type="bibr" rid="B18">2012</xref>), the authors propose an improved quadratic programming(QP) problem formulation, representation of the collision-free scheme as a dynamically updated inequality constraint. Haviland and Corke (<xref ref-type="bibr" rid="B22">2021</xref>) present a motion controller which is wrapped into QP. The controller can avoid static and dynamic obstacles while moving to the desired end-effector pose.</p>
<p>As an intelligent learning method, RL does not require an accurate model or prior knowledge, thus providing a new solution to the complex redundant robot control problem. The authors of Al-Hafez and Steil (<xref ref-type="bibr" rid="B2">2021</xref>) take the concept of redundancy resolution and propose a policy search with redundant action bias to control the motion of the robotic arm and avoid collisions by maximizing the distance between the linkage and the obstacle. In Li et al. (<xref ref-type="bibr" rid="B26">2022</xref>) the authors propose a framework that employs deep reinforcement learning (DRL) to find the most efficient path in Cartesian space and to compute the most energy-efficient solution for robot IK. In Bing et al. (<xref ref-type="bibr" rid="B4">2022a</xref>, <xref ref-type="bibr" rid="B8">2023c</xref>), the authors trained robotic arms to evade obstacles and grasp targets using a method based on Hindsight Goal Generation.</p>
<p>In Martin and Mill&#x000E1;n (<xref ref-type="bibr" rid="B27">1997</xref>) the authors employ proximity sensors and reinforcement learning methods to solve the self-collision problem of redundant robotic arms. However, this approach is specific to two-dimensional robotic arms and difficult to extend to three-dimensional space. In Agarwal et al. (<xref ref-type="bibr" rid="B1">2016</xref>) and Schappler and Ortmaier (<xref ref-type="bibr" rid="B32">2021</xref>), the authors employed the null space projection to address singularity avoidance in three-axis planar robots and six-axis serial robots, respectively. In comparison to singularity avoidance, self-collision avoidance requires more consideration of the robotic arm&#x00027;s structure, as arms with different structures have varying joint positions during motion, making it more challenging to predict the location of the links.</p>
<p>In summary, the issue of self-collision in robotic arms becomes increasingly significant with the growth of joint complexity, and there is currently limited attention given to this problem. Traditional collision avoidance methods typically focus on obstacles with fixed or uniform motion or impose restrictions on the number of joint angles of the robotic arm. The proposed method in this paper aims to avoid irregularly moving robotic arm links and is insensitive to both the number and structure of joints in redundant robotic arms.</p></sec>
<sec id="s3">
<title>3 Mathematical model</title>
<p>Generally, the control of a robotic arm is associated with a mapping from the work space to the joint space. However, it is difficult to directly calculate the relationship between the change in end-effector&#x00027;s pose and the change in joints&#x00027; states. Therefore, a common approach is to map the change in pose to the change of joint velocity, and the work space and joint space are related by a Jacobian matrix in the mapping.</p>
<p>Set the desired velocity of the end-effector of the robot arm to be <italic>&#x01E8B;</italic>, which is a 6-dimensional vector (three translations and three rotations), and let <italic>J</italic> denote the Jacobian matrix. The joint velocity of the robot arm is an <italic>n</italic>-dimensional vector, referred to as <inline-formula><mml:math id="M1"><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula>, where <italic>n</italic> represents the number of DoFs of the robot. The velocity of the end-effector <italic>&#x01E8B;</italic> could be obtained from <inline-formula><mml:math id="M2"><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> and <italic>J</italic> in <xref ref-type="disp-formula" rid="E1">Equation (1)</xref>:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x01E8B;</mml:mi><mml:mo>=</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Then, based on the pseudo-inverse method, it yields <xref ref-type="disp-formula" rid="E2">Equation (2)</xref>:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mo class="qopname">min</mml:mo><mml:mo>&#x02225;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo class="qopname">&#x002D9;</mml:mo></mml:mover><mml:msup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">subject to&#x000A0;</mml:mtext><mml:mi>&#x01E8B;</mml:mi><mml:mo>=</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The pseudo-inverse method uses the minimum joint velocity as the optimization objective to improve the motion efficiency. However, this optimization objective cannot satisfy the need for self-collision avoidance. As a result, a controllable interference <inline-formula><mml:math id="M5"><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> is added, where <inline-formula><mml:math id="M6"><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> represents a vector with the same dimension as <inline-formula><mml:math id="M7"><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> in <xref ref-type="disp-formula" rid="E3">Equation (3)</xref>:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mo class="qopname">min</mml:mo><mml:mo>&#x02225;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo class="qopname">&#x002D9;</mml:mo></mml:mover><mml:mo>&#x0002B;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo class="qopname">&#x002D9;</mml:mo></mml:mover><mml:msup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">subject to&#x000A0;</mml:mtext><mml:mi>&#x01E8B;</mml:mi><mml:mo>=</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>According to the Lagrange multiplier method (Boyd and Vandenberghe, <xref ref-type="bibr" rid="B9">2004</xref>), it leads to</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>J</mml:mi><mml:msup><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x01E8B;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Define <italic>J</italic><sup>&#x02020;</sup> &#x0003D; <italic>J</italic><sup><italic>T</italic></sup>(<italic>JJ</italic><sup><italic>T</italic></sup>)<sup>&#x02212;1</sup>. Since <italic>JJ</italic><sup>&#x02020;</sup> &#x0003D; <italic>I</italic>,<italic>J</italic><sup>&#x02020;</sup> is the right pseudo-inverse of the Jacobian matrix <italic>J</italic>. Therefore, <xref ref-type="disp-formula" rid="E4">Equation (4)</xref> is equivalent to</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02020;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x01E8B;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The right pseudo-inverse of the Jacobian matrix can be solved with any of the optimization solvers without affecting the computation of <inline-formula><mml:math id="M11"><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula>. The existence of the pseudo-inverse solution with interference is not affected since the inverse matrix is replaced by the pseudo-inverse and <italic>J</italic><sup>&#x02020;</sup> necessarily exists.</p>
<p>Lemma 1. After adding the interference, the resulting <inline-formula><mml:math id="M12"><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> still allows the end-effector to move with the desired velocity <italic>&#x01E8B;</italic>.</p>
<p><italic>Proof</italic>. Multiplying <italic>J</italic> left on both sides of <xref ref-type="disp-formula" rid="E5">Equation (5)</xref>, then get <xref ref-type="disp-formula" rid="E6">Equation (6)</xref>:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>J</mml:mi><mml:msup><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02020;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x01E8B;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mi>&#x01E8B;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo>-</mml:mo><mml:mi>J</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mi>&#x01E8B;</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This proves that <inline-formula><mml:math id="M14"><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> with interference <inline-formula><mml:math id="M15"><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> can generate the desired velocity for robot motion control.</p></sec>
<sec id="s4">
<title>4 Proposed RL enhanced controller</title>
<p>The proposed controller contains two solvers, the RL solver and the pseudo-inverse solver. The RL solver accepts observation from the environment and returns interference <inline-formula><mml:math id="M16"><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula>. The pseudo-inverse solver accepts the Jacobian matrix <italic>J</italic> from the environment and calculates the velocity <inline-formula><mml:math id="M17"><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> of robot joints in the current state by combining interference <inline-formula><mml:math id="M18"><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> according to the method in <xref ref-type="disp-formula" rid="E5">Equation (5)</xref>. A series of actions will be obtained through the controller. The structure diagram of the controller is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. The pseudo-inverse solver has been explained in Section 3. Next, we will present the key elements of the RL solver and the RL network structure.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Structure of RL enhanced pseudo-inverse controller.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-18-1375309-g0001.tif"/>
</fig>
<sec>
<title>4.1 Observation space</title>
<p>The RL agent receives the environment information through the observation space at each step. In RL, the correct choice of observation space parameters is crucial as the agent needs the correct set of information to understand the causality of a given reward based on the behavior.</p>
<p>The observation space <italic>o</italic><sup><italic>t</italic></sup> is given in <xref ref-type="table" rid="T1">Table 1</xref>. &#x003D5; is an n-dimensional vector to present the current robot joints&#x00027; angle. Define <italic>&#x01E57;</italic> as the difference of coordinates between the target point and the current robot end-effector. The target point position coordinates is denoted as <italic>&#x01E57;</italic><sub><italic>tar</italic></sub>, and the end-effector position is denoted as <italic>&#x01E57;</italic><sub><italic>ee</italic></sub>. <italic>&#x01E57;</italic>, <italic>&#x01E57;</italic><sub><italic>tar</italic></sub>, and <italic>&#x01E57;</italic><sub><italic>ee</italic></sub>, are 3-dimensional vectors. The 3 dimensions correspond to the coordinates of the Cartesian coordinate system in the <italic>x</italic>, <italic>y</italic> and <italic>z</italic> directions. <italic>&#x01E57;</italic> can be obtained from</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M19"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x01E57;</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01E57;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01E57;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>The observation space <italic>o</italic><sup><italic>t</italic></sup> of the RL controller.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>Symbols</bold></th>
<th valign="top" align="left"><bold>Description</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003D5;</td>
<td valign="top" align="left">Current joints angle</td>
</tr> <tr>
<td valign="top" align="left"><italic>&#x01E57;</italic></td>
<td valign="top" align="left">Difference between target and current position</td>
</tr> <tr>
<td valign="top" align="left"><italic>&#x01E59;</italic></td>
<td valign="top" align="left">Difference between initial and current rotation</td>
</tr></tbody>
</table>
</table-wrap>
<p>Define <italic>&#x01E59;</italic> as the angle difference between the end-effector at the beginning and the current position. The initial end-effector rotation is denoted as <italic>&#x01E59;</italic><sub><italic>init</italic></sub>, and the current rotation is denoted as <italic>&#x01E59;</italic><sub><italic>cur</italic></sub>, then obtain the rotation difference from <xref ref-type="disp-formula" rid="E8">Equation (8)</xref>:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M20"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x01E59;</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01E59;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01E59;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Similarly, <italic>&#x01E59;</italic>, <italic>&#x01E59;</italic><sub><italic>init</italic></sub>, and <italic>&#x01E59;</italic><sub><italic>cur</italic></sub> are all 3-dimensional vectors that represent the rotation about the x, y, z axes in the Cartesian coordinate system.</p>
<p>In summary, an overall (<italic>n</italic>&#x0002B;6)-DoFs observation space is used in this work. In the simulation of this paper, the position and rotation information of the end-effector are obtained from the simulator directly. For the robotic arm in a physical environment, this information can be estimated using forward kinematics.</p></sec>
<sec>
<title>4.2 Action space</title>
<p>At time <italic>t</italic>, the final output of the controller is an <italic>n</italic>-dimensional vector <italic>a</italic><sup><italic>t</italic></sup>, denoted as <inline-formula><mml:math id="M21"><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> in the <xref ref-type="disp-formula" rid="E5">Equation (5)</xref>, where <italic>n</italic> represents the DoF of the robotic arm. The values of <italic>a</italic><sup><italic>t</italic></sup> fall within the range of [-1, 1] and are transformed linearly to correspond to the velocities of the respective joints. The RL solver produces the interference vector <italic>i</italic><sup><italic>t</italic></sup>, which is also an <italic>n</italic>-dimensional vector, representing the interference <inline-formula><mml:math id="M22"><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula> in the <xref ref-type="disp-formula" rid="E5">Equation (5)</xref> at time <italic>t</italic>. It is worth noting that this paper primarily focuses on the robot&#x00027;s kinematics, and joint torques are beyond the scope of this study.</p></sec>
<sec>
<title>4.3 Single-step reward</title>
<p>When the RL solver introduces excessive disturbance to the Pseudo-inverse solver, it can result in deviations of the end effector&#x00027;s motion from the target trajectory or unexpected rotations. For instance, when the joint motion of the robotic arm exceeds limits, it may cause a significant deviation of the end effector from the designated motion trajectory.</p>
<p>To prevent the occurrence of the aforementioned phenomena, it is essential to calculate rewards for each step of the RL solver&#x00027;s output. The reward function for the single-step reward comprises translation and rotation components.</p>
<p>For the translation component, we take the current velocity vector of the end-effector, the angle &#x003B8; with the target direction vector, and compute cos(&#x003B8;). In the following steps, we take the single-step translation reward as cos(&#x003B8;)&#x02212;1 to ensure it is negative. When the end-effector moves exactly in the specified direction, the reward is set as the maximum value of 0. The translation component is denoted as <italic>R</italic><sub><italic>l</italic></sub>, it could be found that,</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x01E57;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mi>&#x01E57;</mml:mi><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>v</italic><sub><italic>ee</italic></sub> is a 3-dimensional vector to represent the current translation velocity of the end-effector.</p>
<p>For the rotation component, as this paper primarily focuses on position inverse kinematics issues, we take the default rotation of the end-effector to make a difference with the current rotation. The 2-norm is taken for the resulting vector. Finally, the obtained value is divided by a factor <italic>k</italic> to balance the value with the translation component. With the translation component, a non-positive value is taken for the obtained value, which is rewarded as a maximum value of 0, when the end-effector remains theinitial rotation. Let the rotation component be <italic>R</italic><sub><italic>r</italic></sub>, it can be computed as follows,</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:mi>&#x01E59;</mml:mi><mml:mo>/</mml:mo><mml:mi>k</mml:mi><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>&#x01E59;</italic> be computed from <xref ref-type="disp-formula" rid="E7">Equation (7)</xref>. In this paper, <italic>k</italic> is set as 100. The coefficient <italic>k</italic> balances the translation reward and rotation reward, avoiding that one reward is too large and agent ignores the other one.</p>
<p>Combining <xref ref-type="disp-formula" rid="E9">Equations (9</xref>, <xref ref-type="disp-formula" rid="E10">10)</xref>, yields the reward function for each step</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M25"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>When the end effector moves in the vicinity of the robotic arm according to the specified trajectory, the single-step reward approaches zero. However, if the robotic arm&#x00027;s joints exceed limits or other geometric structural issues impede its normal motion, the single-step reward significantly decreases. This encourages the RL solver to avoid situations where the robotic arm becomes stuck.</p>
<p>It is noteworthy that the rewards returned by <xref ref-type="disp-formula" rid="E11">Equation (11)</xref> are for each step and do not encompass the rewards for each episode. Episode rewards will be explained in the next subsection.</p></sec>
<sec>
<title>4.4 Episode rewards</title>
<p>The single-step reward calculation involves the rewards received by the agent for each action taken. Episode rewards, on the other hand, are computed based on the outcomes after the completion of an episode. Since the position relationship between adjacent state-action pairs is lost during retraining after replays are placed into the buffer, it is necessary to finalize the reward allocation for each episode before adding the replay to the buffer. In this paper, a Monte Carlo-like method is employed, where the final reward accumulated at the end of an episode is propagated backward and assigned to all steps within that episode.</p>
<p>Assume there are <italic>k</italic> replays in one episode, meaning that this episode takes <italic>k</italic> steps. <italic>R</italic><sub><italic>j</italic></sub> is the <italic>j</italic>th step reward in episode buffer, then adjusted the rewards in episode buffer as</p>
<disp-formula id="E12"><label>(12)</label><mml:math id="M26"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>R</italic><sub><italic>a</italic></sub> is the adjusted reward. <italic>R</italic><sub><italic>end</italic></sub> is the end reward of an episode, based on if the episode results in success or failure, its value is represented by <xref ref-type="disp-formula" rid="E13">Equation (13)</xref>.</p>
<disp-formula id="E13"><label>(13)</label><mml:math id="M27"><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mi>e</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo> <mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>w</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>h</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mtd><mml:mtd><mml:mrow><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow> </mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <italic>R</italic><sub><italic>pos</italic></sub>&#x0003E;<italic>R</italic><sub><italic>neg</italic></sub>, it indicates that the reward obtained upon successfully reaching the target position exceeds the reward obtained upon failure.</p>
<p>The intuition behind <xref ref-type="disp-formula" rid="E12">Equation (12)</xref> is to reinforce the correlation between adjacent replays. If an action results in the robotic arm moving toward a collision direction, the reward for the corresponding state-action pair is correspondingly reduced.</p>
<p>While it is possible to enhance training performance by magnifying the reward/punishment values at the conclusion of each episode and transmitting this value through the Bellman equation, it is observed that directly amplifying these values during training often led to frequent training failures. The approach outlined in <xref ref-type="disp-formula" rid="E12">Equation (12)</xref>, however, facilitates more successful training. We hypothesize that the use of <xref ref-type="disp-formula" rid="E12">Equation (12)</xref> disperses the reward/punishment values across multiple records, preventing issues associated with excessively large gradients.</p></sec>
<sec>
<title>4.5 Dynamic balancing replay buffer</title>
<p>The role of the replay buffer in RL is to improve the utilization of samples and to enable the agent to train offline. The rewards generated by the interaction between the agent and the environment are stored in the replay buffer.</p>
<p>The usual approach is to add the rewards returned by the environment directly to the replay buffer, then randomly select some replays from the buffer to train the agent. However, there are two challenges in this paper.</p>
<p>One of the challenges is that the movement of the robot is continuous, and collisions occur not triggered by only one action. The usual approach destroys the cause-and-effect relationship between adjacent actions.</p>
<p>The other challenge is that during the training of the combined controller, the domination of any solver causes reward/penalty sparsity. For example, when the RL solver dominates, the untrained RL solver makes the robot always collide and causes it difficult to converge. On the other hand, when the pseudo-inverse solver dominates, the robot rarely collides, which leads the RL solver difficult to get trained because of the lack of collision samples.</p>
<p>The first challenge was addressed in the preceding subsection through an episode-based approach. As for the second challenge, the proposed solution in this paper involves the introduction of a dynamic balancing mechanism for the replay buffer.</p>
<p>Let <italic>info</italic> return at the end of an episode. Return <italic>True</italic> when the end-effector successfully reaches the target, otherwise return <italic>False</italic>.</p>
<p>Let the total number of steps from successful episodes in the current replay buffer to <italic>n</italic><sub><italic>s</italic></sub> and the total number of steps from failed episodes to <italic>n</italic><sub><italic>f</italic></sub>.</p>
<p>The replays in episode buffer will be added to the replay buffer only when:</p>
<list list-type="order">
<list-item><p><italic>info</italic> &#x0003D; <italic>True and n</italic><sub><italic>s</italic></sub> &#x02264; <italic>n</italic><sub><italic>f</italic></sub>, or</p></list-item>
<list-item><p><italic>info</italic> &#x0003D; <italic>False and n</italic><sub><italic>s</italic></sub>&#x0003E;<italic>n</italic><sub><italic>f</italic></sub></p></list-item>
</list>
<p>This dynamic balancing mechanism ensures that the number of successful and failed steps in the replay buffer is similar, thus avoiding the problem of difficult convergence due to the lack of samples.</p></sec>
<sec>
<title>4.6 Network and training</title>
<p>Given the input (observation <italic>o</italic><sup><italic>t</italic></sup>) and the output (action <italic>a</italic><sup><italic>t</italic></sup>), the details of the network structure are introduced. We train our agent using the TD3 (Fujimoto et al., <xref ref-type="bibr" rid="B19">2018</xref>) based on the Actor-Critic architecture, thus requiring two Critic networks and one Actor network. Each Critic network contains two fully connected hidden layers that act as non-linear function approximators of Q value. The dimension of the input layer is the sum of the dimensions of <italic>o</italic><sup><italic>t</italic></sup> and <italic>a</italic><sup><italic>t</italic></sup>. Both hidden layers have 128 PReLU (He et al., <xref ref-type="bibr" rid="B23">2015</xref>) units. The last layer outputs a Q value. The Actor network also contains two fully connected hidden layers. The input layer has the same dimensions as <italic>o</italic><sup><italic>t</italic></sup>. Both two hidden layers have 128 ReLU units, with the last layer outputting the interference <inline-formula><mml:math id="M28"><mml:mover accent="true"><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:math></inline-formula>.</p>
<p>Theoretically, any deterministic policy RL algorithm can be applied to our method, but the RL algorithm with a stochastic policy may cause the robot arm to jitter, such as the SAC (Haarnoja et al., <xref ref-type="bibr" rid="B21">2018</xref>) we have tested which shown in Section 5.</p></sec></sec>
<sec id="s5">
<title>5 Simulation and experiment</title>
<p>In this section, simulative and experimental validations were conducted to assess the effectiveness of the proposed approach in avoiding self-collision for redundant robotic arms. In both simulation and experimentation, the end effector&#x00027;s motion trajectory was defined to compel its movement in proximity to the robotic arm&#x00027;s own structure. The experimental scenarios simulated situations that might occur when an operator directly manipulates the end effector of the robotic arm.</p>
<p>Coppeliasim is a kind of mainstream robot simulator, which can simulate the motion and collision detection of robot arms. In the simulation, a built-in Franka Panda robotic arm with 7-DoFs are employed. The initial joint angles are set to [0&#x000B0;, &#x02212;17&#x000B0;, 0&#x000B0;, &#x02212;126&#x000B0;, 0&#x000B0;, 114&#x000B0;, 45&#x000B0;]. The initial coordinate position of the end-effector is [0.499<italic>m</italic>, 0<italic>m</italic>, 1.189<italic>m</italic>].</p>
<p>The goal of the simulation and experiment is to guide the end-effector motion to the target point. The target points are generated in a hemispherical space of 0.5<italic>m</italic> radius around the first joint of the robot arm. A series of path points are generated at 0.01 m intervals along a straight line between the initial position and the target point position to force the end-effector to move near the robot arm itself. When the distance between the end effector and the target is less than 0.001 meters, the end effector is considered to have reached its destination. Millimeter-level accuracy is deemed sufficiently precise given the range of motion of the robotic arm in this experiment. Additionally, the methods employed in this study are iterative, and demanding excessive precision would result in the control algorithm consuming an impractical number of steps during the final convergence process. In the simulation, the built-in collision detector in Coppeliasim is employed to detect whether the robotic arm has collided. PyRep (James et al., <xref ref-type="bibr" rid="B24">2019</xref>) is employed to set up the reinforcement learning environment.</p>
<p>The platform configuration of training and simulation is listed as follows: CPU: i9-9900K; GPU: 2080ti; Memory: 64G. Coppeliasim version is 4.1 EDU and the operating system is Ubuntu 18.04 LTS.</p>
<sec>
<title>5.1 Convergence validation simulation</title>
<p>This simulation compares the convergence with and without the use of our improved replay buffer. The algorithm has been trained for 1000 episodes, which is about 200,000 steps. The parameters employed during training are outlined in <xref ref-type="table" rid="T2">Table 2</xref>. For the sake of stability during training, Stochastic Gradient Descent (SGD) was utilized instead of momentum-based optimizers. The results shown in <xref ref-type="fig" rid="F2">Figure 2</xref> represent the upper and lower bounds and average values for 25 replicate simulations.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Convergence verification parameter.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>Parameters</bold></th>
<th valign="top" align="center"><bold>Value</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Episodes</td>
<td valign="top" align="center">1, 000</td>
</tr> <tr>
<td valign="top" align="left">Critic learning rate</td>
<td valign="top" align="center">2 &#x000D7; 10<sup>&#x02212;4</sup></td>
</tr> <tr>
<td valign="top" align="left">Actor learning rate</td>
<td valign="top" align="center">1 &#x000D7; 10<sup>&#x02212;4</sup></td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Convergence verification under initialization conditions dominated by RL solver.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-18-1375309-g0002.tif"/>
</fig>
<p>As can be seen from the Critic convergence curve, the improved replay buffer makes the algorithm converge more stably. On the other hand, the unimproved replay buffer affected Critic&#x00027;s convergence and even led to failure to converge. Since Critic estimates Q values of state-action pairs, it is verified that our improved replay buffer improves the accuracy of the agent&#x00027;s estimation of the environment.</p>
<p>While the lower Actor loss is better, the unimproved replay buffer leads to an inaccurate estimate of Critic. Thus the lower loss under the unimproved replay buffer is not convincing. In contrast, our improved replay buffer exhibits more stability compared to the unimproved replay buffer.</p>
<p>Benefiting from better stability, our improved replay buffer allows for higher rewards per step and lower volatility during multiple training sessions.</p></sec>
<sec>
<title>5.2 Single target positions arrival experiment</title>
<p>In this simulation, 1071 final target points are evenly distributed in the hemispherical space around the robot arm. In each episode, the end-effector of the robotic arm starts from the same initial position and follows a moving target to reach the final target point. The moving target moves in a straight line, forcing the end-effector to move close to the arm&#x00027;s own structure.</p>
<p>Six different methods are compared in this simulation. To compare methods without collision avoidance policies, PI (pseudo-inverse) and TJ (Jacobian Transpose) methods are introduced in this simulation as comparisons. NEO (Haviland and Corke, <xref ref-type="bibr" rid="B22">2021</xref>) is introduced into the comparison as a non-learning-based obstacle avoidance method. The RL method uses only the TD3 algorithm as a comparison of not improving on the replay buffer. HER (Andrychowicz et al., <xref ref-type="bibr" rid="B3">2017</xref>) is introduced as a method that improved the replay buffer in this comparison. Replay buffer in HER method uses future choosing strategy. It is difficult to predict the timing of collision occurrence in the simulation, thus the update of the replay buffer in HER method is to encourage the tracking of target points by the agent. OURS method is the enhanced pseudo-inverse method of reinforcement learning with an improved replay buffer proposed in this paper.</p>
<p>The max number of steps for each episode is 1000. There are three outcomes for each episode, successful arrival at the target position (Success), exceeding the step limit but no collision (Run out), and Collision. The results of this simulation are summarized in <xref ref-type="fig" rid="F3">Figure 3</xref> and <xref ref-type="table" rid="T3">Table 3</xref>. The number of the three results in a single simulation is counted in the Success, Run out, and Collision columns respectively. Avg steps column counts the average number of steps spent in the episode of Success. The avg reward column counts the average reward per step. In this simulation, the reward is calculated according to <xref ref-type="disp-formula" rid="E11">Equation (11)</xref>, reflecting the tracking performance of the end-effector on the target point under the control of different methods.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Ability of the end-effector to reach a single target position under control of different approaches. Yellow dots indicate successful arrival at the target. Green dots indicate that the number of steps is used up and the target is not reached (except NEO). Purple dots indicate collisions (except NEO).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-18-1375309-g0003.tif"/>
</fig>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Details of the end-effector reaching single position under the control of different methods.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Success</bold></th>
<th valign="top" align="center"><bold>Run out</bold></th>
<th valign="top" align="center"><bold>Collision</bold></th>
<th valign="top" align="center"><bold>Avg steps</bold></th>
<th valign="top" align="center"><bold>Avg reward</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">PI</td>
<td valign="top" align="center">1,025</td>
<td valign="top" align="center">5</td>
<td valign="top" align="center">41</td>
<td valign="top" align="center">97.68441</td>
<td valign="top" align="center">-0.04706</td>
</tr> <tr>
<td valign="top" align="left">TJ</td>
<td valign="top" align="center">961</td>
<td valign="top" align="center">33</td>
<td valign="top" align="center">77</td>
<td valign="top" align="center">514.761</td>
<td valign="top" align="center">-0.18328</td>
</tr> <tr>
<td valign="top" align="left">NEO</td>
<td valign="top" align="center">1,023</td>
<td valign="top" align="center">48</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">222.2605</td>
<td valign="top" align="center">-0.10953</td>
</tr> <tr>
<td valign="top" align="left">RL</td>
<td valign="top" align="center">1,021</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">50</td>
<td valign="top" align="center">219.8609</td>
<td valign="top" align="center">-0.05474</td>
</tr> <tr>
<td valign="top" align="left">HER</td>
<td valign="top" align="center">1,014</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">57</td>
<td valign="top" align="center">162.2745</td>
<td valign="top" align="center">-0.0493</td>
</tr> <tr>
<td valign="top" align="left">OURS</td>
<td valign="top" align="center">1,068</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">234.6667</td>
<td valign="top" align="center">-0.06235</td>
</tr></tbody>
</table>
</table-wrap>
<p>In <xref ref-type="fig" rid="F3">Figure 3</xref>, yellow dots indicate the positions where the end-effector can reach the target successfully. Green dots indicate positions that cannot be reached before the steps run out. Purple dots indicate collisions when reaching these positions. The NEO method did not collide, so points indicating Run out are marked in purple for clarity in observation. The figure shows that moving the end-effector from the initial position to the back of the robot arm is challenging for the controller. The RL method without an improved replay buffer does not have an advantage over the traditional numerical method. The optimization-based collision avoidance approach NEO has difficulty in finding a suitable solution for avoiding irregularly moving robotic arm links, especially when a well-defined path is given.</p>
<p>According to the data in <xref ref-type="table" rid="T3">Table 3</xref>, it can be seen that our method has the highest arrival rate (99.72%), but the Avg reward is slightly lower. This could be attributed to a decrease in the end-effector movement speed under Agent control (resulting in a higher average number of steps compared to PI) and slight deviations from the planned trajectory, leading to a reduction in rewards. NEO is able to avoid collisions completely in this simulation but has more Run out cases. In the simulation, it was observed that NEO causes the robot arm to get stuck in certain poses and cannot continue tracking the target. Also, the NEO method prefers to find the trajectory freely, so it cannot track the target well under strict constraints on the trajectory, which is reflected by having a lower Avg reward. Attributed to encouraging target tracking, HER has lower Avg steps and higher Avg reward but does not do better in avoiding collisions and reaching the final target.</p>
<p><xref ref-type="fig" rid="F4">Figures 4A</xref>, <xref ref-type="fig" rid="F4">B</xref> show how our method avoids self-collisions during tracing the target. The blue curve in the figure shows the trajectory of the end-effector. It can be seen that the end-effector maintains a smooth motion path under both methods. In <xref ref-type="fig" rid="F4">Figure 4A</xref>, the PI method is employed to control the motion of the robotic arm. This resulted in a collision between the end-effector and the robot arm linkage at the red circle. In <xref ref-type="fig" rid="F4">Figure 4B</xref>, our method controls the robot arm to rotate its own mechanism to move away from the motion path of the end-effector. Because of this behavior, our method can avoid self-collisions.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Path under different control methods. The cyan line depicts the motion trajectory of the end-effector, while the red sphere indicates the target position of the end-effector. Red circles highlight locations where collisions occur under the control of the pseudoinverse method. <bold>(A)</bold> PI posture. <bold>(B)</bold> Ours posture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-18-1375309-g0004.tif"/>
</fig></sec>
<sec>
<title>5.3 Three target positions arrival simulation</title>
<p>In each episode of this simulation, three final target positions are randomly generated. The controller needs to guide the end-effector to reach the three positions consecutively. Unlike the single target experiment, the initial of each segment of the moving robot arm cannot be predicted. The experiment is executed for 1,000 episodes. The upper limit of steps per episode is 3,000. The performance of the six methods is compared. The target positions are pre-generated to ensure that each method faces the same challenge.</p>
<p>The results of the simulation are shown in <xref ref-type="table" rid="T4">Table 4</xref>. Similar to Section 5.2, the controller guides the end-effector to reach the three final target positions in sequence and is recorded as Success. Failure to reach any target position or collision in the middle of the episode is recorded as run out or collision. The table also shows the average steps per episode and the average reward per step.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Details of the end-effector reaching three different positions under the control of different methods.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Success</bold></th>
<th valign="top" align="center"><bold>Run out</bold></th>
<th valign="top" align="center"><bold>Collision</bold></th>
<th valign="top" align="center"><bold>Avg step</bold></th>
<th valign="top" align="center"><bold>Avg reward</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">PI</td>
<td valign="top" align="center">778</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">222</td>
<td valign="top" align="center">260.327</td>
<td valign="top" align="center">-0.08718</td>
</tr> <tr>
<td valign="top" align="left">TJ</td>
<td valign="top" align="center">796</td>
<td valign="top" align="center">112</td>
<td valign="top" align="center">92</td>
<td valign="top" align="center">1294.823</td>
<td valign="top" align="center">-0.14426</td>
</tr> <tr>
<td valign="top" align="left">NEO</td>
<td valign="top" align="center">742</td>
<td valign="top" align="center">160</td>
<td valign="top" align="center">98</td>
<td valign="top" align="center">528.74</td>
<td valign="top" align="center">-0.10586</td>
</tr> <tr>
<td valign="top" align="left">RL</td>
<td valign="top" align="center">751</td>
<td valign="top" align="center">23</td>
<td valign="top" align="center">226</td>
<td valign="top" align="center">463.561</td>
<td valign="top" align="center">-0.08597</td>
</tr> <tr>
<td valign="top" align="left">HER</td>
<td valign="top" align="center">723</td>
<td valign="top" align="center">5</td>
<td valign="top" align="center">272</td>
<td valign="top" align="center">393.234</td>
<td valign="top" align="center">-0.07653</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">931</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">69</td>
<td valign="top" align="center">504.456</td>
<td valign="top" align="center">-0.05887</td>
</tr></tbody>
</table>
</table-wrap>
<p>The data in the <xref ref-type="table" rid="T4">Table 4</xref> shows that our method has a higher successful arrival percentage than the method without the improved replay buffer, which indicates that the improved replay buffer enhances the performance of the RL algorithm. Our method also has an advantage over the traditional method, which indicates that the combined reinforcement learning and pseudo-inverse methods give the robot arm a better ability to avoid self-collision. The lower average reward for all methods compared to the single target position experiments is due to the longer average number of steps, thus generating more negative rewards.</p></sec>
<sec>
<title>5.4 Ablation experiments</title>
<p>The purpose of this section is to evaluate whether the improvements for different challenges improve performance. To this end, we repeat the experiments of the Section 5.2 section and keep all other parameter settings identical.</p>
<p>To verify the effectiveness of the reward adjustment in <xref ref-type="disp-formula" rid="E12">Equation (12)</xref>, we select different &#x003B3; for training and tested the training results in simulation. The results of the test are shown in <xref ref-type="table" rid="T5">Table 5</xref>.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Details of simulated data with different success and failed buffer ratios.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>&#x003B3;</bold></th>
<th valign="top" align="center"><bold>Success</bold></th>
<th valign="top" align="center"><bold>Run out</bold></th>
<th valign="top" align="center"><bold>Collision</bold></th>
<th valign="top" align="center"><bold>Avg step</bold></th>
<th valign="top" align="center"><bold>Avg reward</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">0.99</td>
<td valign="top" align="center">1,061</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">171.2754</td>
<td valign="top" align="center">-0.0412</td>
</tr> <tr>
<td valign="top" align="left">0.95</td>
<td valign="top" align="center">1,041</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">30</td>
<td valign="top" align="center">164.7703</td>
<td valign="top" align="center">-0.05292</td>
</tr> <tr>
<td valign="top" align="left">0.9</td>
<td valign="top" align="center">1,058</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">13</td>
<td valign="top" align="center">170.605</td>
<td valign="top" align="center">-0.04269</td>
</tr> <tr>
<td valign="top" align="left">0.7</td>
<td valign="top" align="center">1,061</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">172.6872</td>
<td valign="top" align="center">-0.05023</td>
</tr> <tr>
<td valign="top" align="left">0.5</td>
<td valign="top" align="center">1,050</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">20</td>
<td valign="top" align="center">168.5556</td>
<td valign="top" align="center">-0.04323</td>
</tr> <tr>
<td valign="top" align="left">0.2</td>
<td valign="top" align="center">1,061</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">172.6872</td>
<td valign="top" align="center">-0.05024</td>
</tr> <tr>
<td valign="top" align="left">0.1</td>
<td valign="top" align="center">1,042</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">29</td>
<td valign="top" align="center">166.549</td>
<td valign="top" align="center">-0.04163</td>
</tr> <tr>
<td valign="top" align="left">0.05</td>
<td valign="top" align="center">1,053</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">18</td>
<td valign="top" align="center">169.4585</td>
<td valign="top" align="center">-0.04364</td>
</tr> <tr>
<td valign="top" align="left">0.01</td>
<td valign="top" align="center">1,041</td>
<td valign="top" align="center">20</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">168.8571</td>
<td valign="top" align="center">-0.06217</td>
</tr></tbody>
</table>
</table-wrap>
<p>As can be seen from the table, the adjusted reward has a positive effect on avoiding self-collision of the robot arm, but the effect does not increase linearly with &#x003B3;. The success rate of tracking showed two peaks when &#x003B3; is close to 0.2 and 0.7. In addition, the algorithm shows better stability when &#x003B3; is set to 0.2 in repeated training.</p>
<p>To verify the effectiveness of the dynamic balancing mechanism, a set of simulations with different ratios of success and failed replay buffer are performed. The results of the simulation are shown in <xref ref-type="table" rid="T6">Table 6</xref>.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Details of simulated data with different success and failed buffer ratios.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>Percent</bold></th>
<th valign="top" align="center"><bold>Success</bold></th>
<th valign="top" align="center"><bold>Run out</bold></th>
<th valign="top" align="center"><bold>Collision</bold></th>
<th valign="top" align="center"><bold>Avg step</bold></th>
<th valign="top" align="center"><bold>Avg reward</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">No balance</td>
<td valign="top" align="center">1,042</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">29</td>
<td valign="top" align="center">166.3119</td>
<td valign="top" align="center">-0.04251</td>
</tr> <tr>
<td valign="top" align="left">50%</td>
<td valign="top" align="center">1,061</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">10</td>
<td valign="top" align="center">172.3688</td>
<td valign="top" align="center">-0.05038</td>
</tr> <tr>
<td valign="top" align="left">75%</td>
<td valign="top" align="center">1,056</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">15</td>
<td valign="top" align="center">170.0355</td>
<td valign="top" align="center">-0.04462</td>
</tr> <tr>
<td valign="top" align="left">100%</td>
<td valign="top" align="center">819</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">252</td>
<td valign="top" align="center">118.535</td>
<td valign="top" align="center">-0.04596</td>
</tr></tbody>
</table>
</table-wrap>
<p>This simulation compares the tracking of end-effectors with four different ratios. In the No balance simulation, there is no limit on the percentage of successful and failed replay, and the percentage of failed replay is about 32.37% (135,480 of 418,525). The remaining simulations limit the percentage of failed replay to about 50%, about 75%, and 100%.</p>
<p>As can be seen in <xref ref-type="table" rid="T6">Table 6</xref>, Balancing successful and failed replay by 50%-50% can effectively reduce the probability of self-collision of the robot arm, although this increases the average number of steps and slightly reduces the average reward. Slightly more failed replay (75%) also reduces the likelihood of self-collision, but not as effectively as keeping it at 50%. The probability of collision increases when using failed replays entirely, due to the lack of successful samples resulting in the agent&#x00027;s inability to properly evaluate the environment.</p>
<p>Two conclusions can be drawn from the simulation results as follows. A) The dynamic balance replay proposed in this paper is effective in avoiding self-collision of the robotic arm. B) Appropriate discarding of some replay may have a positive impact on the behavioral strategy of the agent.</p>
<p>To examine the impact of distinct weights for <italic>R</italic><sub><italic>l</italic></sub> and <italic>R</italic><sub><italic>r</italic></sub> on the agent as stipulated in <xref ref-type="disp-formula" rid="E11">Equation (11)</xref>, the reward function is configured as <xref ref-type="disp-formula" rid="E14">Equation (14)</xref></p>
<disp-formula id="E14"><label>(14)</label><mml:math id="M29"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The performance of the proposed apporach is evaluated under varying &#x003B1; values, and the results are presented in <xref ref-type="table" rid="T7">Table 7</xref>.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Details of simulated data with different success and failed buffer ratios.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>Ratio &#x003B1;</bold></th>
<th valign="top" align="center"><bold>Success</bold></th>
<th valign="top" align="center"><bold>Run out</bold></th>
<th valign="top" align="center"><bold>Collision</bold></th>
<th valign="top" align="center"><bold>Avg step</bold></th>
<th valign="top" align="center"><bold>Avg reward</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">0.25</td>
<td valign="top" align="center">1,029</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">41</td>
<td valign="top" align="center">173.6502</td>
<td valign="top" align="center">-0.05646</td>
</tr> <tr>
<td valign="top" align="left">0.5</td>
<td valign="top" align="center">1,038</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">33</td>
<td valign="top" align="center">174.3946</td>
<td valign="top" align="center">-0.05828</td>
</tr> <tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">1,056</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">15</td>
<td valign="top" align="center">173.4608</td>
<td valign="top" align="center">-0.05111</td>
</tr> <tr>
<td valign="top" align="left">1.5</td>
<td valign="top" align="center">1,032</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">39</td>
<td valign="top" align="center">172.9067</td>
<td valign="top" align="center">-0.05961</td>
</tr> <tr>
<td valign="top" align="left">1.75</td>
<td valign="top" align="center">1,027</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">44</td>
<td valign="top" align="center">172.9543</td>
<td valign="top" align="center">-0.05935</td>
</tr></tbody>
</table>
</table-wrap>
<p>From <xref ref-type="table" rid="T7">Table 7</xref>, it can be observed that the proposed method performs favorably when &#x003B1; is set to 1. Therefore, in this paper, &#x003B1; is chosen to be 1, as indicated in <xref ref-type="disp-formula" rid="E11">Equation (11)</xref>.</p></sec>
<sec>
<title>5.5 Motion smoothness demonstration</title>
<p>To verify that the method in this paper not only smooths the end-effector motion but also keeps the velocity of the robotic arm joints smooth. All joint velocities for 200 consecutive steps were collected to verify the smoothness of the robot arm joint motion.</p>
<p>Let the Action of step <italic>t</italic> be <inline-formula><mml:math id="M30"><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. The acceleration &#x00227;<sup><italic>t</italic></sup> is calculated by approximating the velocity of two adjacent steps as <xref ref-type="disp-formula" rid="E15">Equation (15)</xref></p>
<disp-formula id="E15"><label>(15)</label><mml:math id="M31"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>&#x00227;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mo>&#x002D9;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>To represent the acceleration as a single value, a first order norm for &#x00227;<sup><italic>t</italic></sup> has been taken. After that, divide it by the number of joints to get the average acceleration <inline-formula><mml:math id="M32"><mml:msup><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> of the joints at step <italic>t</italic>. The formula is expressed as <xref ref-type="disp-formula" rid="E16">Equation (16)</xref></p>
<disp-formula id="E16"><label>(16)</label><mml:math id="M33"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:msup><mml:mrow><mml:mi>&#x00227;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mo>|</mml:mo><mml:mo>/</mml:mo><mml:mi>n</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>To visualize the results, the acceleration curves of Our method (Combined deterministic algorithm TD3 and pseudo-inverse) and SAC (Combined stochastic algorithm and pseudo-inverse) have been plotted. As a comparison, the acceleration curves of PI, DLS, and TJ are also plotted. The results are shown in <xref ref-type="fig" rid="F5">Figure 5</xref>, and detailed data are shown in <xref ref-type="table" rid="T8">Table 8</xref>.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Average acceleration of robot arm joints under different methods of control.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-18-1375309-g0005.tif"/>
</fig>
<table-wrap position="float" id="T8">
<label>Table 8</label>
<caption><p>The average acceleration of each joint of the robot arm under different methods.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919497;color:#ffffff">
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Average</bold></th>
<th valign="top" align="center"><bold>Maximum</bold></th>
<th valign="top" align="center"><bold>Variances</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">8.8e-3</td>
<td valign="top" align="center">0.17</td>
<td valign="top" align="center">2.8e-4</td>
</tr> <tr>
<td valign="top" align="left">SAC</td>
<td valign="top" align="center">0.22</td>
<td valign="top" align="center">0.71</td>
<td valign="top" align="center">2.3e-2</td>
</tr> <tr>
<td valign="top" align="left">PI</td>
<td valign="top" align="center">5.6e-3</td>
<td valign="top" align="center">0.05</td>
<td valign="top" align="center">3.7e-5</td>
</tr> <tr>
<td valign="top" align="left">DLS</td>
<td valign="top" align="center">2.4e-3</td>
<td valign="top" align="center">0.03</td>
<td valign="top" align="center">9.9e-6</td>
</tr> <tr>
<td valign="top" align="left">TJ</td>
<td valign="top" align="center">1.7e-4</td>
<td valign="top" align="center">5e-4</td>
<td valign="top" align="center">2.3e-8</td>
</tr></tbody>
</table>
</table-wrap>
<p>As can be seen in <xref ref-type="fig" rid="F5">Figure 5</xref>, the curves of both our method and the SAC method produce significant fluctuations. This indicates that the control of the reinforcement learning method caused the jitter of the robot arm joints. However, the acceleration range of our method is closer to that of PI, which is two orders of magnitude lower than that of the SAC method. This is because our controller employed a deterministic reinforcement learning algorithm and calculate the final result by numerical methods. The PI, DLS, and TJ methods have smoother acceleration profiles because they are calculated in a purely mathematical way, with large fluctuations only in the case of target changes.</p>
<p>In <xref ref-type="table" rid="T8">Table 8</xref>, the mean, maximum, and variance of the acceleration for each of the five methods in 200 steps are calculated. These data are used to reflect the variation of acceleration for the above methods.</p>
<p>Our method is closer to the numerical methods PI and DLS in terms of average acceleration and maximum acceleration as shown in <xref ref-type="table" rid="T8">Table 8</xref>. The average acceleration of our method is 3 orders of magnitude lower, while the variance of acceleration is 2 orders of magnitude lower than the SAC method. Therefore, when the robot arm is controlled by our method, not only the trajectory of the end-effector is smooth, but also the motion of the other joints of the robot arm is steady.</p>
<p>The TJ method has the best acceleration performance. However, as it is known from previous experiments, the TJ method requires more steps to reach the target, so its lower acceleration is due to its slow movement speed.</p></sec>
<sec>
<title>5.6 Experiment</title>
<p>In this experiment, our method is deployed on a real Franka Emika Panda robot. The goal of the experiment is to move the end-effector along a straight line to the rear of the robotic arm to examine the ability of the algorithm to avoid collisions. The initial position of the end-effector is [0.5, 0.0, 0.36], and the target arrival position is [-0.4, 0.13, 0.46], measured in meters. The motion trajectory is a straight line connecting the initial position to the target position. To ensure that the end-effector moves along the designated path, 383 waypoints were inserted along the motion trajectory. Our method provides inverse kinematics (IK) solutions at a frequency of 10Hz, while the Franka Panda robotic arm receives control signals at a frequency of 1,000 Hz. To accommodate this, we performed B-spline linear interpolation along the trajectory, resulting in a total of 38,500 points in the final path, including the starting and ending points.</p>
<p><xref ref-type="fig" rid="F6">Figure 6</xref> shows the motion process of the robotic arm from two views. Under the guidance of our controller, the robotic arm adjusts the linkage position and bypasses the end-effector to avoid collisions.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>The upper row is the right view of the motion process, the lower row is the front view of the motion process.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-18-1375309-g0006.tif"/>
</fig>
</sec></sec>
<sec sec-type="conclusions" id="s6">
<title>6 Conclusion</title>
<p>Introducing reinforcement learning into the field of robot control remains challenging. This encompasses searching for suitable solutions within complex solution spaces and ensuring smooth robot motions. This paper proposes a reinforcement learning-enhanced pseudo-inverse method for robotic arm control, aiming to maintain the smoothness of robot motions with self-collisions avoided. Although our current work is focuses solely on the inverse kinematics of the end effector position, there is potential for broader applications in the context of redundant robotic arms, such as in human-robot collaboration or medical scenarios.</p></sec>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p></sec>
<sec sec-type="author-contributions" id="s8">
<title>Author contributions</title>
<p>TH: Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. WL: Writing &#x02013; review &#x00026; editing. KH: Writing &#x02013; review &#x00026; editing.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s9">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported in part by the National Natural Science Foundation of China under Grant 62206317, in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2022A1515012186, and in part by the Guangzhou Basic and Applied Basic Research Foundation under Grant 202201011523.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec><sec sec-type="supplementary-material" id="s11">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fnbot.2024.1375309/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fnbot.2024.1375309/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Video_1.MP4" id="SM1" mimetype="video/mp4" xmlns:xlink="http://www.w3.org/1999/xlink"/></sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Agarwal</surname> <given-names>A.</given-names></name> <name><surname>Nasa</surname> <given-names>C.</given-names></name> <name><surname>Bandyopadhyay</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>Dynamic singularity avoidance for parallel manipulators using a task-priority based control scheme</article-title>. <source>Mechan. Mach. Theory</source> <volume>96</volume>, <fpage>107</fpage>&#x02013;<lpage>126</lpage>. <pub-id pub-id-type="doi">10.1016/j.mechmachtheory.2015.07.013</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Al-Hafez</surname> <given-names>F.</given-names></name> <name><surname>Steil</surname> <given-names>J. J.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Redundancy resolution as action bias in policy search for robotic manipulation,&#x0201D;</article-title> in <source>Conference on Robot Learning</source> (<publisher-loc>Atlanta</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>981</fpage>&#x02013;<lpage>990</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Andrychowicz</surname> <given-names>M.</given-names></name> <name><surname>Wolski</surname> <given-names>F.</given-names></name> <name><surname>Ray</surname> <given-names>A.</given-names></name> <name><surname>Schneider</surname> <given-names>J.</given-names></name> <name><surname>Fong</surname> <given-names>R.</given-names></name> <name><surname>Welinder</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Hindsight experience replay,&#x0201D;</article-title> in <source>Advance Neural Information Processing System</source>, 30.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>Z.</given-names></name> <name><surname>Brucker</surname> <given-names>M.</given-names></name> <name><surname>Morin</surname> <given-names>F. O.</given-names></name> <name><surname>Li</surname> <given-names>R.</given-names></name> <name><surname>Su</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2022a</year>). <article-title>Complex robotic manipulation via graph-based hindsight goal generation</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>33</volume>, <fpage>7863</fpage>&#x02013;<lpage>7876</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2021.3088947</pub-id><pub-id pub-id-type="pmid">34181552</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>Z.</given-names></name> <name><surname>Cheng</surname> <given-names>L.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name> <name><surname>Knoll</surname> <given-names>A.</given-names></name></person-group> (<year>2022b</year>). <article-title>Simulation to real: learning energy-efficient slithering gaits for a snake-like robot</article-title>. <source>IEEE Robot. Autom. Magaz</source>. <volume>29</volume>, <fpage>92</fpage>&#x02013;<lpage>103</lpage>. <pub-id pub-id-type="doi">10.1109/MRA.2022.3204237</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>Z.</given-names></name> <name><surname>Knak</surname> <given-names>L.</given-names></name> <name><surname>Cheng</surname> <given-names>L.</given-names></name> <name><surname>Morin</surname> <given-names>F. O.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name> <name><surname>Knoll</surname> <given-names>A.</given-names></name></person-group> (<year>2023a</year>). <article-title>&#x0201C;Meta-reinforcement learning in nonstationary and nonparametric environments,&#x0201D;</article-title> in <source>IEEE Transactions on Neural Networks and Learning Systems</source>, <fpage>1</fpage>&#x02013;<lpage>15</lpage>.<pub-id pub-id-type="pmid">37224358</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>Z.</given-names></name> <name><surname>Lerch</surname> <given-names>D.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name> <name><surname>Knoll</surname> <given-names>A.</given-names></name></person-group> (<year>2023b</year>). <article-title>Meta-reinforcement learning in non-stationary and dynamic environments</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>45</volume>, <fpage>3476</fpage>&#x02013;<lpage>3491</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2022.3185549</pub-id><pub-id pub-id-type="pmid">35737617</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>Z.</given-names></name> <name><surname>Zhou</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>R.</given-names></name> <name><surname>Su</surname> <given-names>X.</given-names></name> <name><surname>Morin</surname> <given-names>F. O.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2023c</year>). <article-title>Solving robotic manipulation with sparse reward reinforcement learning via graph-based diversity and proximity</article-title>. <source>IEEE Trans. Industr. Electron</source>. <volume>70</volume>, <fpage>2759</fpage>&#x02013;<lpage>2769</lpage>. <pub-id pub-id-type="doi">10.1109/TIE.2022.3172754</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Boyd</surname> <given-names>S.</given-names></name> <name><surname>Vandenberghe</surname> <given-names>L.</given-names></name></person-group> (<year>2004</year>). <source>Convex Optimization</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>H.</given-names></name> <name><surname>Chen</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Feng</surname> <given-names>Q.</given-names></name> <name><surname>Lin</surname> <given-names>J.</given-names></name> <name><surname>Knoll</surname> <given-names>A.</given-names></name></person-group> (<year>2023a</year>). <article-title>Efficient grasp detection network with gaussian-based grasp representation for robotic manipulation</article-title>. <source>IEEE/ASME Trans. Mechatr</source>. <volume>28</volume>, <fpage>1384</fpage>&#x02013;<lpage>1394</lpage>. <pub-id pub-id-type="doi">10.1109/TMECH.2022.3224314</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>H.</given-names></name> <name><surname>Chen</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Hu</surname> <given-names>Y.</given-names></name> <name><surname>Knoll</surname> <given-names>A.</given-names></name></person-group> (<year>2022</year>). <article-title>Neurograsp: Multimodal neural network with euler region regression for neuromorphic vision-based grasp pose estimation</article-title>. <source>IEEE Trans. Instrum. Meas</source>. <volume>71</volume>, <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1109/TIM.2022.3179469</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>H.</given-names></name> <name><surname>Qu</surname> <given-names>Z.</given-names></name> <name><surname>Chen</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Thiele</surname> <given-names>L.</given-names></name> <name><surname>Knoll</surname> <given-names>A.</given-names></name></person-group> (<year>2023b</year>). <article-title>Ghostvit: Expediting vision transformers via cheap operations</article-title>. <source>IEEE Trans. Artif. Intellig</source>. <volume>2023</volume>, <fpage>1</fpage>&#x02013;<lpage>9</lpage>. <pub-id pub-id-type="doi">10.1109/TAI.2023.3326795</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cheng</surname> <given-names>F. T.</given-names></name> <name><surname>Lu</surname> <given-names>Y. T.</given-names></name> <name><surname>Sun</surname> <given-names>Y. Y.</given-names></name></person-group> (<year>1998</year>). <article-title>Window-shaped obstacle avoidance for a redundant manipulator. <italic>IEEE Trans</italic>. Syst. <italic>Man</italic>. Cybern</article-title>. <source>B Cybern</source>. <volume>28</volume>, <fpage>806</fpage>. <pub-id pub-id-type="doi">10.1109/3477.735390</pub-id><pub-id pub-id-type="pmid">18255999</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colom&#x000E9;</surname> <given-names>A.</given-names></name> <name><surname>Torras</surname> <given-names>C.</given-names></name></person-group> (<year>2015</year>). <article-title>Closed-loop inverse kinematics for redundant robots: Comparative assessment and two enhancements</article-title>. <source>IEEE/ASME Trans. Mechatron</source>. <volume>20</volume>, <fpage>944</fpage>&#x02013;<lpage>955</lpage>. <pub-id pub-id-type="doi">10.1109/TMECH.2014.2326304</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Denavit</surname> <given-names>J.</given-names></name> <name><surname>Hartenberg</surname> <given-names>R. S.</given-names></name></person-group> (<year>1955</year>). <article-title>A kinematic notation for lower-pair mechanisms based on matrices</article-title>. <source>J. Appl, Mech</source>, <volume>22</volume>, <fpage>215</fpage>&#x02013;<lpage>221</lpage>. <pub-id pub-id-type="doi">10.1115/1.4011045</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dereli</surname> <given-names>S.</given-names></name> <name><surname>K&#x000F6;ker</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>Iw-pso approach to the inverse kinematics problem solution of a 7-dof serial robot manipulator</article-title>. <source>Sigma J. Eng. Natural Sci</source>. <volume>36</volume>, <fpage>77</fpage>&#x02013;<lpage>85</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dereli</surname> <given-names>S.</given-names></name> <name><surname>K&#x000F6;ker</surname> <given-names>R.</given-names></name></person-group> (<year>2020</year>). <article-title>A meta-heuristic proposal for inverse kinematics solution of 7-dof serial robotic manipulator: quantum behaved particle swarm algorithm</article-title>. <source>Artif. Intellig. Rev</source>. <volume>53</volume>, <fpage>949</fpage>&#x02013;<lpage>964</lpage>. <pub-id pub-id-type="doi">10.1007/s10462-019-09683-x</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Duguleana</surname> <given-names>M.</given-names></name> <name><surname>Barbuceanu</surname> <given-names>F. G.</given-names></name> <name><surname>Teirelbar</surname> <given-names>A.</given-names></name> <name><surname>Mogan</surname> <given-names>G.</given-names></name></person-group> (<year>2012</year>). <article-title>Obstacle avoidance of redundant manipulators using neural networks based reinforcement learning</article-title>. <source>Robot. Comp. Integrat. Manufact</source>. <volume>2</volume>, <fpage>28</fpage>. <pub-id pub-id-type="doi">10.1016/j.rcim.2011.07.004</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Fujimoto</surname> <given-names>S.</given-names></name> <name><surname>Hoof</surname> <given-names>H.</given-names></name> <name><surname>Meger</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Addressing function approximation error in actor-critic methods,&#x0201D;</article-title> in <source>Proceedings of the 35th International Conference on Machine Learning</source>, eds <person-group person-group-type="editor"><name><surname>Dy</surname> <given-names>J. G.</given-names></name> <name><surname>Krause</surname> <given-names>A.</given-names></name></person-group> (<publisher-loc>Stockholm</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>1582</fpage>&#x02013;<lpage>1591</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://proceedings.mlr.press/v80/fujimoto18a.html">http://proceedings.mlr.press/v80/fujimoto18a.html</ext-link></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>Z. Y.</given-names></name> <name><surname>Hsia</surname> <given-names>T. C.</given-names></name></person-group> (<year>1990</year>). <article-title>Joint trajectory generation for redundant robots in an environment with obstacles</article-title>. <source>IEEE Trans. Biomed. Eng</source>. <volume>35</volume>, <fpage>153</fpage>&#x02013;<lpage>60</lpage>. <pub-id pub-id-type="doi">10.1109/ROBOT.1990.125964</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Haarnoja</surname> <given-names>T.</given-names></name> <name><surname>Zhou</surname> <given-names>A.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,&#x0201D;</article-title> in <source>Proceedings of the 35th International Conference on Machine Learning</source> (<publisher-loc>Stockholm</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>1861</fpage>&#x02013;<lpage>1870</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Haviland</surname> <given-names>J.</given-names></name> <name><surname>Corke</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>Neo: a novel expeditious optimisation algorithm for reactive motion control of manipulators</article-title>. <source>IEEE Robot. Automat. Lett</source>. <volume>6</volume>, <fpage>1043</fpage>&#x02013;<lpage>1050</lpage>. <pub-id pub-id-type="doi">10.1109/LRA.2021.3056060</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1026</fpage>&#x02013;<lpage>1034</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>James</surname> <given-names>S.</given-names></name> <name><surname>Freese</surname> <given-names>M.</given-names></name> <name><surname>Davison</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Pyrep: Bringing v-rep to deep robot learning</article-title>. <source>arXiv</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1906.11176">http://arxiv.org/abs/1906.11176</ext-link></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>K&#x000F6;ker</surname> <given-names>R.</given-names></name> <name><surname>&#x000C7;akar</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>A neuro-genetic-simulated annealing approach to the inverse kinematics solution of robots: a simulation based study</article-title>. <source>Eng. Comput</source>. <volume>32</volume>, <fpage>553</fpage>&#x02013;<lpage>565</lpage>. <pub-id pub-id-type="doi">10.1007/s00366-015-0432-z</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Dong</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>A general framework of motion planning for redundant robot manipulator based on deep reinforcement learning</article-title>. <source>IEEE Trans. Indust. Informat</source>. <volume>8</volume>, <fpage>18</fpage>. <pub-id pub-id-type="doi">10.1109/TII.2021.3125447</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Martin</surname> <given-names>P.</given-names></name> <name><surname>Mill&#x000E1;n</surname> <given-names>J. R.</given-names></name></person-group> (<year>1997</year>). <article-title>&#x0201C;Combining reinforcement learning and differential inverse kinematics for collision-free motion of multilink manipulators,&#x0201D;</article-title> in <source>Biological and Artificial Computation: From Neuroscience to Technology, International Work-Conference on Artificial and Natural Neural Networks</source>, eds <person-group person-group-type="editor"><name><surname>Mira</surname> <given-names>J.</given-names></name> <name><surname>Moreno-D&#x000ED;az</surname> <given-names>R.</given-names></name> <name><surname>Cabestany</surname> <given-names>J.</given-names></name></person-group> (<publisher-loc>Canary Islands</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>1324</fpage>&#x02013;<lpage>1333</lpage>. <pub-id pub-id-type="doi">10.1007/BFb0032593</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Momani</surname> <given-names>S.</given-names></name> <name><surname>Abo-Hammour</surname> <given-names>Z. S.</given-names></name> <name><surname>Alsmadi</surname> <given-names>O. M.</given-names></name></person-group> (<year>2016</year>). <article-title>Solution of inverse kinematics problem using genetic algorithms</article-title>. <source>Appl. Mathem. Informat. Sci</source>. <volume>10</volume>, <fpage>225</fpage>. <pub-id pub-id-type="doi">10.18576/amis/100122</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nakamura</surname> <given-names>Y.</given-names></name> <name><surname>Hanafusa</surname> <given-names>H.</given-names></name></person-group> (<year>1986</year>). <article-title>Inverse kinematic solutions with singularity robustness for robot manipulator control</article-title>. <source>J. Dyn. Syst. Meas. Control</source> <volume>108</volume>, <fpage>163</fpage>&#x02013;<lpage>171</lpage>. <pub-id pub-id-type="doi">10.1115/1.3143764</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Paul</surname> <given-names>R.</given-names></name> <name><surname>Shimano</surname> <given-names>B.</given-names></name></person-group> (<year>1978</year>). <article-title>&#x0201C;Kinematic control equations for simple manipulators,&#x0201D;</article-title> in <source>IEEE Conference on Decision and Control including the 17th Symposium on Adaptive Processes</source> (<publisher-loc>San Diego, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1398</fpage>&#x02013;<lpage>1406</lpage>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rokbani</surname> <given-names>N.</given-names></name> <name><surname>Alimi</surname> <given-names>A. M.</given-names></name></person-group> (<year>2013</year>). <article-title>Inverse kinematics using particle swarm optimization, a statistical analysis</article-title>. <source>Procedia Eng</source>. <volume>64</volume>, <fpage>1602</fpage>&#x02013;<lpage>1611</lpage>. <pub-id pub-id-type="doi">10.1016/j.proeng.2013.09.242</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schappler</surname> <given-names>M.</given-names></name> <name><surname>Ortmaier</surname> <given-names>T.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Singularity avoidance of task-redundant robots in pointing tasks: on nullspace projection and cardan angles as orientation coordinates,&#x0201D;</article-title> in <source>Proceedings of the 18th International Conference on Informatics in Control, Automation and Robotics</source>, eds <person-group person-group-type="editor"><name><surname>Gusikhin</surname> <given-names>O.</given-names></name> <name><surname>Nijmeijer</surname> <given-names>H.</given-names></name> <name><surname>Madani</surname> <given-names>K.</given-names></name></person-group> (SCITEPRESS), <fpage>338</fpage>&#x02013;<lpage>349</lpage>. <pub-id pub-id-type="doi">10.5220/0010621103380349</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>2018</year>). <source>Reinforcement Learning: An Introduction</source>. MIT press. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.worldcat.org/oclc/37293240">https://www.worldcat.org/oclc/37293240</ext-link></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tsai</surname> <given-names>L. W.</given-names></name> <name><surname>Morgan</surname> <given-names>A. P.</given-names></name></person-group> (<year>1985</year>). <article-title>Solving the kinematics of the most general six- and five-degree-of-freedom manipulators by continuation methods</article-title>. <source>J. Mech. Design</source> <volume>107</volume>, <fpage>189</fpage>&#x02013;<lpage>200</lpage>. <pub-id pub-id-type="doi">10.1115/1.3258708</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wolovich</surname> <given-names>W.</given-names></name> <name><surname>Elliott</surname> <given-names>H.</given-names></name></person-group> (<year>1984</year>). <article-title>&#x0201C;A computational technique for inverse kinematics,&#x0201D;</article-title> in The <italic>23rd IEEE Conference on Decision and Control</italic> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1359</fpage>&#x02013;<lpage>1363</lpage>. <pub-id pub-id-type="doi">10.1109/CDC.1984.272258</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zaplana</surname> <given-names>I.</given-names></name> <name><surname>Basanez</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>A novel closed-form solution for the inverse kinematics of redundant manipulators through workspace analysis</article-title>. <source>Mech. Mach. Theory</source> <volume>121</volume>, <fpage>829</fpage>&#x02013;<lpage>843</lpage>. <pub-id pub-id-type="doi">10.1016/j.mechmachtheory.2017.12.005</pub-id></citation>
</ref>
</ref-list>
</back>
</article> 