<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2022.1082550</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A multi-scale robotic tool grasping method for robot state segmentation masks</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Xue</surname> <given-names>Tao</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1975552/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zheng</surname> <given-names>Deshuai</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Yan</surname> <given-names>Jin</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1727282/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Liu</surname> <given-names>Yong</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
</contrib>
</contrib-group>
<aff><institution>School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing</institution>, <addr-line>Jiangsu</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Praveen Kumar Donta, Vienna University of Technology, Austria</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Rambabu Pemula, Raghu Engineering College, India; P. Suman Prakash, G. Pullaiah College of Engineering and Technology (GPCET), India</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Yong Liu &#x02709; <email>liuy1602&#x00040;njust.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>01</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>1082550</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>10</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>19</day>
<month>12</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Xue, Zheng, Yan and Liu.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Xue, Zheng, Yan and Liu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract>
<p>As robots begin to collaborate with humans in their daily work spaces, they need a deeper understanding of tasks that involve using tools. To address tool use in human-robot collaboration, we propose a modular system based on collaborative tasks. The first part of the system finds task-related operating areas: a multi-layer instance segmentation network finds the tools needed for the task and classifies the object itself based on the state of the robot in the collaborative task. In this way, we generate the state semantic region with the &#x0201C;leader-assistant&#x0201D; state. In the second part, in order to predict the optimal grasp and handover configuration, we propose a multi-scale grasping network (MGR-Net) based on the mask of the state semantic region, which adapts better to the change of the receptive field caused by the state semantic region. Compared with traditional methods, our method achieves higher accuracy. The whole system also achieves good results on an untrained real-world tool dataset that we constructed. To further verify the effectiveness of our generated grasp representations, a robot platform based on Sawyer is used to demonstrate the performance of our system.</p></abstract>
<kwd-group>
<kwd>human-robot collaboration</kwd>
<kwd>instance segmentation</kwd>
<kwd>robotic grasping</kwd>
<kwd>grasp detection</kwd>
<kwd>robotic grasp platform</kwd>
</kwd-group>
<counts>
<fig-count count="7"/>
<table-count count="3"/>
<equation-count count="3"/>
<ref-count count="41"/>
<page-count count="12"/>
<word-count count="6301"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>As population aging becomes increasingly serious, providing effective homecare for the growing elderly population poses new challenges, and the COVID-19 epidemic has made the need for homecare for the elderly extremely urgent. To use tools correctly and safely, humans effortlessly draw on their understanding of the functions that tools and their parts provide. Using vision, we can identify the function of each part and thus find the right tool part for an operation. As robots such as PR2, Asimo, and Baxter begin to collaborate with humans in the homecare industry, they will also need a more detailed understanding of the tools involved in a task, so that the elderly are protected from incorrect or unsafe tool use.</p>
<p>When completing tasks through human-robot collaboration, robots are designed to assist humans rather than to perform all tasks autonomously. There are two reasons for this. First, the level of knowledge and the training required for robots to complete tasks on their own are difficult to establish and collect. Second, despite significant progress in robotics, for example in manipulation (Kroemer et al., <xref ref-type="bibr" rid="B19">2015</xref>; Fu et al., <xref ref-type="bibr" rid="B13">2016</xref>), robots are still far from having the fine manipulation capabilities required for tasks such as furniture assembly (for example, using a screwdriver on small screws). Therefore, we want the robot to choose the behaviors that suit it while leaving to the human worker the actions better suited to humans. For example, robots may provide supporting or delivery behaviors, such as stabilizing components or bringing heavy components required for assembly (Mangin et al., <xref ref-type="bibr" rid="B29">2017</xref>), while human workers perform operations that require more adaptability, such as driving screws. Consequently, when various tools are used through human-robot collaboration, a critical issue is how to understand the task requirements and map them to the different states in which robots and humans grasp tools.</p>
<p>Brahmbhatt et al. (<xref ref-type="bibr" rid="B2">2019</xref>) used a thermal camera to study human grasping contacts on 50 household objects textured with contact maps for two tasks. Fang et al. (<xref ref-type="bibr" rid="B11">2020</xref>) developed a learning-based approach for task-oriented grasping in simulation with reinforcement learning. Liu et al. (<xref ref-type="bibr" rid="B26">2020</xref>) considered a broad sense of context and proposed a data-driven approach to learn suitable semantic grasps. These methods address the understanding of task requirements for grasping tools through pixel-level segmentation of a small group of known object categories (Do et al., <xref ref-type="bibr" rid="B8">2018</xref>). However, for collaborative tasks, they do not consider that different states lead to different tool grasping representations. To enable the understanding of tools according to the different states defined for robots, we constructed a tool classification dataset for analyzing the different states that robots assume when grasping various tools.</p>
<p>We recruited volunteers to take on different states when grasping the tools in the dataset, recorded the grasping areas corresponding to each state, and counted these positions. Borrowing the idea of region classification, we propose the state semantic (grasp and handover) region: different states often lead people to grasp different positions on a tool. Based on this region, we define two types of robot states, active operator and assisting passer, corresponding to the semantics &#x0201C;grasp&#x0201D; and &#x0201C;handover.&#x0201D;</p>
<p>The main contributions of our work are the following four points:</p>
<list list-type="order">
<list-item><p>We propose a modular system for multi-state tool grasping tasks under human-robot interaction, which realizes task-based collaborative grasping and interaction between humans and robots.</p></list-item>
<list-item><p>A multi-layer instance segmentation network is proposed to complete the classification of operating areas for task-related tools. Therefore, in different tasks, we can find the most suitable grasping area for humans or robots in different states.</p></list-item>
<list-item><p>Considering that grasping based on the local semantic region of the tool will increase the variation range of the receptive field, we propose a multi-scale grasping convolutional network MGR-Net based on state semantics to improve the prediction accuracy of the network.</p></list-item>
<list-item><p>We collected real-world tool images with a RealSense camera as a test set, and the experimental results show that our method performs well on untrained real-world tool images. Furthermore, we used a robotic platform based on Sawyer to validate our grasping representation.</p></list-item>
</list>
<p>The remainder of this article is organized as follows. In Section 2, we briefly review related literature. In Section 3, we detail the proposed grasping framework based on the state semantic area. In Section 4, we present our experimental results. Finally, we conclude this work in Section 5.</p>
</sec>
<sec id="s2">
<title>2. Related work</title>
<p>Learning to use an item as a tool requires an understanding of what it helps to achieve, the properties of the tool that make it useful, and how the tool must be manipulated in order to achieve the goal. In order to further meet the operational requirements of our robots based on different states, the tool grasping tasks under different states can be divided into the following three aspects:</p>
<list list-type="order">
<list-item><p>Detection of tools related to different tasks.</p></list-item>
<list-item><p>Research on the properties of the tool itself.</p></list-item>
<list-item><p>Robotic grasping detection of tools.</p></list-item>
</list>
<sec>
<title>2.1. Task-related tool detection</title>
<p>The earliest task classification work mostly aimed to find the task object among multiple objects. With the power of machine learning in classification, researchers found that grasp detection for novel objects can be cast as a binary classification problem: graspable or ungraspable. SVMs have been widely used in grasp feature classification (Fischinger et al., <xref ref-type="bibr" rid="B12">2015</xref>; Ten Pas and Platt, <xref ref-type="bibr" rid="B35">2018</xref>). Ten Pas and Platt (<xref ref-type="bibr" rid="B35">2018</xref>) used knowledge of the geometry of a good grasp to improve detection. By sampling a large number of hand configurations as input features, they used the notion of an antipodal grasp to classify these grasp hypotheses. Deep learning methods have also been applied to grasp detection. Lenz et al. (<xref ref-type="bibr" rid="B23">2015</xref>) presented a two-step cascaded system with two deep networks that ran at 13.5 s per frame with an accuracy of 93.7%.</p>
<p>To better identify task-related tools among multiple types of tools and avoid interference from irrelevant tools, instance segmentation methods have been introduced to achieve more accurate tool detection. Top-down methods (He et al., <xref ref-type="bibr" rid="B17">2017</xref>; Chen et al., <xref ref-type="bibr" rid="B3">2020</xref>) approach the problem from the perspective of object detection: they first detect an object and then segment it within the bounding box. Recently, anchor-free object detectors have also been used with good results (Tian et al., <xref ref-type="bibr" rid="B36">2019</xref>). Bottom-up methods (Liu et al., <xref ref-type="bibr" rid="B25">2017</xref>; Gao et al., <xref ref-type="bibr" rid="B14">2019</xref>) view the task as a label-then-cluster problem. These methods learn per-pixel embeddings and then cluster them into groups. The more recent direct method SOLO (Wang et al., <xref ref-type="bibr" rid="B37">2020a</xref>) no longer relies on box detection or embedding learning and deals with instance segmentation directly. Wang et al. (<xref ref-type="bibr" rid="B38">2020b</xref>) build on the basic concept of SOLO and further explore direct instance segmentation solutions.</p>
</sec>
<sec>
<title>2.2. Tool attribute classification</title>
<p>The above methods can identify objects of known classes very well. However, in the case of using a spoon, the robot needs to know which part of the spoon to grasp and which part to hold the soup. Work on grasp affordances tends to focus on robust interactions between objects and the autonomous agent. It is typically limited to a single affordance per object. Moreover, affordance labels tend to be assigned arbitrarily instead of through data-driven techniques to collect human-acceptable interactions about grasping. Kr&#x000FC;ger et al. (<xref ref-type="bibr" rid="B20">2011</xref>) focus on relating abstractions of sensory-motor processes with object structures [e.g., object-action complexes (OACs)], and capture the interaction between objects and associated actions through an object affordance. Others use purely visual input to learn affordances to relate objects and actions through deep learning or supervised learning techniques (Hart et al., <xref ref-type="bibr" rid="B16">2015</xref>). Chu et al. (<xref ref-type="bibr" rid="B5">2019</xref>) presented a novel framework to predict the affordance of objects <italic>via</italic> semantic segmentation.</p>
<p>In the interactive use of tools, robots need not only to find the task-related tools and operating areas but also to determine their own state at that moment, that is, whether they are the &#x0201C;leader&#x0201D; or the &#x0201C;assistant&#x0201D; of the task. Previous classifications of tool attributes are not sufficient for this goal because they only consider the case where the robot is a single operator. To solve this problem, based on the attributes generated by the classification of tool functions, we focus on the grasping operation during interactive tasks. Through data-driven techniques, the functional attributes of the tool are combined with the state of the robot to find the optimal grasping area of the tool for the robot in each state.</p>
</sec>
<sec>
<title>2.3. Robotic grasping detection</title>
<p>Deep learning has been a hot research topic since its success on ImageNet and the adoption of GPUs and other fast computational techniques. In addition, the availability of affordable RGB-D sensors has enabled the use of deep learning techniques to learn object features directly from image data. Recent experiments with deep neural networks (Schmidt et al., <xref ref-type="bibr" rid="B33">2018</xref>; Zeng et al., <xref ref-type="bibr" rid="B39">2018</xref>) showed that they are efficient at computing stable grasp configurations. Guo et al. (<xref ref-type="bibr" rid="B15">2017</xref>) fused tactile and visual data to train hybrid deep architectures. Mahler et al. (<xref ref-type="bibr" rid="B28">2017</xref>) trained a Grasp Quality Convolutional Neural Network (GQ-CNN) with only synthetic data from the Dex-Net 2.0 grasp planner dataset. Levine et al. (<xref ref-type="bibr" rid="B24">2018</xref>) presented a method for learning hand-eye coordination for robotic grasping from monocular images. They used a CNN for grasp success prediction, and a continuous servoing mechanism used this network to continuously control the manipulator. Antanas et al. (<xref ref-type="bibr" rid="B1">2019</xref>) proposed a probabilistic logic framework intended to improve the grasping capability of a robot with the help of semantic object parts. This framework combines high-level reasoning with low-level grasping. The high-level reasoning leverages symbolic world knowledge comprising object-task affordances, categories, and task-based information, while the low-level reasoning depends on visual shape features.</p>
<p>Most of these grasp synthesis approaches represent the grasp as an oriented rectangle in the image (Dong et al., <xref ref-type="bibr" rid="B10">2021</xref>). Kumra et al. (<xref ref-type="bibr" rid="B21">2020</xref>) used an improved grasp representation, complemented by a novel convolutional network, which improves the accuracy of robotic grasping. Depierre et al. (<xref ref-type="bibr" rid="B7">2021</xref>) introduced a new loss function that associates the regression of the grasp parameters with the graspability score. Dong et al. (<xref ref-type="bibr" rid="B9">2022</xref>) used a transformer network as an encoder to obtain global context information. Shukla et al. (<xref ref-type="bibr" rid="B34">2022</xref>) proposed the GI-NNet model based on the inception module; it achieves better results on limited datasets but adapts less well to large datasets. These grasping methods tend to focus on the tool itself and ignore the impact of different tasks on grasping. In human-robot interaction tasks in particular, different states prompt the robot to grasp different parts of the tool. To solve the problem of robot grasping under human-robot interaction, we modified the grasping representation of the tool based on its different state semantic regions, and we improve the accuracy of grasp detection with an improved grasping neural network.</p>
</sec>
</sec>
<sec id="s3">
<title>3. Method</title>
<p>In this human-robot collaboration work, we consider the operating area of the tool when people are in the two different states of leader and assistant, and we let our network learn this selection rule so that, whether the robot assists the human or operates under the guidance of the human, it can find the task-relevant position as reliably as possible. To study robot grasp detection under different states, we propose the state semantic region classification and grasp detection framework for collaborative tasks shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Our MGR-Net based on state semantic regions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082550-g0001.tif"/>
</fig>
<p>Our grasp detection network consists of two parts. The first finds the task-related state semantic region of the object. The second finds the most suitable grasp configuration for robots or humans based on the different state semantic regions.</p>
<sec>
<title>3.1. Grasp representation</title>
<p>In this work, we define the robot grasp detection problem as predicting grasps for unknown objects from an n-channel image of the scene, assigning states according to the provided task description, and then executing the corresponding grasp on the robot. Instead of the five-dimensional grasp representation used in Kumra and Kanan (<xref ref-type="bibr" rid="B22">2017</xref>), we use an improved version similar to the grasp representation proposed by Morrison et al. (<xref ref-type="bibr" rid="B30">2020</xref>). Because the optimal grasp configuration of the robot changes between states, we incorporate the state semantic area as an attribute into the robot frame and express the grasp pose as:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>G</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>,</mml:mo><mml:mi>W</mml:mi><mml:mo>,</mml:mo><mml:mi>Q</mml:mi><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, <italic>P</italic> &#x0003D; (<italic>x, y, z</italic>) is the center position of the tool tip, &#x003B8; is the rotation of the tool around the z-axis, <italic>W</italic> is the required width of the tool, <italic>R</italic><sub><italic>s</italic></sub> represents the state semantic area, and <italic>Q</italic>|<italic>R</italic><sub><italic>s</italic></sub> represents the grasp score of the corresponding state area.</p>
<p>The grasp quality score <italic>Q</italic> is the grasp quality at each point in the image, expressed as a value between 0 and 1, with values closer to 1 indicating a greater chance of a successful grasp. &#x003B8; is the angular rotation required at each point to grasp the object of interest, expressed as a value in the range [<inline-formula><mml:math id="M2"><mml:mfrac><mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula>, <inline-formula><mml:math id="M3"><mml:mfrac><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula>]. <italic>W</italic> is the required width, represented as a measure of uniform depth and expressed as a value in the range [0, <italic>W</italic><sub><italic>max</italic></sub>] pixels, where <italic>W</italic><sub><italic>max</italic></sub> is the maximum width of the gripper.</p>
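<p>The representation in Equation (1) can be pictured as a small record type. The following Python sketch is an illustration only, not the authors&#x00027; implementation: the names <monospace>StateGrasp</monospace> and <monospace>wrap_angle</monospace>, the field names, the region labels, and the default <monospace>max_width</monospace> of 150 pixels are all assumptions made for the example. It encodes only the value ranges stated above.</p>
<preformat><![CDATA[
```python
import math
from dataclasses import dataclass


def wrap_angle(theta):
    """Map an angle in radians into [-pi/2, pi/2).

    Assumption: the gripper is a symmetric two-finger gripper, so a grasp
    rotated by pi is the same grasp.
    """
    return (theta + math.pi / 2) % math.pi - math.pi / 2


@dataclass(frozen=True)
class StateGrasp:
    """One grasp G = (P, theta, W, Q | R_s); all names are hypothetical."""
    x: float                  # P: position components
    y: float
    z: float
    theta: float              # rotation around the z-axis, in radians
    width: float              # W: required gripper width, in pixels
    quality: float            # Q | R_s: grasp score inside the region
    region: str               # R_s: "grasp" (leader) or "handover" (assistant)
    max_width: float = 150.0  # W_max: assumed value, not given in the text

    def __post_init__(self):
        # Reject values outside the ranges given for Q, theta and W.
        if self.region not in ("grasp", "handover"):
            raise ValueError("unknown state semantic region")
        if not 0.0 <= self.quality <= 1.0:
            raise ValueError("quality must lie in [0, 1]")
        if not -math.pi / 2 <= self.theta <= math.pi / 2:
            raise ValueError("theta must lie in [-pi/2, pi/2]")
        if not 0.0 <= self.width <= self.max_width:
            raise ValueError("width must lie in [0, max_width]")
```
]]></preformat>
<p>Under these assumptions, a handover grasp could be created with <monospace>StateGrasp(x=10, y=20, z=0.4, theta=wrap_angle(math.pi), width=40, quality=0.9, region="handover")</monospace>; values outside the stated ranges raise <monospace>ValueError</monospace>.</p>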
</sec>
<sec>
<title>3.2. Grasp detection network</title>
<sec>
<title>3.2.1. State semantic region</title>
<p>We input the image <italic>F</italic><sub><italic>overall</italic></sub> to the first layer, the tool segmentation network. From the generated mask, we construct the input image <italic>F</italic><sub><italic>part</italic></sub> of the second layer, the state semantic segmentation network. Based on the state that the robot assumes in the task, the second layer finally generates semantic regions related to the robot state. The tool datasets are described in Section 4.1. The modules in the segmentation layer are shown in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
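<p>The text does not specify how the second-layer input is built from the first-layer mask. One plausible construction, sketched below purely as an assumption, keeps the pixels inside the tool mask, fills the rest with a background value, and crops to the bounding box of the mask. The function names <monospace>mask_bounding_box</monospace> and <monospace>extract_part</monospace> are hypothetical, and images are plain nested lists to keep the example self-contained.</p>
<preformat><![CDATA[
```python
def mask_bounding_box(mask):
    """Return (top, bottom, left, right) of the non-zero area, or None if empty.

    bottom and right are exclusive, so they can be used directly in range().
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def extract_part(image, mask, background=0):
    """Build a cropped part image from an overall image and a binary tool mask.

    Assumption: pixels outside the mask are replaced by `background` and the
    result is cropped to the bounding box of the mask.
    """
    box = mask_bounding_box(mask)
    if box is None:
        return []
    top, bottom, left, right = box
    return [
        [image[r][c] if mask[r][c] else background for c in range(left, right)]
        for r in range(top, bottom)
    ]
```
]]></preformat>
<p>With a 3 &#x000D7; 4 image and a mask covering three pixels in its lower middle, <monospace>extract_part</monospace> returns a 2 &#x000D7; 2 crop in which the unmasked corner is set to the background value.</p>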
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>State semantic region segmentation.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082550-g0002.tif"/>
</fig>
<p>The two segmentation layers are designed for different purposes. The first layer, the overall segmentation layer, finds the mask of the task-related object in a multi-object environment. It includes two branches: (1) the Category Branch predicts the semantic category of the object, and (2) the Mask Branch predicts the mask region of the object. The second layer further divides the task object based on the state to obtain the state semantic areas of the object. The state semantic areas in this layer are mainly the &#x0201C;grasp&#x0201D; area for the leader state and the &#x0201C;handover&#x0201D; area for the assistant state. This layer differs from the first layer as follows: (1) the Category Branch predicts the state semantic category of the task area of the object, and (2) the Mask Branch predicts the mask of the semantic area for each state of the object. Each layer uses an FPN behind the backbone network to cope with variation in object size. After each FPN level, the two parallel branches above are attached to predict category and position. The number of grids in each branch differs correspondingly, with smaller instances corresponding to more grids.</p>
<p>The Category Branch predicts the semantic category of each task area of the object. For an S&#x000D7;S grid, its output is S&#x000D7;S&#x000D7;C. The Mask Branch is decomposed into a mask kernel branch and a mask feature branch, which learn the convolution kernel and the features, respectively. The outputs of the two branches are finally combined into the output of the entire Mask Branch. For each grid cell, the kernel branch predicts a D-dimensional output that represents the predicted weights of the convolution kernel, where D is the number of parameters. For an S&#x000D7;S grid, the output is therefore S&#x000D7;S&#x000D7;D. The mask feature branch learns the feature representation. Its input is the features of different levels extracted by backbone&#x0002B;FPN, and its output is the mask feature of size H&#x000D7;W&#x000D7;E, denoted by F.</p>
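<p>To make the combination of the two mask sub-branches concrete, the sketch below applies the kernel predicted for one grid cell to the mask feature F. It is a simplified illustration under stated assumptions rather than the authors&#x00027; code: the kernel is taken to be 1 &#x000D7; 1 (so that D equals E), a sigmoid followed by a 0.5 threshold produces the binary mask, and the name <monospace>dynamic_mask</monospace> is hypothetical.</p>
<preformat><![CDATA[
```python
import math


def dynamic_mask(features, kernel, threshold=0.5):
    """Apply one predicted 1x1 kernel to an H x W x E feature map.

    features: nested lists of shape H x W x E (the mask feature F).
    kernel:   list of E weights predicted for a single grid cell
              (assumption: a 1x1 convolution, hence D == E).
    Returns an H x W binary mask as nested lists.
    """
    mask = []
    for row in features:
        mask_row = []
        for pixel in row:
            if len(pixel) != len(kernel):
                raise ValueError("kernel length must equal feature depth E")
            # A 1x1 convolution is a dot product at every pixel.
            logit = sum(f * w for f, w in zip(pixel, kernel))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask
```
]]></preformat>
<p>Repeating this for each of the S&#x000D7;S kernels would yield one candidate mask per grid cell, which is the role the text assigns to the combined Mask Branch output.</p>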
</sec>
<sec>
<title>3.2.2. Grasp detection</title>
<p>The feature output is similar to that of Kumra et al. (<xref ref-type="bibr" rid="B21">2020</xref>) and likewise contains three prediction maps (<italic>Q</italic>|<italic>R</italic>, angle, width) that represent the grasp pose, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. The difference is that, because our grasp pose includes the state semantic region, our grasp score is also closely tied to that region.</p>
<p>The input image and the state semantic region mask corresponding to the task are fed together into the convolutional layer, which consists of a conv2d layer, a batch normalization (BN) layer, and a ReLU layer. The output of the convolutional layer is fed to three GR-Block layers (C1&#x02013;C3); each of the first two GR-Block layers contains a Block and a downsampling module, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. We designed this Block following Liu et al. (<xref ref-type="bibr" rid="B27">2022</xref>). Each Block uses three conv2d layers with different kernels, and Layer Norm replaces Batch Norm for better performance. Because we focus on the semantic area on the object rather than the object itself, changes in object size increase the difficulty of detection. We therefore use three Blocks of different sizes to obtain different receptive fields and improve detection accuracy. A downsampling module connects two Blocks of different sizes, as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. After that, to more easily interpret and preserve the spatial characteristics of the image after the convolution operations, we use five deconvolutional layers to upsample the feature maps, so that the output has the same size as the input. The grasp representation is generated as the network output from the deconvolutional layers.</p>
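<p>Because the output maps have the same size as the input, a grasp can be read off pixel by pixel. The sketch below shows one way to do this, assuming that the selected grasp is the pixel with the highest quality score inside the state semantic region; this selection rule and the name <monospace>decode_best_grasp</monospace> are illustrative assumptions, not details given in the text.</p>
<preformat><![CDATA[
```python
def decode_best_grasp(quality, angle, width, region_mask):
    """Pick the highest-quality pixel inside the state semantic region.

    quality, angle, width: H x W nested lists (the three prediction maps).
    region_mask:           H x W nested list, non-zero inside the region R_s.
    Returns a dict describing the grasp, or None if the region is empty.
    """
    best = None  # (score, row, col) of the best pixel seen so far
    for r, row in enumerate(quality):
        for c, score in enumerate(row):
            if region_mask[r][c] and (best is None or score > best[0]):
                best = (score, r, c)
    if best is None:
        return None
    score, r, c = best
    return {
        "row": r,
        "col": c,
        "quality": score,
        "angle": angle[r][c],
        "width": width[r][c],
    }
```
]]></preformat>
<p>Restricting the search to the region mask means that a high score outside the &#x0201C;grasp&#x0201D; or &#x0201C;handover&#x0201D; area is ignored, which reflects the conditioning of <italic>Q</italic> on <italic>R</italic><sub><italic>s</italic></sub>.</p>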
</sec>
<sec>
<title>3.2.3. Loss function</title>
<p>For each input image <italic>p</italic>, combined with the local attribute region image <italic>p</italic><sub><italic>k</italic></sub> generated by its different state semantic regions <italic>M</italic>, our grasping network is optimized by the following loss function:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">^</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>s</italic><sub><italic>i</italic></sub> is given by:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M5"><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable columnalign="left left"><mml:mtr><mml:mtd><mml:mn>0.5</mml:mn><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">&#x0005E;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:msub><mml:mi>G</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mtext>if&#x000A0;</mml:mtext><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">&#x0005E;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:msub><mml:mi>G</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0007C;</mml:mo></mml:mrow><mml:mo>&#x0003C;</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mover accent="true"><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">&#x0005E;</mml:mo></mml:mover><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x02212;</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:msub><mml:mi>G</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0007C;</mml:mo></mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mn>0.5</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mtext>otherwise</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:math></disp-formula>
<p><italic>G</italic><sub><italic>k</italic></sub> is the grasp generated by the network corresponding to <italic>p</italic><sub><italic>k</italic></sub> and <inline-formula><mml:math id="M6"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>G</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy='false'>^</mml:mo></mml:mover></mml:math></inline-formula> is the ground truth grasp.</p>
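<p>Equation (3) has the same piecewise form as the smooth L1 (Huber-type) penalty: quadratic when the absolute error is below 1 and linear otherwise. As a minimal illustrative sketch, not the implementation used in the experiments, the per-element term can be written in plain Python as follows. The function names and the use of flat sequences are assumptions made for this example; <monospace>target</monospace> plays the role of the ground truth value and <monospace>prediction</monospace> the role of the term it is compared with in Equation (3).</p>
<preformat preformat-type="code"><![CDATA[
```python
def smooth_l1_term(target, prediction):
    """Per-element penalty s_i of Eq. (3): quadratic for small errors, linear otherwise."""
    diff = abs(target - prediction)
    if diff < 1.0:
        return 0.5 * diff ** 2
    return diff - 0.5


def smooth_l1_terms(targets, predictions):
    """Apply smooth_l1_term element by element to two equally long sequences."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    return [smooth_l1_term(t, p) for t, p in zip(targets, predictions)]
```
]]></preformat>
<p>The two branches meet at an absolute error of 1, where both evaluate to 0.5, so the penalty is continuous while growing only linearly for large errors.</p>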
</sec>
</sec>
</sec>
<sec id="s4">
<title>4. Experiment</title>
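<p>We implemented our detection network in PyTorch. All experiments were run on a computer with an Intel Core i7-8700 CPU and an NVIDIA GeForce RTX 2080 Ti GPU. The experiments fall into three parts: construction of the tool dataset, classification of task areas, and grasp detection, including its evaluation metric and a comparison with other approaches.</p>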
<sec>
<title>4.1. Dataset</title>
<p>To provide the image input required by our network, we constructed a dataset of collaboration tools. We selected 6,000 tool images from the IIT-AFF Dataset (Nguyen et al., <xref ref-type="bibr" rid="B32">2017</xref>), the UMD Dataset (Myers et al., <xref ref-type="bibr" rid="B31">2014</xref>), the Cornell Grasp Dataset, and the Jacquard Grasping Dataset (Depierre et al., <xref ref-type="bibr" rid="B6">2018</xref>), and resized all images to the same size. The tool dataset is used to train and evaluate two networks. The first classifies the task areas of an object; for this network, 90% of the images are used as the training set and the remaining 10% as the test set. The second performs tool grasp detection based on the robot&#x00027;s state; its training set consists of 4,000 images from the Jacquard part of the tool dataset, and the remaining Jacquard images, together with the images from the other source datasets, form the test set. The extended version of the Cornell Grasp Dataset comprises 1,035 RGB-D images of 240 different real objects at a resolution of 640 &#x000D7; 480 pixels, with 5,110 positive and 2,909 negative grasps. The annotated ground truth consists of several grasp rectangles per object, each representing a possible grasp. The Jacquard Grasping Dataset is built on a subset of ShapeNet, a large dataset of CAD models. It contains 54k RGB-D images with annotations of successful grasp positions obtained from grasp attempts in a simulated environment, for a total of 1.1M grasp examples.</p>
</sec>
<sec>
<title>4.2. Task area</title>
<p>In this section, we discuss the results of semantic region classification. The robot is assigned a state according to the task, and each state yields a more specific functional-area classification of the tool. As shown in <xref ref-type="fig" rid="F3">Figure 3</xref>, when the robot acts as the &#x0201C;leader,&#x0201D; the tools are classified according to their affordances. This classification lets the robot grasp more accurately and avoids damage to the object or the gripper caused by a wrong grasping position. When the robot acts as an &#x0201C;assistant,&#x0201D; the human is expected to take the tool at the position best suited for grasping. The robot therefore needs to avoid this grasping area as far as possible and find another area suitable for holding the tool during the handover. When the tool is delivered in this way, the human can grasp it efficiently and safely. For example, when passing scissors, this classification reduces the risk that a careless person is accidentally injured by the scissors.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Segmentation results for the &#x0201C;leader&#x0201D; and &#x0201C;assistant&#x0201D; states.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082550-g0003.tif"/>
</fig>
<p>To further test the effectiveness of our two-layer segmentation network, we compare it with other methods on the IIT-AFF Dataset, as shown in <xref ref-type="table" rid="T1">Table 1</xref>. Grasp&#x00023;2 and Handover&#x00023;2 denote the classification results when the robot is the &#x0201C;assistant.&#x0201D; Our network achieves the highest score in eight of the ten categories; RAN-ResNet50 scores higher only on Cut (78.04 vs. 72.80) and Support (80.12 vs. 78.90).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Performance on IIT-AFF dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left" style="background-color:#919497; color:#ffffff;"><bold>Category</bold></th>
<th valign="top" align="center" style="background-color:#919497; color:#ffffff;"><bold>DeepLab (Chen et al.</bold>, <xref ref-type="bibr" rid="B4"><bold>2017</bold></xref><bold>)</bold></th>
<th valign="top" align="center" style="background-color:#919497; color:#ffffff;"><bold>AffordanceNet (Do et al.</bold>, <xref ref-type="bibr" rid="B8"><bold>2018</bold></xref><bold>)</bold></th>
<th valign="top" align="center" style="background-color:#919497; color:#ffffff;"><bold>RAN-ResNet50 (Zhao et al.</bold>, <xref ref-type="bibr" rid="B40"><bold>2020</bold></xref><bold>)</bold></th>
<th valign="top" align="center" style="background-color:#919497; color:#ffffff;"><bold>Our method</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Contain</td>
<td valign="top" align="center">68.84</td>
<td valign="top" align="center">79.61</td>
<td valign="top" align="center">80.20</td>
<td valign="top" align="center">87.10</td>
</tr> <tr>
<td valign="top" align="left">Cut</td>
<td valign="top" align="center">55.23</td>
<td valign="top" align="center">75.68</td>
<td valign="top" align="center">78.04</td>
<td valign="top" align="center">72.80</td>
</tr> <tr>
<td valign="top" align="left">Display</td>
<td valign="top" align="center">61.00</td>
<td valign="top" align="center">77.81</td>
<td valign="top" align="center">79.14</td>
<td valign="top" align="center">91.20</td>
</tr> <tr>
<td valign="top" align="left">Engine</td>
<td valign="top" align="center">63.05</td>
<td valign="top" align="center">77.50</td>
<td valign="top" align="center">81.22</td>
<td valign="top" align="center">85.50</td>
</tr> <tr>
<td valign="top" align="left">Grasp&#x00023;1</td>
<td valign="top" align="center">54.31</td>
<td valign="top" align="center">68.48</td>
<td valign="top" align="center">71.59</td>
<td valign="top" align="center">82.60</td>
</tr> <tr>
<td valign="top" align="left">Hit</td>
<td valign="top" align="center">58.43</td>
<td valign="top" align="center">70.75</td>
<td valign="top" align="center">88.52</td>
<td valign="top" align="center">91.00</td>
</tr> <tr>
<td valign="top" align="left">Pound</td>
<td valign="top" align="center">54.25</td>
<td valign="top" align="center">69.57</td>
<td valign="top" align="center">76.91</td>
<td valign="top" align="center">81.90</td>
</tr> <tr>
<td valign="top" align="left">Support</td>
<td valign="top" align="center">54.28</td>
<td valign="top" align="center">69.81</td>
<td valign="top" align="center">80.12</td>
<td valign="top" align="center">78.90</td>
</tr> <tr>
<td valign="top" align="left">Grasp&#x00023;2</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">79.27</td>
<td valign="top" align="center">88.86</td>
</tr> <tr>
<td valign="top" align="left">Handover&#x00023;2</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">77.96</td>
<td valign="top" align="center">80.08</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>4.3. Grasp detection metric</title>
<p>To compare our results with those of previous work, we adopt the rectangle metric of Jiang et al. (<xref ref-type="bibr" rid="B18">2011</xref>) with some modifications. Since our grasps target a smaller task area, we evaluate the intersection over union (IoU) between the predicted grasp rectangle and the ground truth grasp rectangle at two thresholds: (1) an IoU &#x0003E;25% for rough grasping, and (2) an IoU &#x0003E;50% for stable and accurate grasping. In both cases, the offset between the orientation of the predicted grasp rectangle and that of the ground truth rectangle must be &#x0003C;30&#x000B0;.</p>
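<p>The two-threshold rectangle metric can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the evaluation code used in the experiments: a grasp rectangle is assumed to be given as (center x, center y, width, height, angle in degrees), orientations that differ by 180&#x000B0; are treated as identical, and all function names are hypothetical. The overlap of the two rotated rectangles is computed with Sutherland-Hodgman polygon clipping.</p>
<preformat preformat-type="code"><![CDATA[
```python
import math


def rect_corners(cx, cy, w, h, theta_deg):
    """Corners of a rotated rectangle, listed counter-clockwise."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]


def polygon_area(poly):
    """Shoelace formula; returns the absolute area."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def clip_polygon(subject, clipper):
    """Sutherland-Hodgman clipping of subject by a convex, counter-clockwise clipper."""
    output = list(subject)
    for i in range(len(clipper)):
        if not output:
            break
        ax, ay = clipper[i]
        bx, by = clipper[(i + 1) % len(clipper)]

        def side(p):
            # Positive when p lies to the left of the directed edge a -> b (inside).
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        current, output = output, []
        for j in range(len(current)):
            p, q = current[j], current[(j + 1) % len(current)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                # The segment p -> q crosses the clipping edge: add the crossing point.
                r = sp / (sp - sq)
                output.append((p[0] + r * (q[0] - p[0]), p[1] + r * (q[1] - p[1])))
    return output


def rect_iou(a, b):
    """Intersection over union of two rotated rectangles."""
    pa, pb = rect_corners(*a), rect_corners(*b)
    inter = clip_polygon(pa, pb)
    inter_area = polygon_area(inter) if len(inter) >= 3 else 0.0
    union = polygon_area(pa) + polygon_area(pb) - inter_area
    return inter_area / union if union > 0 else 0.0


def angle_offset_deg(theta_a, theta_b):
    """Smallest orientation difference, treating angles 180 degrees apart as equal."""
    d = abs(theta_a - theta_b) % 180.0
    return min(d, 180.0 - d)


def is_correct_grasp(pred, gt, iou_threshold=0.25, max_angle_deg=30.0):
    """True when both the IoU criterion and the orientation criterion are met."""
    return (rect_iou(pred, gt) > iou_threshold
            and angle_offset_deg(pred[4], gt[4]) < max_angle_deg)
```
]]></preformat>
<p>Calling <monospace>is_correct_grasp</monospace> with <monospace>iou_threshold=0.25</monospace> corresponds to the rough-grasp criterion, and with <monospace>iou_threshold=0.5</monospace> to the stable-grasp criterion.</p>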
</sec>
<sec>
<title>4.4. Grasp detection</title>
<p>We evaluated MGR-Net on our tool dataset and found that the model adapts to various types of tools. Moreover, our method can not only grasp the whole object but also use the robot operation information contained in the task to grasp a specific area of the tool, which helps the human take the target tool safely. <xref ref-type="fig" rid="F4">Figure 4</xref> shows qualitative results on previously unseen tools.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Qualitative results on different datasets.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082550-g0004.tif"/>
</fig>
<p><xref ref-type="table" rid="T2">Table 2</xref> shows how the overall grasp accuracy changes as the network module is improved. After obtaining the grasp representation of a tool from our detection network, we verified it on a Sawyer robot. Since the coordinate relationship between the camera and the robot is known, we transform the grasp representation from image space to the robot coordinate system. <xref ref-type="fig" rid="F5">Figure 5</xref> shows the verification process on the Sawyer robot: <xref ref-type="fig" rid="F5">Figures 5A, D</xref> show the output of our grasp detection network. After the grasp is converted from camera space to robot space, Sawyer moves to the designated position and closes the gripper, as shown in <xref ref-type="fig" rid="F5">Figures 5B, E</xref>. In <xref ref-type="fig" rid="F5">Figures 5C, F</xref>, the robot lifts the object to check whether the grasp succeeded. We used 20 previously unseen real tools, each placed in five different positions and orientations, for 100 grasp attempts in total, and the grasp accuracy was 92%. This experiment demonstrates the effectiveness of our method.</p>
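<p>The text does not specify how the conversion from image space to the robot coordinate system is implemented. A minimal sketch of one common approach is given below, assuming a pinhole camera model with known intrinsics, a metric depth value at the grasp center, and a known 4 &#x000D7; 4 homogeneous transform from the camera frame to the robot base frame. All function names, parameter names, and the row-major matrix layout are hypothetical and chosen only for illustration.</p>
<preformat preformat-type="code"><![CDATA[
```python
def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_robot(point_cam, T_robot_cam):
    """Apply a 4x4 homogeneous transform (row-major nested lists) to a 3-D point."""
    x, y, z = point_cam
    homogeneous = (x, y, z, 1.0)
    return tuple(
        sum(T_robot_cam[r][c] * homogeneous[c] for c in range(4)) for r in range(3)
    )


def grasp_center_to_robot(u, v, depth, intrinsics, T_robot_cam):
    """Map the grasp center from image space to the robot base frame."""
    fx, fy, cx, cy = intrinsics
    return camera_to_robot(pixel_to_camera(u, v, depth, fx, fy, cx, cy), T_robot_cam)
```
]]></preformat>
<p>In such a setup, the intrinsics would come from camera calibration and the transform from a hand-eye calibration. The grasp orientation and opening width would need an analogous conversion, which is omitted here.</p>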
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Ablation study.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th valign="top" align="left" style="background-color:#919497; color:#ffffff;"><bold>Network structure</bold></th>
<th valign="top" align="center" style="background-color:#919497; color:#ffffff;"><bold>Accuracy (25%)</bold></th>
<th valign="top" align="center" style="background-color:#919497; color:#ffffff;"><bold>Accuracy (50%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Residual block</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.83</td>
</tr> <tr>
<td valign="top" align="left">Only block</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.84</td>
</tr> <tr>
<td valign="top" align="left">GR-block</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.87</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Verification through robot platform. <bold>(A, D)</bold> The results of grasp detection. <bold>(B, E)</bold> The robot grasping tools. <bold>(C, F)</bold> The robot lifting tools to indicate whether the grasping is successful or not.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082550-g0005.tif"/>
</fig>
</sec>
<sec>
<title>4.5. Comparison of different approaches</title>
<p>Because traditional methods do not consider state-dependent task areas, for this comparison we treat the entire object as a single area with the grasp attribute; that is, the mask is the entire tool. We compared the accuracy of our network with previously reported results on the Jacquard dataset (<xref ref-type="table" rid="T3">Table 3</xref>). The stricter the IoU threshold, the clearer the advantage of our method: at the 25% threshold it exceeds the best previous result by 0.01 (0.95 vs. 0.94), and at the 50% threshold by 0.05 (0.77 vs. 0.72). To further test our grasping network, we evaluated it on a tool dataset that we collected ourselves with a RealSense camera. None of these self-collected images appear in our training set. We compared our method with those of Kumra et al. (<xref ref-type="bibr" rid="B21">2020</xref>) and Shukla et al. (<xref ref-type="bibr" rid="B34">2022</xref>), as shown in <xref ref-type="fig" rid="F6">Figures 6</xref>, <xref ref-type="fig" rid="F7">7</xref>. <xref ref-type="fig" rid="F6">Figure 6</xref> shows that on unseen real images with uneven lighting, our method finds the grasp configuration of objects more accurately and chooses a grasp box of suitable size. For example, when grasping a cup, it generates a small box at the handle, which avoids a collision between the gripper and the rest of the cup. <xref ref-type="fig" rid="F7">Figure 7</xref> shows that our method is robust to interference in multi-tool scenes and illustrates why generating an object mask is necessary.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Comparison of our grasp network with other work on the Jacquard dataset.</p></caption>
<table frame="box" rules="all">
<thead><tr>
<th valign="top" align="left" style="background-color:#919497; color:#ffffff;"><bold>References</bold></th>
<th valign="top" align="center" style="background-color:#919497; color:#ffffff;"><bold>Accuracy (25%)</bold></th>
<th valign="top" align="center" style="background-color:#919497; color:#ffffff;"><bold>Accuracy (50%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Depierre et al. (<xref ref-type="bibr" rid="B6">2018</xref>)</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">&#x02013;</td>
</tr> <tr>
<td valign="top" align="left">Zhou et al. (<xref ref-type="bibr" rid="B41">2018</xref>)</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">&#x02013;</td>
</tr> <tr>
<td valign="top" align="left">Kumra et al. (<xref ref-type="bibr" rid="B21">2020</xref>)</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.72</td>
</tr> <tr>
<td valign="top" align="left">Depierre et al. (<xref ref-type="bibr" rid="B7">2021</xref>)</td>
<td valign="top" align="center">0.86</td>
<td valign="top" align="center">&#x02013;</td>
</tr> <tr>
<td valign="top" align="left">Shukla et al. (<xref ref-type="bibr" rid="B34">2022</xref>)</td>
<td valign="top" align="center">0.90</td>
<td valign="top" align="center">0.69</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.77</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Results on single-tool images not seen during training.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082550-g0006.tif"/>
</fig>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Results on multi-tool images not seen during training.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-1082550-g0007.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusion</title>
<p>We presented a modular solution to tool usage in human-robot interaction. A multi-layer instance segmentation network helps the robot understand the regional attributes and semantics of objects under different states. Given the state assigned to the robot by the task, it can grasp or hand over novel objects using MGR-Net, our convolutional neural network, which takes <italic>n</italic>-channel input data and generates images from which a grasp rectangle can be inferred for each pixel.</p>
<p>We validated the proposed system on our robot platform. The results demonstrate that it performs accurate grasps of previously unseen objects under different states, and that it adapts, to a certain extent, to changes in lighting conditions.</p>
<p>In future work, we hope to extend our solution to more complex scenes, such as those in which tools overlap and occlude each other. We also plan to combine multiple viewpoints to improve the grasp success rate.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.</p>
</sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>TX proposed the method and designed experiments to verify the method, and then wrote this article. DZ and JY assisted in the experiment. YL reviewed and improved the manuscript. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This work was supported in part by the National Natural Science Foundation of China (Grant No. 61473155) and the Primary Research &#x00026; Development Plan of Jiangsu Province (Grant No. BE2017301).</p>
</sec>
<ack><p>We thank the editors and reviewers of the journal.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Antanas</surname> <given-names>L.</given-names></name> <name><surname>Moreno</surname> <given-names>P.</given-names></name> <name><surname>Neumann</surname> <given-names>M.</given-names></name> <name><surname>de Figueiredo</surname> <given-names>R. P.</given-names></name> <name><surname>Kersting</surname> <given-names>K.</given-names></name> <name><surname>Santos-Victor</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Semantic and geometric reasoning for robotic grasping: a probabilistic logic approach</article-title>. <source>Auton. Robots</source> <volume>43</volume>, <fpage>1393</fpage>&#x02013;<lpage>1418</lpage>. <pub-id pub-id-type="doi">10.1007/s10514-018-9784-8</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brahmbhatt</surname> <given-names>S.</given-names></name> <name><surname>Ham</surname> <given-names>C.</given-names></name> <name><surname>Kemp</surname> <given-names>C. C.</given-names></name> <name><surname>Hays</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;ContactDB: analyzing and predicting grasp contact via thermal imaging,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>8709</fpage>&#x02013;<lpage>8719</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00891</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>H.</given-names></name> <name><surname>Sun</surname> <given-names>K.</given-names></name> <name><surname>Tian</surname> <given-names>Z.</given-names></name> <name><surname>Shen</surname> <given-names>C.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name> <name><surname>Yan</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Blendmask: top-down meets bottom-up for instance segmentation,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>8573</fpage>&#x02013;<lpage>8581</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00860</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>L.-C.</given-names></name> <name><surname>Papandreou</surname> <given-names>G.</given-names></name> <name><surname>Kokkinos</surname> <given-names>I.</given-names></name> <name><surname>Murphy</surname> <given-names>K.</given-names></name> <name><surname>Yuille</surname> <given-names>A. L.</given-names></name></person-group> (<year>2017</year>). <article-title>Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>40</volume>, <fpage>834</fpage>&#x02013;<lpage>848</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2017.2699184</pub-id><pub-id pub-id-type="pmid">28463186</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chu</surname> <given-names>F.-J.</given-names></name> <name><surname>Xu</surname> <given-names>R.</given-names></name> <name><surname>Vela</surname> <given-names>P. A.</given-names></name></person-group> (<year>2019</year>). <article-title>Learning affordance segmentation for real-world robotic manipulation via synthetic images</article-title>. <source>IEEE Robot. Autom. Lett</source>. <volume>4</volume>, <fpage>1140</fpage>&#x02013;<lpage>1147</lpage>. <pub-id pub-id-type="doi">10.1109/LRA.2019.2894439</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Depierre</surname> <given-names>A.</given-names></name> <name><surname>Dellandr&#x000E9;a</surname> <given-names>E.</given-names></name> <name><surname>Chen</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Jacquard: a large scale dataset for robotic grasp detection,&#x0201D;</article-title> in <source>2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>, <fpage>3511</fpage>&#x02013;<lpage>3516</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2018.8593950</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Depierre</surname> <given-names>A.</given-names></name> <name><surname>Dellandr&#x000E9;a</surname> <given-names>E.</given-names></name> <name><surname>Chen</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Scoring graspability based on grasp regression for better grasp prediction,&#x0201D;</article-title> in <source>2021 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>4370</fpage>&#x02013;<lpage>4376</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA48506.2021.9561198</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Do</surname> <given-names>T.-T.</given-names></name> <name><surname>Nguyen</surname> <given-names>A.</given-names></name> <name><surname>Reid</surname> <given-names>I.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;AffordanceNet: an end-to-end deep learning approach for object affordance detection,&#x0201D;</article-title> in <source>2018 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>5882</fpage>&#x02013;<lpage>5889</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2018.8460902</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>M.</given-names></name> <name><surname>Bai</surname> <given-names>Y.</given-names></name> <name><surname>Wei</surname> <given-names>S.</given-names></name> <name><surname>Yu</surname> <given-names>X.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Robotic grasp detection based on transformer,&#x0201D;</article-title> in <source>International Conference on Intelligent Robotics and Applications</source> (<publisher-name>Springer</publisher-name>), <fpage>437</fpage>&#x02013;<lpage>448</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-031-13841-6_40</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>M.</given-names></name> <name><surname>Wei</surname> <given-names>S.</given-names></name> <name><surname>Yu</surname> <given-names>X.</given-names></name> <name><surname>Yin</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Mask-GD segmentation based robotic grasp detection</article-title>. <source>Comput. Commun</source>. <volume>178</volume>, <fpage>124</fpage>&#x02013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.1016/j.comcom.2021.07.012</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fang</surname> <given-names>K.</given-names></name> <name><surname>Zhu</surname> <given-names>Y.</given-names></name> <name><surname>Garg</surname> <given-names>A.</given-names></name> <name><surname>Kurenkov</surname> <given-names>A.</given-names></name> <name><surname>Mehta</surname> <given-names>V.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Learning task-oriented grasping for tool manipulation from simulated self-supervision</article-title>. <source>Int. J. Robot. Res</source>. <volume>39</volume>, <fpage>202</fpage>&#x02013;<lpage>216</lpage>. <pub-id pub-id-type="doi">10.1177/0278364919872545</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fischinger</surname> <given-names>D.</given-names></name> <name><surname>Weiss</surname> <given-names>A.</given-names></name> <name><surname>Vincze</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>Learning grasps with topographic features</article-title>. <source>Int. J. Robot. Res</source>. <volume>34</volume>, <fpage>1167</fpage>&#x02013;<lpage>1194</lpage>. <pub-id pub-id-type="doi">10.1177/0278364915577105</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fu</surname> <given-names>J.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;One-shot learning of manipulation skills with online dynamics adaptation and neural network priors,&#x0201D;</article-title> in <source>2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>, <fpage>4019</fpage>&#x02013;<lpage>4026</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2016.7759592</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>N.</given-names></name> <name><surname>Shan</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Yu</surname> <given-names>Y.</given-names></name> <name><surname>Yang</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;SSAP: single-shot instance segmentation with affinity pyramid,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, <fpage>642</fpage>&#x02013;<lpage>651</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00073</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>D.</given-names></name> <name><surname>Sun</surname> <given-names>F.</given-names></name> <name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Kong</surname> <given-names>T.</given-names></name> <name><surname>Fang</surname> <given-names>B.</given-names></name> <name><surname>Xi</surname> <given-names>N.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;A hybrid deep architecture for robotic grasp detection,&#x0201D;</article-title> in <source>2017 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>1609</fpage>&#x02013;<lpage>1614</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2017.7989191</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hart</surname> <given-names>S.</given-names></name> <name><surname>Dinh</surname> <given-names>P.</given-names></name> <name><surname>Hambuchen</surname> <given-names>K.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;The affordance template ROS package for robot task programming,&#x0201D;</article-title> in <source>2015 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>6227</fpage>&#x02013;<lpage>6234</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2015.7140073</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Gkioxari</surname> <given-names>G.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Mask R-CNN,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source>, <fpage>2961</fpage>&#x02013;<lpage>2969</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.322</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Moseson</surname> <given-names>S.</given-names></name> <name><surname>Saxena</surname> <given-names>A.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Efficient grasping from RGBD images: learning using a new rectangle representation,&#x0201D;</article-title> in <source>2011 IEEE International Conference on Robotics and Automation</source>, <fpage>3304</fpage>&#x02013;<lpage>3311</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2011.5980145</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kroemer</surname> <given-names>O.</given-names></name> <name><surname>Daniel</surname> <given-names>C.</given-names></name> <name><surname>Neumann</surname> <given-names>G.</given-names></name> <name><surname>Van Hoof</surname> <given-names>H.</given-names></name> <name><surname>Peters</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Towards learning hierarchical skills for multi-phase manipulation tasks,&#x0201D;</article-title> in <source>2015 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>1503</fpage>&#x02013;<lpage>1510</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2015.7139389</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kr&#x000FC;ger</surname> <given-names>N.</given-names></name> <name><surname>Geib</surname> <given-names>C.</given-names></name> <name><surname>Piater</surname> <given-names>J.</given-names></name> <name><surname>Petrick</surname> <given-names>R.</given-names></name> <name><surname>Steedman</surname> <given-names>M.</given-names></name> <name><surname>W&#x000F6;rg&#x000F6;tter</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>Object-action complexes: Grounded abstractions of sensory-motor processes</article-title>. <source>Robot. Auton. Syst</source>. <volume>59</volume>, <fpage>740</fpage>&#x02013;<lpage>757</lpage>. <pub-id pub-id-type="doi">10.1016/j.robot.2011.05.009</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kumra</surname> <given-names>S.</given-names></name> <name><surname>Joshi</surname> <given-names>S.</given-names></name> <name><surname>Sahin</surname> <given-names>F.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Antipodal robotic grasping using generative residual convolutional neural network,&#x0201D;</article-title> in <source>2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>, <fpage>9626</fpage>&#x02013;<lpage>9633</lpage>. <pub-id pub-id-type="doi">10.1109/IROS45743.2020.9340777</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kumra</surname> <given-names>S.</given-names></name> <name><surname>Kanan</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Robotic grasp detection using deep convolutional neural networks,&#x0201D;</article-title> in <source>2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>, <fpage>769</fpage>&#x02013;<lpage>776</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2017.8202237</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lenz</surname> <given-names>I.</given-names></name> <name><surname>Lee</surname> <given-names>H.</given-names></name> <name><surname>Saxena</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning for detecting robotic grasps</article-title>. <source>Int. J. Robot. Res</source>. <volume>34</volume>, <fpage>705</fpage>&#x02013;<lpage>724</lpage>. <pub-id pub-id-type="doi">10.1177/0278364914549607</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Levine</surname> <given-names>S.</given-names></name> <name><surname>Pastor</surname> <given-names>P.</given-names></name> <name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Ibarz</surname> <given-names>J.</given-names></name> <name><surname>Quillen</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection</article-title>. <source>Int. J. Robot. Res</source>. <volume>37</volume>, <fpage>421</fpage>&#x02013;<lpage>436</lpage>. <pub-id pub-id-type="doi">10.1177/0278364917710318</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>S.</given-names></name> <name><surname>Jia</surname> <given-names>J.</given-names></name> <name><surname>Fidler</surname> <given-names>S.</given-names></name> <name><surname>Urtasun</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;SGN: sequential grouping networks for instance segmentation,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source>, <fpage>3496</fpage>&#x02013;<lpage>3504</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.378</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Daruna</surname> <given-names>A.</given-names></name> <name><surname>Chernova</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;CAGE: context-aware grasping engine,&#x0201D;</article-title> in <source>2020 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>2550</fpage>&#x02013;<lpage>2556</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA40945.2020.9197289</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Mao</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>C.-Y.</given-names></name> <name><surname>Feichtenhofer</surname> <given-names>C.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name> <name><surname>Xie</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;A ConvNet for the 2020s,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>11976</fpage>&#x02013;<lpage>11986</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR52688.2022.01167</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mahler</surname> <given-names>J.</given-names></name> <name><surname>Liang</surname> <given-names>J.</given-names></name> <name><surname>Niyaz</surname> <given-names>S.</given-names></name> <name><surname>Laskey</surname> <given-names>M.</given-names></name> <name><surname>Doan</surname> <given-names>R.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Dex-Net 2.0: deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics</article-title>. <source>arXiv preprint arXiv:1703.09312</source>. <pub-id pub-id-type="doi">10.15607/RSS.2017.XIII.058</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mangin</surname> <given-names>O.</given-names></name> <name><surname>Roncone</surname> <given-names>A.</given-names></name> <name><surname>Scassellati</surname> <given-names>B.</given-names></name></person-group> (<year>2017</year>). <article-title>How to be helpful? Implementing supportive behaviors for human-robot collaboration</article-title>. <source>arXiv preprint arXiv:1710.11194</source>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Morrison</surname> <given-names>D.</given-names></name> <name><surname>Corke</surname> <given-names>P.</given-names></name> <name><surname>Leitner</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Learning robust, real-time, reactive robotic grasping</article-title>. <source>Int. J. Robot. Res</source>. <volume>39</volume>, <fpage>183</fpage>&#x02013;<lpage>201</lpage>. <pub-id pub-id-type="doi">10.1177/0278364919859066</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Myers</surname> <given-names>A.</given-names></name> <name><surname>Kanazawa</surname> <given-names>A.</given-names></name> <name><surname>Fermuller</surname> <given-names>C.</given-names></name> <name><surname>Aloimonos</surname> <given-names>Y.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Affordance of object parts from geometric features,&#x0201D;</article-title> in <source>Workshop on Vision meets Cognition, CVPR, Vol. 9</source>. <pub-id pub-id-type="doi">10.1109/ICRA.2015.7139369</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nguyen</surname> <given-names>A.</given-names></name> <name><surname>Kanoulas</surname> <given-names>D.</given-names></name> <name><surname>Caldwell</surname> <given-names>D. G.</given-names></name> <name><surname>Tsagarakis</surname> <given-names>N. G.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Object-based affordances detection with convolutional neural networks and dense conditional random fields,&#x0201D;</article-title> in <source>2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>, <fpage>5908</fpage>&#x02013;<lpage>5915</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2017.8206484</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmidt</surname> <given-names>P.</given-names></name> <name><surname>Vahrenkamp</surname> <given-names>N.</given-names></name> <name><surname>W&#x000E4;chter</surname> <given-names>M.</given-names></name> <name><surname>Asfour</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Grasping of unknown objects using deep convolutional neural networks based on depth images,&#x0201D;</article-title> in <source>2018 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>6831</fpage>&#x02013;<lpage>6838</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2018.8463204</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shukla</surname> <given-names>P.</given-names></name> <name><surname>Pramanik</surname> <given-names>N.</given-names></name> <name><surname>Mehta</surname> <given-names>D.</given-names></name> <name><surname>Nandi</surname> <given-names>G.</given-names></name></person-group> (<year>2022</year>). <article-title>Generative model based robotic grasp pose prediction with limited dataset</article-title>. <source>Appl. Intell</source>. <volume>52</volume>, <fpage>9952</fpage>&#x02013;<lpage>9966</lpage>. <pub-id pub-id-type="doi">10.1007/s10489-021-03011-z</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ten Pas</surname> <given-names>A.</given-names></name> <name><surname>Platt</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Using geometry to detect grasp poses in 3D point clouds,&#x0201D;</article-title> in <source>Robotics Research</source> (<publisher-loc>Springer</publisher-loc>), <fpage>307</fpage>&#x02013;<lpage>324</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-51532-8_19</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tian</surname> <given-names>Z.</given-names></name> <name><surname>Shen</surname> <given-names>C.</given-names></name> <name><surname>Chen</surname> <given-names>H.</given-names></name> <name><surname>He</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;FCOS: fully convolutional one-stage object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, <fpage>9627</fpage>&#x02013;<lpage>9636</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00972</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Kong</surname> <given-names>T.</given-names></name> <name><surname>Shen</surname> <given-names>C.</given-names></name> <name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name></person-group> (<year>2020a</year>). <article-title>&#x0201C;SOLO: segmenting objects by locations,&#x0201D;</article-title> in <source>European Conference on Computer Vision</source> (<publisher-loc>Springer</publisher-loc>), <fpage>649</fpage>&#x02013;<lpage>665</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-58523-5_38</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>R.</given-names></name> <name><surname>Kong</surname> <given-names>T.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Shen</surname> <given-names>C.</given-names></name></person-group> (<year>2020b</year>). <article-title>SOLOv2: dynamic and fast instance segmentation</article-title>. <source>Adv. Neural Inform. Process. Syst</source>. <volume>33</volume>, <fpage>17721</fpage>&#x02013;<lpage>17732</lpage>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zeng</surname> <given-names>A.</given-names></name> <name><surname>Song</surname> <given-names>S.</given-names></name> <name><surname>Yu</surname> <given-names>K.-T.</given-names></name> <name><surname>Donlon</surname> <given-names>E.</given-names></name> <name><surname>Hogan</surname> <given-names>F. R.</given-names></name> <name><surname>Bauza</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>&#x0201C;Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,&#x0201D;</article-title> in <source>2018 IEEE International Conference on Robotics and Automation (ICRA)</source>, <fpage>3750</fpage>&#x02013;<lpage>3757</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2018.8461044</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Kang</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>Object affordance detection with relationship-aware network</article-title>. <source>Neural Comput. Appl</source>. <volume>32</volume>, <fpage>14321</fpage>&#x02013;<lpage>14333</lpage>. <pub-id pub-id-type="doi">10.1007/s00521-019-04336-0</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>X.</given-names></name> <name><surname>Lan</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Tian</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Zheng</surname> <given-names>N.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Fully convolutional grasp detection network with oriented anchor box,&#x0201D;</article-title> in <source>2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>, <fpage>7223</fpage>&#x02013;<lpage>7230</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2018.8594116</pub-id></citation>
</ref>
</ref-list> 
</back>
</article> 