<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2022.844753</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Spatial relation learning in complementary scenarios with deep neural networks</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Lee</surname> <given-names>Jae Hee</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1614122/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Yao</surname> <given-names>Yuan</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/679920/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>&#x000D6;zdemir</surname> <given-names>Ozan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1376165/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Mengdi</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1792722/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Weber</surname> <given-names>Cornelius</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/731/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>Zhiyuan</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/811355/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wermter</surname> <given-names>Stefan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/21776/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Knowledge Technology Group, Department of Informatics, University of Hamburg</institution>, <addr-line>Hamburg</addr-line>, <country>Germany</country></aff>
<aff id="aff2"><sup>2</sup><institution>Natural Language Processing Lab, Department of Computer Science and Technology, Tsinghua University</institution>, <addr-line>Beijing</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Akira Taniguchi, Ritsumeikan University, Japan</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Nikhil Krishnaswamy, Colorado State University, United States; Sina Ardabili, University of Mohaghegh Ardabili, Iran</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jae Hee Lee <email>jae.hee.lee&#x00040;uni-hamburg.de</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>28</day>
<month>07</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>844753</elocation-id>
<history>
<date date-type="received">
<day>28</day>
<month>12</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>06</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Lee, Yao, &#x000D6;zdemir, Li, Weber, Liu and Wermter.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Lee, Yao, &#x000D6;zdemir, Li, Weber, Liu and Wermter</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract>
<p>A cognitive agent performing in the real world needs to learn relevant concepts about its environment (e.g., objects, colors, and shapes) and react accordingly. In addition to learning the concepts, it needs to learn <italic>relations</italic> between the concepts, in particular spatial relations between objects. In this paper, we propose three approaches that allow a cognitive agent to learn spatial relations. First, using an embodied model, the agent learns to reach toward an object based on simple instructions involving left-right relations. Since the level of realism and its complexity do not permit large-scale and diverse experiences in this approach, we devise as a second approach a simple visual dataset for geometric feature learning and show that recent reasoning models can learn directional relations in different frames of reference. Yet, embodied and simple simulation approaches together still do not provide sufficient experiences. To close this gap, we thirdly propose utilizing knowledge bases for disembodied spatial relation reasoning. Since the three approaches (i.e., embodied learning, learning from simple visual data, and use of knowledge bases) are complementary, we conceptualize a cognitive architecture that combines these approaches in the context of spatial relation learning.</p></abstract>
<kwd-group>
<kwd>spatial relation learning</kwd>
<kwd>deep neural networks</kwd>
<kwd>hybrid architecture</kwd>
<kwd>embodied language learning</kwd>
<kwd>distant supervision</kwd>
<kwd>frame of reference</kwd>
</kwd-group>
<contract-num rid="cn001">TRR-169</contract-num>
<contract-num rid="cn002">62061136001</contract-num>
<contract-sponsor id="cn001">Deutsche Forschungsgemeinschaft<named-content content-type="fundref-id">10.13039/501100001659</named-content></contract-sponsor>
<contract-sponsor id="cn002">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor>
<counts>
<fig-count count="9"/>
<table-count count="2"/>
<equation-count count="0"/>
<ref-count count="79"/>
<page-count count="0"/>
<word-count count="11143"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Spatial concepts and relations are essential for agents perceiving and acting in the physical space. Because of the ubiquitous nature of spatial concepts and relations, it is plausible from a developmental point of view to believe that they are &#x0201C;among the first to be formed in natural cognitive agents&#x0201D; (Freksa, <xref ref-type="bibr" rid="B21">2004</xref>). Endowing an artificial cognitive agent with the capability to reliably handle spatial concepts and relations can thus be regarded as an important task in AI and, in particular, in deep learning, which has become a predominant paradigm in AI (LeCun et al., <xref ref-type="bibr" rid="B38">2015</xref>).</p>
<p>In this paper, we present three different but complementary approaches to spatial relation learning with deep neural networks and propose a way to integrate them. In the first approach, a robotic agent collects experiences in its environment, learning about space in an <italic>embodied</italic> way. This approach allows the agent to ground the embodied experience similarly to how humans would and helps the agent learn more accurate linguistic concepts suitable for human-robot interaction (Bisk et al., <xref ref-type="bibr" rid="B6">2020</xref>). In such a learning setup, however, the variety of experiences is limited by multiple factors, such as exploration costs, the limited complexity of explored environments, and the robot&#x00027;s limited sensory, processing, and physical capabilities. These limitations are present not only in real physical environments but also, to a lesser extent, in simulated environments. To increase the variety of experiences, further approaches are required.</p>
<p>The second approach utilizes computer-generated large-scale <italic>image data</italic> for spatial relation learning. An immediate advantage of this approach is that data generation is cheaper than in the first approach, so that training a large model with millions of samples becomes possible. This allows learning of complex relations, which can depend on different frames of reference. However, in terms of detail, the simplified sensory input is insufficient for embodied multimodal learning, and in terms of variety, the number of automatically generated relations is still not on par with the variety of relations encountered in the real world.</p>
<p>In the third approach, a diverse and large amount of structured <italic>data from knowledge bases</italic> is used, which can be manually curated, crowd-sourced, or extracted from text resources on the web. This kind of data reflects human knowledge and experiences in unlimited domains beyond any specific scenarios. A limitation of this approach is that it provides primarily semantic information, from which spatial relations need to be inferred<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>, and these relations often do not involve directional relations<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>.</p>
<p>These different, complementary approaches of data access have fostered the development of distinct tasks and of distinct classes of models: typical <italic>embodied</italic> models process sequences of multimodal data and output actions for robot control; models using disembodied <italic>image data</italic> are frequently used for classification; models using disembodied <italic>data from knowledge bases</italic> process symbolic information and are often used for inference and reasoning. We argue that all three approaches, although not directly compatible, are necessary to solve real-world tasks that involve spatial relation learning (cf. <xref ref-type="fig" rid="F1">Figure 1</xref>). In this paper, we provide an example model for each of the three approaches. Moreover, we sketch a concept for their integration into a unified neural architecture for spatial relation learning.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Spatial relations between objects can be obtained in different ways. Consider the instruction to the robot: &#x0201C;Take the cup to the left of the fruit bowl to water the plant.&#x0201D; Prior embodied experiences are needed for grounding the instruction in the real world. From its camera image, the robot can infer that there are cups <monospace>on</monospace> the table, but it needs to resolve &#x0201C;<monospace>to</monospace> <monospace>the left</monospace> of the fruit bowl&#x0201D; to use the correct cup. To infer that the plant, which is not in the robot&#x00027;s field of view, is <monospace>on</monospace> the windowsill, the robot can use prior knowledge, e.g., retrieved from a knowledge base, since it is a typical location for a plant.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0001.tif"/>
</fig>
<p>Our contributions in this paper can be summarized as follows:</p>
<list list-type="order">
<list-item><p>We test an <italic>embodied</italic> language learning model on a realistic scenario with a 3D dataset including spatial relations (Section 3).</p></list-item>
<list-item><p>We present a new <italic>image data</italic> set and evaluate state-of-the-art models on spatial relation learning (Section 4).</p></list-item>
<list-item><p>We propose a way to apply a relation learning approach that uses <italic>data from knowledge bases</italic> to learning spatial relations (Section 5).</p></list-item>
<list-item><p>We provide a concept for integrating the three approaches and discuss further extension possibilities (Section 6).</p></list-item>
</list>
</sec>
<sec id="s2">
<title>2. Related work</title>
<p>In this section, we discuss previous work that is relevant for this paper, covering models for spatial relation learning and embodied language learning. We also introduce datasets that involve spatial relations and contrast them with the <italic>Qualitative Directional Relation Learning</italic> (QDRL) dataset that we propose in this paper.</p>
<sec>
<title>2.1. Embodied learning models</title>
<p>For embodied learning, an embodied agent (e.g., a robot) needs to act in an environment using its whole body or parts thereof (e.g., arms, hands, etc.). Since spatial relation learning, our interest within the scope of this paper, is a subset of language learning, in this part we refer to models that learn language in an embodied fashion. Specifically, we focus on embodied language learning with object manipulation. For a detailed and extensive review of language and robots, please refer to Tellex et al. (<xref ref-type="bibr" rid="B66">2020</xref>), where NLP-based robotic learning approaches are compared and categorized based on their technical approaches and the problems that they address.</p>
<p>Early robotic language learning approaches focused on mapping human language input to formal language which could be interpreted by a robot (Dzifcak et al., <xref ref-type="bibr" rid="B17">2009</xref>; Kollar et al., <xref ref-type="bibr" rid="B35">2010</xref>; Matuszek et al., <xref ref-type="bibr" rid="B48">2012</xref>, <xref ref-type="bibr" rid="B49">2013</xref>). Dzifcak et al. (<xref ref-type="bibr" rid="B17">2009</xref>) introduced an integrated robotic architecture that parsed natural language directions created from a limited vocabulary in order to execute actions and achieve goals in an office environment by using formal logic. Similarly, Kollar et al. (<xref ref-type="bibr" rid="B35">2010</xref>) proposed an embodied spatial learning system that learned how to navigate in a building according to given human language input by mapping the natural language directions into formal language clauses and grounding them in the environment to find the most probable paths. Further, Matuszek et al. (<xref ref-type="bibr" rid="B49">2013</xref>) introduced an approach that could parse natural language commands to a formal robot control language (RCL) in order to map directions to executable sequences of actions depending on the world state in a navigation setup. Moreover, Matuszek et al. (<xref ref-type="bibr" rid="B48">2012</xref>) put forward a joint multimodal approach that flexibly learned novel grounded object attributes in the scene based on the linguistic and visual input using an online probabilistic learning algorithm. These early works were all symbolic learning approaches, while we are interested in neural network-based learning approaches in this paper.</p>
<p>As embodied language learning usually involves executing actions according to language input or describing those actions using language, it generally requires not only language and visual perception but also proprioception. Recently, numerous studies have been reported in which different objects are manipulated by a robot for embodied language learning (Hatori et al., <xref ref-type="bibr" rid="B24">2017</xref>; Shridhar and Hsu, <xref ref-type="bibr" rid="B60">2018</xref>; Yamada et al., <xref ref-type="bibr" rid="B72">2018</xref>; Heinrich et al., <xref ref-type="bibr" rid="B26">2020</xref>; Shao et al., <xref ref-type="bibr" rid="B59">2020</xref>). Hatori et al. (<xref ref-type="bibr" rid="B24">2017</xref>) present a multimodal neural architecture which is composed of object recognition and language processing modules intended to learn the mapping between object names and actual objects as well as their attributes such as color, texture, or size for moving them to different boxes, especially in cluttered settings. Shridhar and Hsu (<xref ref-type="bibr" rid="B60">2018</xref>) introduce the INGRESS (interactive visual grounding of referring expressions) approach that has two streams, namely self-reference (describing the object with inherent characteristics in isolation) and relation (describing the object according to its spatial relation to other objects), and that can generate language expressions from input images to be compared with input commands to locate objects in question to pick them with the robotic arm. Yamada et al. (<xref ref-type="bibr" rid="B72">2018</xref>) propose the paired recurrent autoencoders (PRAE) model, which fuses language and action modalities in the latent feature space <italic>via</italic> a shared loss, for bidirectional translation between predefined language descriptions and simple robotic manipulation actions on objects. Heinrich et al. 
(<xref ref-type="bibr" rid="B26">2020</xref>) propose a biologically inspired crossmodal neural network approach, the adaptive multiple timescale recurrent neural network (adaptive MTRNN), which enables the robot to acquire language by listening to commands while interacting with objects in a playground environment. Shao et al. (<xref ref-type="bibr" rid="B59">2020</xref>) put forward a robot learning framework that combines a neural network with reinforcement learning, which accepts a linguistic instruction and a scene image as input and produces a motion trajectory, trained to obtain concepts of manipulation by watching video demonstrations from humans.</p>
</sec>
<sec>
<title>2.2. Spatial relation learning datasets</title>
<p>Spatial relation learning can be understood as a subproblem of visual relationship detection (VRD) (Lu et al., <xref ref-type="bibr" rid="B43">2016</xref>; Krishna et al., <xref ref-type="bibr" rid="B36">2017</xref>) that has as its task predicting the subject-predicate-object (SPO) triples from images. As the SPO triples are often biased toward frequent scenarios (e.g., a book on a table), datasets such as the UnRel Dataset (Peyre et al., <xref ref-type="bibr" rid="B54">2017</xref>) and the SpatialSense dataset (Yang K. et al., <xref ref-type="bibr" rid="B74">2019</xref>) were proposed to reduce the effect of the dataset bias. A task that is more general than visual relation detection and implicitly requires spatial relation learning is visual question answering (VQA), whose goal is to answer questions on a given image (Antol et al., <xref ref-type="bibr" rid="B2">2015</xref>; Goyal et al., <xref ref-type="bibr" rid="B23">2017</xref>; Wu et al., <xref ref-type="bibr" rid="B70">2017</xref>).</p>
<p>Existing datasets for VRD and VQA do not distinguish between different frames of reference, which increases not only the difficulty of spatial relation prediction but also the difficulty of analyzing the performance of the models. To overcome the limitations of the existing datasets, in Section 4 we propose the <italic>Qualitative Directional Relation Learning</italic> (QDRL) dataset for analyzing model performance on spatial relation learning in different frames of reference. Similar to the existing visual reasoning datasets CLEVR (Johnson et al., <xref ref-type="bibr" rid="B31">2017</xref>), ShapeWorld (Kuhnle and Copestake, <xref ref-type="bibr" rid="B37">2017</xref>), and SQOOP (Bahdanau et al., <xref ref-type="bibr" rid="B4">2019</xref>), QDRL is a generated dataset that allows for controlled evaluations of the models. But unlike the former three datasets, whose spatial relations are exclusively based on an <italic>absolute</italic> frame of reference, QDRL also allows us to test model performance concerning <italic>intrinsic</italic> and <italic>relative</italic> frames of reference.</p>
</sec>
<sec>
<title>2.3. Spatial relation learning models</title>
<p>One of the early approaches to learning spatial relations is the connectionist model proposed in Regier (<xref ref-type="bibr" rid="B56">1992</xref>), which was developed as a part of the L<sub>0</sub> project (Feldman et al., <xref ref-type="bibr" rid="B19">1996</xref>). As an early connectionist model, it is characterized by several hand-engineered components; e.g., the object boundaries and the orientations of the objects are preprocessed rather than learned from data. In Collell and Moens (<xref ref-type="bibr" rid="B13">2018</xref>), the authors propose a model that predicts the location and the size of an object based on another object that is in relation to it. The model uses bounding boxes and does not distinguish between left and right for location and size prediction. For general VRD and VQA problems, most models rely on the cues from the language models they employ and use the bounding box information (Wu et al., <xref ref-type="bibr" rid="B70">2017</xref>; Lu et al., <xref ref-type="bibr" rid="B44">2019</xref>; Tan and Bansal, <xref ref-type="bibr" rid="B65">2019</xref>). In contrast to the VRD and VQA models, models for visual reasoning such as FiLM (Perez et al., <xref ref-type="bibr" rid="B53">2018</xref>) and MAC (Hudson and Manning, <xref ref-type="bibr" rid="B28">2018</xref>) do not rely on bounding boxes or pretrained language models. Furthermore, these two models do not assume any task-specific knowledge, which is, for example, exploited by neuro-symbolic approaches or neural module networks (Andreas et al., <xref ref-type="bibr" rid="B1">2016</xref>; Yi et al., <xref ref-type="bibr" rid="B77">2018</xref>).</p>
</sec>
</sec>
<sec id="s3">
<title>3. Embodied spatial relation learning</title>
<p>Having a body and acting in the environment is essential for cognition: human cognition relies on embodied, context-dependent sensorimotor capabilities, i.e., perception and action are inseparable in experienced cognition (Varela et al., <xref ref-type="bibr" rid="B69">2017</xref>), as humans perceive the world through a variety of sensors and act in the world with their motor functions (Arbib et al., <xref ref-type="bibr" rid="B3">1986</xref>). Similarly, machines cannot infer the true meanings of words without experiencing the real world with vision, touch, and other sensors (Arbib et al., <xref ref-type="bibr" rid="B3">1986</xref>). Therefore, embodiment is also a necessary condition for spatial relation learning: by having an embodied agent situated in the environment, we can learn grounded meanings of spatial relations such as left or right.</p>
<p>A simple robotic scenario generally involves a robot manipulating a few objects on a table. The robot may either execute actions according to commands given in textual or audio form or translate actions into commands. This requires a crossmodal architecture that involves multiple modalities such as vision, language, and proprioception. Using multiple modalities benefits spatial relation learning, since seeing the object to be manipulated (vision), grounding commands that are associated with actions (language), and registering joint angle trajectories (proprioception) are all different interpretations of the world. For example, when executing a command such as &#x0201C;push the <monospace>left</monospace> object,&#x0201D; both seeing the objects on the table and moving the arm of the robot along the correct trajectory with learned joint angles support learning the position &#x0201C;<monospace>left</monospace>.&#x0201D;</p>
<sec>
<title>3.1. A bidirectional embodied model</title>
<p>A bidirectional embodied model, such as the PRAE (paired recurrent autoencoders; Yamada et al., <xref ref-type="bibr" rid="B72">2018</xref>), is an attractive approach to grounding language, since it is able both to execute simple robot actions given language descriptions and to generate language descriptions given executed and visually perceived actions. In our recent extension of the model in a robotic scenario (&#x000D6;zdemir et al., <xref ref-type="bibr" rid="B51">2021</xref>), schematically shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, two cubes of different colors are placed on a table at which the NICO robot (Kerzel et al., <xref ref-type="bibr" rid="B34">2017</xref>) is seated to interact with them (see <xref ref-type="fig" rid="F3">Figure 3</xref>). Given proprioceptive and visual input, the approach is capable of translating robot actions to textual descriptions. The proposed Paired Variational Autoencoders (PVAE) extension allows the model to associate each robot action with eight description alternatives, providing a one-to-many mapping, by using Stochastic Gradient Variational Bayes (SGVB).</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Bidirectional embodied model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0002.tif"/>
</fig>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The NICO robot (Kerzel et al., <xref ref-type="bibr" rid="B34">2017</xref>) in the simulation environment (&#x000D6;zdemir et al., <xref ref-type="bibr" rid="B51">2021</xref>). <bold>(Left)</bold> NICO is sliding the right cube. <bold>(Right)</bold> NICO is pulling the left cube. In both segments, NICO&#x00027;s field of view is shown in the top right insets.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0003.tif"/>
</fig>
<p>The model consists of two autoencoders: a language VAE and an action VAE. The language VAE learns descriptions, while the action VAE learns joint angle values conditioned on the visual input. After encoding, latent representations are obtained by randomly sampling from a Gaussian distribution parameterized by the encoded representations. A binding loss brings the two VAEs closer by reducing the distance between the two latent representations. Additionally, we introduced a channel-separated CAE (convolutional autoencoder) for the PVAE approach to extract visual features from the egocentric scene images (&#x000D6;zdemir et al., <xref ref-type="bibr" rid="B51">2021</xref>). Channel separation refers to training the same CAE once for each RGB channel and concatenating the features extracted from the middle layer of the CAE for each color channel to arrive at the combined visual features. The PVAE with channel-separated CAE visual feature extraction outperforms the standard PRAE (Yamada et al., <xref ref-type="bibr" rid="B72">2018</xref>) in the one-to-many translation of actions into language commands: the approach is significantly more successful than PRAE both with three and with six color alternatives per cube. Our findings suggest that variational autoencoders facilitate better one-to-many action-to-description translation and address the linguistic ambiguity between an action and its probable descriptions in the simple scenario shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. Moreover, channel separation in visual feature extraction leads to a more accurate recognition of object colors.</p>
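<p>As an illustration, the binding loss described above can be sketched as a squared distance between the two latent vectors. The following Python snippet is a minimal sketch under our own simplifying assumptions (plain lists instead of tensors; the exact distance measure and weighting in the PVAE implementation may differ):</p>

```python
def binding_loss(z_language, z_action):
    """Sketch of a binding loss: the mean squared distance between the
    latent vector of the language VAE and that of the action VAE.
    Minimizing it pulls paired descriptions and actions together in
    latent space. (Hypothetical helper, not the authors' exact code.)"""
    assert len(z_language) == len(z_action)
    return sum((zl - za) ** 2
               for zl, za in zip(z_language, z_action)) / len(z_language)
```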
</sec>
<sec>
<title>3.2. Embodied spatial relation learning dataset</title>
<p>The previously mentioned works on bidirectional autoencoders do not experiment with the models&#x00027; spatial relation learning capabilities. The instructions that the model processes are composed of three words, with the first word indicating the type of action (push, pull, or slide), the second the cube color (six color alternatives), and the last the speed at which the action is performed (slowly or fast). The commands &#x0201C;slide yellow slowly&#x0201D; and &#x0201C;pull pink fast&#x0201D; are example descriptions used for the model (cf. <xref ref-type="fig" rid="F3">Figure 3</xref>). The corpus therefore includes 36 possible sentences (3 actions &#x000D7; 6 colors &#x000D7; 2 speeds) without the alternative words, and 288 possible sentences are created by replacing each word with an alternative (36 &#x000D7; 2<sup>3</sup>). Moreover, the dataset consists of 12 action types (e.g., push left, pull right, etc.) and 12 cube arrangements (e.g., pink-yellow, red-green, etc.), thus of 144 patterns (12 action types &#x000D7; 12 arrangements).</p>
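<p>The corpus sizes above follow directly from the combinatorics; a small sanity check in Python (variable names are ours):</p>

```python
# Corpus combinatorics for the original three-word descriptions.
actions, colors, speeds = 3, 6, 2
base_sentences = actions * colors * speeds        # 3 x 6 x 2 words per slot
alternative_sentences = base_sentences * 2 ** 3   # each of the 3 words has one alternative
action_types, arrangements = 12, 12
patterns = action_types * arrangements            # action-arrangement combinations

print(base_sentences, alternative_sentences, patterns)  # prints: 36 288 144
```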
<p>We extend this corpus by adding &#x0201C;<monospace>left</monospace>&#x0201D; or &#x0201C;<monospace>right</monospace>&#x0201D; as a new term to each description. Hence, the above example descriptions become &#x0201C;slide <monospace>right</monospace> yellow slowly&#x0201D; and &#x0201C;pull <monospace>left</monospace> pink fast,&#x0201D; respectively; the descriptions are thus composed of four words.<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref> Color words may also be omitted so that the model needs to rely on the spatial specification. For simplicity, the cubes are placed on two fixed positions, and the two cubes on the table are never of the same color. We have trained the model with the modified descriptions using the same hyperparameters as in &#x000D6;zdemir et al. (<xref ref-type="bibr" rid="B51">2021</xref>) for 15,000 iterations with a learning rate of 10<sup>&#x02212;4</sup> and a batch size of 100<xref ref-type="fn" rid="fn0004"><sup>4</sup></xref>.</p>
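<p>The extension of a three-word description to a four-word description can be illustrated with a small hypothetical helper (the function name and the exact insertion rule are our assumptions; the actual dataset generation code may differ):</p>

```python
def add_spatial_term(description, side):
    """Insert the spatial term ("left" or "right") after the action
    word, so that, e.g., "slide yellow slowly" with side "right"
    becomes "slide right yellow slowly". (Illustrative helper only.)"""
    action, rest = description.split(" ", 1)
    return f"{action} {side} {rest}"
```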
</sec>
<sec>
<title>3.3. Results of the PVAE model</title>
<p>To translate actions to descriptions, we use the action encoder and language decoder: given joint angle values and visual features, we expect the model to produce the correct descriptions. For the bidirectional aspect of PVAE, we also test the language-to-action translation capability. For this task, we give as input one of the eight alternative descriptions for each pattern (action-description-arrangement combination) and we expect the model to predict the corresponding joint angle values. To that end, we use the language encoder and action decoder of PVAE. Both tasks are evaluated using the same trained model.</p>
<p>The results are as follows:</p>
<list list-type="bullet">
<list-item><p>PVAE is able to translate from actions to descriptions with 100% accuracy for all 144 patterns, including 108 training and 36 test patterns (see <xref ref-type="table" rid="T1">Table 1</xref>). This matches the results reported in &#x000D6;zdemir et al. (<xref ref-type="bibr" rid="B51">2021</xref>).</p>
</list-item>
<list-item><p>The predicted joint angle values closely match the original values, as can be seen in <xref ref-type="fig" rid="F4">Figure 4</xref> with qualitative results and in <xref ref-type="table" rid="T1">Table 1</xref> with average quantitative results in terms of the normalized root-mean-square error (nRMSE) between the original and predicted joint trajectories. Therefore, we expect the robot to execute correct actions according to the given instructions.</p>
</list-item>
</list>
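<p>For reference, the nRMSE reported in Table 1 can be computed as the root-mean-square error between the ground-truth and predicted joint trajectories, normalized by a property of the ground truth. The sketch below uses the value range as the normalizer, which is one common convention; the paper&#x00027;s exact choice of normalizer is an assumption here:</p>

```python
import math

def nrmse(truth, predicted):
    """Root-mean-square error between two equally long trajectories,
    normalized by the range of the ground truth (assumed convention).
    Lower values indicate a closer match."""
    mse = sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth)
    value_range = max(truth) - min(truth)
    return math.sqrt(mse) / value_range
```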
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Performance of PVAE on bidirectional translation.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Translation type</bold></th>
<th valign="top" align="left"><bold>Evaluation measure</bold></th>
<th valign="top" align="center"><bold>Train (%)</bold></th>
<th valign="top" align="center"><bold>Test (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Action &#x02192; Language</td>
<td valign="top" align="left">Description accuracy&#x02191;</td>
<td valign="top" align="center" style="background-color:#9bd3ae">100</td>
<td valign="top" align="center" style="background-color:#9bd3ae">100</td>
</tr>
<tr>
<td valign="top" align="left">Language &#x02192; Action</td>
<td valign="top" align="left">nRMSE&#x02193;</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.53</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.55</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Green background indicates good performance.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Examples of original and predicted joint angle trajectories for four different actions. The predicted values are generated by PVAE, given language descriptions and conditioned on visual input. Solid lines show the ground truth, while the dashed lines, which are often hidden by the solid lines, show the predicted joint angle values. The titles denote the action types, e.g., &#x0201C;PULL-R-SLOW&#x0201D; means pulling the <monospace>right</monospace> object slowly. The ground truth action trajectories with joint angle values were generated with an inverse kinematics solver in the simulation environment (&#x000D6;zdemir et al., <xref ref-type="bibr" rid="B51">2021</xref>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0004.tif"/>
</fig>
<p>In this scenario, describing an action with both the relative position and the color of the manipulated object is redundant, because the cube arrangements make either cue sufficient on its own. However, when two cubes of the same color are present on the table, adding relative position information to the descriptions is necessary to avoid ambiguity&#x02014;we do not test this case, since the dataset (&#x000D6;zdemir et al., <xref ref-type="bibr" rid="B51">2021</xref>) never places two cubes of the same color on the table simultaneously. Furthermore, the same action-to-description translation performance could be achieved after removing the color term from the descriptions, as the position of the manipulated object can be extracted from proprioception alone, i.e., from the joint angle values, without the vision modality: the action types encode the position (left or right) rather than the color of the cube.</p>
<p>For practical reasons, we do not simulate the robot with predicted joint angle trajectories. Due to certain subtleties in object manipulation (e.g., the exact contact point), the positions to which objects are moved in simulation may diverge from those of the original trajectories. Note further that, compared to a human of the same size, the arm movements of our robot (i.e., the NICO robot; Kerzel et al., <xref ref-type="bibr" rid="B34">2017</xref>) are more constrained due to fewer degrees of freedom, short arms, self-obstruction by the limbs, and an inflexible trunk. We, therefore, set up only a simple scenario with a left-right relation in the robot&#x00027;s egocentric frame of reference. In the following section, we tackle more complex spatial problems with multiple frames of reference.</p>
</sec>
</sec>
<sec id="s4">
<title>4. Spatial relation learning using image data</title>
<p>Investigating how well neural networks learn the geometric features underlying different spatial relations is an important step toward building robust deep learning models for learning spatial relations. In this section, we propose a new dataset that we call the <italic>Qualitative Directional Relation Learning</italic> (QDRL) dataset, which allows for testing the performance of deep learning models on directional relation learning tasks. We evaluate the performance of representative end-to-end neural models on the QDRL dataset with respect to different frames of reference and their generalizability to unseen entity-relation combinations (also known as compositional generalizability).</p>
<sec>
<title>4.1. Directional relation learning</title>
<p>Humans adopt different strategies when giving instructions to robots, where different frames of reference play a role (Tenbrink et al., <xref ref-type="bibr" rid="B67">2002</xref>). According to Levinson (<xref ref-type="bibr" rid="B39">1996</xref>), there are three kinds of frames of reference. In an <italic>absolute</italic> frame of reference, the location of an entity is given by a fixed frame of reference shared by all entities (cf. <bold>Figure 7</bold>). In an <italic>intrinsic</italic> frame of reference, each object determines the reference frame given by its orientation (cf. <bold>Figure 7</bold>). In a <italic>relative</italic> frame of reference, the direction between two entities determines the frame of reference for locating a third entity (cf. <bold>Figure 7</bold>).</p>
<p>In this section, we evaluate two deep learning models, FiLM (Perez et al., <xref ref-type="bibr" rid="B53">2018</xref>) and MAC (Hudson and Manning, <xref ref-type="bibr" rid="B28">2018</xref>). Schematically, <xref ref-type="fig" rid="F5">Figure 5</xref> shows that they take as input a raw RGB image and a question as a sequence of strings. These are turned into vectors <italic>v</italic> and <italic>q</italic> using a convolutional neural network (CNN) and a recurrent neural network (RNN), respectively. They produce a text answer as output, which here reduces to <monospace>true</monospace> or <monospace>false</monospace>. The two models differ in how they process <italic>v</italic> and <italic>q</italic>. As generic visual reasoning models they are fully differentiable and do not assume any task-specific knowledge (e.g., bounding boxes or the structure of the question).</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Spatial relation learning model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0005.tif"/>
</fig>
<p>The FiLM network processes <italic>v</italic> through a sequence of ResNet (He et al., <xref ref-type="bibr" rid="B25">2016</xref>) blocks in which the output values before the final ReLU activation function of each block are affinely transformed. The parameters for the affine transformations are obtained from the question vector <italic>q</italic> through a linear transformation. This way, FiLM allows the question to modulate what information passes through each ResNet block, which supports sequential reasoning.</p>
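<p>The modulation step at the core of FiLM can be sketched as follows (a simplified NumPy sketch with hypothetical names; the actual model applies this to convolutional feature maps inside each ResNet block):</p>

```python
import numpy as np

def film_params(q, W, b):
    # A linear transformation of the question vector q yields the
    # per-channel scaling (gamma) and shifting (beta) parameters.
    gamma_beta = W @ q + b
    gamma, beta = np.split(gamma_beta, 2)
    return gamma, beta

def film_modulate(features, gamma, beta):
    # Feature-wise affine transformation of a (C, H, W) feature map,
    # applied before the final ReLU of a ResNet block.
    return gamma[:, None, None] * features + beta[:, None, None]
```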
<p>The main idea of the MAC network is to model a reasoning process by keeping a sequence of control operations and a recurrent memory, where the control operations decide what information to retrieve from the image, and the memory retains information relevant for each reasoning step.</p>
</sec>
<sec>
<title>4.2. The qualitative directional relation learning dataset</title>
<p>The Qualitative Directional Relation Learning (QDRL) dataset we propose consists of (image, question, answer) triples<xref ref-type="fn" rid="fn0005"><sup>5</sup></xref>. Here, the <italic>question</italic> is a simple statement about the spatial relation between the objects in the image and has the form (<italic>head, relation, tail</italic>), e.g., (rabbit, <monospace>left_of</monospace>, cat). The <italic>answer</italic> can be either <monospace>true</monospace> or <monospace>false</monospace> and depends on the adopted frame of reference; the distribution of the truth values is balanced, such that no bias can be exploited. The <italic>image</italic> is of size 128 &#x000D7; 128 with a black background and contains non-overlapping entities of size 24 &#x000D7; 24. As entities we chose face emojis, which have clear front sides and thus facilitate detecting orientations. The samples are generated as follows. First, a fixed number <italic>n</italic> of emoji names are randomly chosen from 38 possible emoji names. Then a head entity <italic>h</italic>, a tail entity <italic>t</italic>, a relation <italic>r</italic>, and an answer <italic>a</italic> are randomly selected to form an (<italic>h, r, t</italic>) question triple and the ground-truth answer <italic>a</italic>. To produce a corresponding image, the <italic>n</italic> entities are randomly rotated and placed in the image until the constraint [(<italic>h, r, t</italic>), <italic>a</italic>] is satisfied. An example with ground truth answers concerning different frames of reference is given in <xref ref-type="fig" rid="F6">Figure 6</xref>.</p>
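<p>The generation procedure can be sketched as a rejection-sampling loop (a simplified sketch for the absolute frame of reference; entity rotation and the non-overlap constraint are omitted, and all helper names are illustrative):</p>

```python
import random

RELATIONS = ["above", "below", "left_of", "right_of"]

def holds(pos_head, pos_tail, relation):
    # Ground truth in an absolute frame of reference, comparing entity
    # centers in image coordinates (x grows rightward, y grows downward).
    (xh, yh), (xt, yt) = pos_head, pos_tail
    return {"above": yh < yt, "below": yh > yt,
            "left_of": xh < xt, "right_of": xh > xt}[relation]

def generate_sample(emoji_names, n=2, image_size=128, entity_size=24):
    names = random.sample(emoji_names, n)
    head, tail = names[0], names[1]
    relation = random.choice(RELATIONS)
    answer = random.choice([True, False])  # balanced truth values
    while True:  # re-place entities until [(h, r, t), a] is satisfied
        positions = {e: (random.randint(0, image_size - entity_size),
                         random.randint(0, image_size - entity_size))
                     for e in names}
        if holds(positions[head], positions[tail], relation) == answer:
            return positions, (head, relation, tail), answer
```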
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>An example of a QDRL dataset sample. Given an image, in an absolute and an intrinsic frame of reference a question about the image is a triple (entity1, <monospace>relation</monospace>, entity2), and in a relative frame of reference, a question about the image is a quadruple (entity1, <monospace>relation</monospace>, entity2, entity3). As can be seen in the ground truth answers to the question, different frames of reference (FoR) lead to different answers.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0006.tif"/>
</fig>
<p>As directional relations we use {<monospace>above</monospace>, <monospace>below</monospace>, <monospace>left_of</monospace>, <monospace>right_of</monospace>} for absolute and intrinsic frames of reference and {<monospace>in_front_of</monospace>, <monospace>behind</monospace>, <monospace>left_of</monospace>, <monospace>right_of</monospace>} for a relative frame of reference. Examples of the directional relations, where frames of reference are taken into consideration, are given in <xref ref-type="fig" rid="F7">Figure 7</xref>.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>The three frames of reference according to Levinson (<xref ref-type="bibr" rid="B39">1996</xref>). <bold>(Left)</bold> In an <italic>absolute frame of reference</italic>, the location of an entity is given by a fixed frame of reference shared by all entities (<monospace>above</monospace> is fixed to the north, here, of the cat). <bold>(Middle)</bold> In an <italic>intrinsic frame of reference</italic>, an object has its own frame of reference given by its orientation (here, the cat is oriented toward the northeast). <bold>(Right)</bold> In a <italic>relative frame of reference</italic>, the direction given by two entities (here, the direction from the rabbit to the cat) determines the frame of reference for locating a third entity (here, the dog).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0007.tif"/>
</fig>
<p>The QDRL dataset encourages a neural network model to learn the (oriented) bounding box of the reference entity, since this bounding box induces the decision boundaries for the different relations; which kind of bounding box a model has to learn depends on the given frame of reference. In an absolute frame of reference, a model has to learn the axis-aligned bounding box of the reference entity. In an intrinsic frame of reference, a model additionally has to learn the orientation of the reference entity and the bounding box aligned to that orientation. In a relative frame of reference, a model has to determine the centers of the reference entity and of the source entity that &#x0201C;sees&#x0201D; the reference entity, and align the bounding box to the direction from the center of the source entity to the center of the reference entity.</p>
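<p>For illustration, the decision rule in an intrinsic frame of reference can be sketched by rotating the displacement between entity centers into the reference entity's frame (a hypothetical sketch; the angle convention, measured counter-clockwise from the positive y-axis, and the tie-breaking are our assumptions):</p>

```python
import math

def intrinsic_relation(head_center, ref_center, ref_angle):
    # Rotate the displacement from the reference entity to the head
    # entity by -ref_angle, so that the reference entity's facing
    # direction aligns with the positive y-axis ("above").
    dx = head_center[0] - ref_center[0]
    dy = head_center[1] - ref_center[1]
    c, s = math.cos(-ref_angle), math.sin(-ref_angle)
    rx, ry = c * dx - s * dy, s * dx + c * dy
    if abs(ry) >= abs(rx):
        return "above" if ry > 0 else "below"
    return "right_of" if rx > 0 else "left_of"
```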
</sec>
<sec>
<title>4.3. Experiments on the QDRL dataset</title>
<p>In this section, we evaluate the performance of the FiLM and the MAC networks on the QDRL dataset with respect to different frames of reference and their compositional generalizability, i.e., their generalizability to unseen entity-relation combinations. To this end, we train the two models on 1,000,000 (image, question, answer) triples and validate on 10,000 such triples, where we vary the following parameters for each experiment: (i) the frame of reference and (ii) for absolute and intrinsic frames of reference, the number of entities in each scene (&#x02208;{2, 5}).</p>
<p>In addition to the standard validation set, to test how the models generalize compositionally, we hold out a subset <italic>S</italic> of 18 entities from the 32 entities appearing in the training set and make sure that every question in the training set involves at least one entity that is not in <italic>S</italic>. We then create a dataset of 10,000 (image, question, answer) triples exclusively with the entities from <italic>S</italic> and call it the <italic>compositional validation set</italic>. This way, it is guaranteed that the set of questions in the training set has no overlap with the set of questions in the compositional validation set. This allows us to test whether a model learns to disentangle entities and relations, as well as the syntactic structures, such that it can deal with unseen combinations of entities and relations.</p>
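<p>The split constraint can be sketched as two simple filters (illustrative helper names):</p>

```python
def valid_training_question(question, held_out):
    # Every training question must involve at least one entity
    # that is not in the held-out subset S.
    head, _, tail = question
    return head not in held_out or tail not in held_out

def valid_compositional_question(question, held_out):
    # Compositional validation questions use entities from S only,
    # so no question can overlap with the training set.
    head, _, tail = question
    return head in held_out and tail in held_out
```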
<p>All model hyperparameters, except for the number of FiLM blocks (&#x02208;{2, 4, 6}) and the MAC cells (&#x02208;{2, 8}) that we optimize, are taken from Bahdanau et al. (<xref ref-type="bibr" rid="B4">2019</xref>). For training, we choose 32 as the batch size and apply early stopping based on the model&#x00027;s performance on the validation set.</p>
</sec>
<sec>
<title>4.4. Results by FiLM and MAC models</title>
<p>In <xref ref-type="table" rid="T2">Table 2</xref>, we report the accuracy results of the experiments. From the table, we can observe the following.</p>
<list list-type="bullet">
<list-item><p>Learning directional relations in an intrinsic frame of reference and a relative frame of reference is more challenging than in an absolute frame of reference, which intuitively makes sense, as the models bear the extra burden of learning orientations.</p></list-item>
<list-item><p>All models achieve relatively high performance on the validation set, which indicates that both FiLM and MAC have sufficient capacity to learn the training distribution.</p></list-item>
<list-item><p>Regarding the compositional generalization set, for the FiLM model the difficulty of the tasks increases in the order of absolute, intrinsic, and relative frame of reference, whereas the MAC model is not affected by the frames of reference and consistently outperforms the FiLM model. The performance gap between MAC and FiLM is particularly large in the case of the relative frame of reference.</p></list-item>
<list-item><p>The MAC model shows an overall smaller gap between its performance on the validation set and on the compositional validation set. Even though MAC does not outperform FiLM on the validation set, its performance on the compositional validation set is consistently better.</p></list-item>
</list>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Accuracies of FiLM and MAC networks on the QDRL dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>FoR</bold><xref ref-type="table-fn" rid="TN1"><sup><italic>a</italic></sup></xref></th>
<th valign="top" align="left">&#x00023; <bold>Ents</bold><xref ref-type="table-fn" rid="TN2"><sup><italic>b</italic></sup></xref></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>FiLM</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>MAC</bold></th>
</tr>
<tr>
<th/>
<th/>
<th valign="top" align="left"><bold>Val</bold><xref ref-type="table-fn" rid="TN3"><sup><italic>c</italic></sup></xref></th>
<th valign="top" align="center"><bold>Comp</bold><xref ref-type="table-fn" rid="TN4"><sup><italic>d</italic></sup></xref></th>
<th valign="top" align="center"><bold>Val</bold></th>
<th valign="top" align="center"><bold>Comp</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" style="border-bottom: thin solid #000000;">Absolute</td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;">2</td>
<td valign="top" align="left" style="background-color:#9bd3ae; border-bottom: thin solid #000000">0.996</td>
<td valign="top" align="center" style="background-color:#9bd3ae; border-bottom: thin solid #000000">0.912</td>
<td valign="top" align="center" style="background-color:#9bd3ae; border-bottom: thin solid #000000">0.985</td>
<td valign="top" align="center" style="background-color:#9bd3ae; border-bottom: thin solid #000000">0.929</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">5</td>
<td valign="top" align="left" style="background-color:#9bd3ae">0.996</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.933</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.992</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.958</td>
</tr>
<tr>
<td valign="top" align="left" style="border-bottom: thin solid #000000;">Intrinsic</td>
<td valign="top" align="left" style="border-bottom: thin solid #000000;">2</td>
<td valign="top" align="left" style="background-color:#9bd3ae; border-bottom: thin solid #000000">0.979</td>
<td valign="top" align="center" style="background-color:#fdd09e; border-bottom: thin solid #000000">0.882</td>
<td valign="top" align="center" style="background-color:#9bd3ae; border-bottom: thin solid #000000">0.973</td>
<td valign="top" align="center" style="background-color:#9bd3ae; border-bottom: thin solid #000000">0.927</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">5</td>
<td valign="top" align="left" style="background-color:#9bd3ae">0.978</td>
<td valign="top" align="center" style="background-color:#fdd09e">0.862</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.967</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.937</td>
</tr>
<tr>
<td valign="top" align="left">Relative</td>
<td valign="top" align="left">3</td>
<td valign="top" align="left" style="background-color:#9bd3ae">0.978</td>
<td valign="top" align="center" style="background-color:#f8aa8f">0.745</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.978</td>
<td valign="top" align="center" style="background-color:#9bd3ae">0.975</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="TN1"><label>a</label><p>Frame of reference.</p></fn>
<fn id="TN2"><label>b</label><p>&#x00023; Entities.</p></fn>
<fn id="TN3"><label>c</label><p>Validation set.</p></fn>
<fn id="TN4"><label>d</label><p>Compositional validation set. Green background indicates good performance, red indicates worse performance.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>These results demonstrate that neural networks can learn spatial relations well in diverse frames of reference. However, due to the simplicity of the simulated dataset, it will be necessary to test the models&#x00027; capabilities on more realistic 3D data (cf. Section 3). Since it is difficult to model prior knowledge about spatial relations in the real world, in the following section we consider the possibility of making use of existing knowledge bases.</p>
</sec>
</sec>
<sec id="s5">
<title>5. Spatial relation learning using knowledge bases</title>
<p>Several large-scale commonsense knowledge bases have been created to store relational knowledge about objects as structured triples (Speer et al., <xref ref-type="bibr" rid="B63">2017</xref>; Ji et al., <xref ref-type="bibr" rid="B30">2021</xref>; Nayak et al., <xref ref-type="bibr" rid="B50">2021</xref>), such as (person, <monospace>riding</monospace>, horse) and (plant, <monospace>on</monospace>, windowsill). Intuitively, relational triples in commonsense knowledge bases store expected prior relations between objects, which can provide useful disembodied learning signals for relation detectors. Combined with object detectors, the relation detectors can produce structured graph representations of the scene, which help robots obtain a deep understanding of the environment and perform subsequent interactions.</p>
<p>To leverage commonsense knowledge bases for visual relation detection, we have proposed the visual distant supervision technique in Yao et al. (<xref ref-type="bibr" rid="B76">2021</xref>). Visual distant supervision aligns commonsense knowledge bases with unlabeled images to automatically create distantly labeled relation data, which can be used to train any visual relation detectors. The underlying assumption is that the relations between two objects in an image tend to be the same as their relations in the knowledge bases. As shown in the example in <xref ref-type="fig" rid="F8">Figure 8</xref>, since the object pair (bowl, cup) is labeled with relation <monospace>beside</monospace> in knowledge bases, an image with object pair (bowl, cup) will have <monospace>beside</monospace> as a candidate relation for the pair. In this way, visual distant supervision can train visual relation detectors without any human-labeled relation data, achieving strong performance compared to semi-supervised relation detectors that utilize several seed human annotations for each relation (Chen et al., <xref ref-type="bibr" rid="B10">2019</xref>).</p>
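<p>The labeling step of visual distant supervision can be sketched as a lookup of detected category pairs in the knowledge base (a simplified sketch with illustrative data; the full pipeline additionally denoises these candidate labels):</p>

```python
def distant_labels(object_pairs, knowledge_base):
    # Each detected object pair inherits, as candidate labels, every
    # relation that the knowledge base lists for its category pair.
    labels = []
    for head, tail in object_pairs:
        for kb_head, relation, kb_tail in knowledge_base:
            if (kb_head, kb_tail) == (head, tail):
                labels.append((head, relation, tail))
    return labels
```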
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Visual distant supervision (Yao et al., <xref ref-type="bibr" rid="B76">2021</xref>) retrieves plausible relations between the detected objects (only a selection of bounding boxes and relations is shown). Correct relation labels are highlighted in bold and green thick arrows.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0008.tif"/>
</fig>
<p>However, the assumption of distant supervision inevitably introduces noise into its automatic label generation, such as the relation label <monospace>beside</monospace> for the object pair (bowl, plant) in <xref ref-type="fig" rid="F8">Figure 8</xref>. The reason is that distant supervision depends only on object categories for relation label generation, without considering the complete image content or spatial layout. To alleviate the noise in distant supervision, we have proposed a denoising framework that iteratively refines the probabilistic relation labels based on the EM optimization method (Yao et al., <xref ref-type="bibr" rid="B76">2021</xref>). When human-labeled relation data is available, pretraining on distantly labeled data can also yield improvements over fully supervised relation detectors.</p>
<p>Despite its effectiveness in learning relations, distant supervision is not always useful for spatial relation learning (e.g., there is no prior knowledge about whether a cup with water should be to the <monospace>left</monospace> or to the <monospace>right</monospace> of the fruit bowl in <xref ref-type="fig" rid="F8">Figure 8</xref>). However, some relations have implicit spatial information, which can potentially be useful for spatial relation learning. For example, the relation <monospace>riding</monospace> implies the spatial relation <monospace>on</monospace>, where this implication can be obtained from linguistic knowledge bases, such as WordNet (Fellbaum, <xref ref-type="bibr" rid="B20">1998</xref>). Based on the implications, relation representations learned <italic>via</italic> distant supervision can be transferred to help spatial relation learning. Effectively leveraging distant supervision for spatial relation learning is, therefore, an important research problem.</p>
</sec>
<sec id="s6">
<title>6. Concept of an integrated architecture</title>
<p>The previous sections presented complementary models for spatial reasoning: a model to collect embodied, but costly, experiences; a model for plentiful, but oversimplifying, simulations; and a knowledge base enriched, but disembodied, technique. To achieve the intelligent behavior of an AI agent, the merits of such models must be combined. However, neural models mostly cannot be combined trivially in a modular setup with well-defined interfaces. Since our models have overlapping functionality, their combination must be designed into the architecture and realized by jointly training the architecture components. In the area of multi-task learning, there have been recent attempts to tackle multiple datasets and tasks, combining multiple inputs and outputs, with a single model (Kaiser et al., <xref ref-type="bibr" rid="B32">2017</xref>; Pramanik et al., <xref ref-type="bibr" rid="B55">2019</xref>; Lu et al., <xref ref-type="bibr" rid="B45">2020</xref>). The conjecture is that while multiple tasks are learned concurrently, learning one task can help the others. To transfer knowledge or skills, parts of the neural architecture are shared between the tasks.</p>
<p><xref ref-type="fig" rid="F9">Figure 9</xref> shows a concept for our proposition that follows a bidirectional model architecture (cf. Section 3), which enables tasks in two directions: The task to act, given language instructions, is best performed by embodied learning in a realistic 3D simulation (green arrows indicate the direction of the information flow). The task to produce language descriptions, given (visual) sensor input (red pathway), lends itself to using simulated visual data containing geometric relations that can be easily produced in large quantities (cf. Section 4). The representations on the central part benefit from joint training by forming a joint abstract representation of entities, which are independent of the input modality. The bidirectionality of the model ensures compatibility with both directions, while a large overlap in the joint central part should ensure that extensive spatial relation learning on large datasets can help the other tasks.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Concept of an integrated architecture for spatial relationship learning. Our presented models cover only input-to-output. The loop closure with the environment depicted here indicates an extension, such as dialogue with a human. Smaller loops on the decoders indicate low-level feedback-driven behaviors such as reaching a target object or producing a sentence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-16-844753-g0009.tif"/>
</fig>
<p>Disembodied knowledge (cf. Section 5), e.g., from a knowledge base, enters the model as another input (blue arrows in <xref ref-type="fig" rid="F9">Figure 9</xref>). The recognition of concepts, given language and sensor input, will activate related disembodied knowledge to help to finish the task, which can be achieved by first retrieving related relational triples from the knowledge base, and then obtaining enhanced representations of the language and sensor input with the retrieved knowledge. For example, when a robot is asked to &#x0201C;fetch the cup,&#x0201D; related relational triples will be retrieved, such as (cup, <monospace>on</monospace>, table), represented into embeddings, and integrated into the representation of the instruction so that the robot will expect to find the cup on the table first (see also <xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
<p>A practical methodology to incorporate disembodied knowledge into an integrated model is using a graph neural network (GNN) (Gori et al., <xref ref-type="bibr" rid="B22">2005</xref>; Liu and Zhou, <xref ref-type="bibr" rid="B42">2020</xref>). The disembodied knowledge is represented in the form of a graph structure where nodes capture the concepts and edges capture the existing relations between the nodes. Nodes hold a vector, while interactions over the edges are represented as neural networks, which share weights in case of same relation types. To compute a target function, vectors on each node are iteratively updated. The structure of the GNN can be derived from knowledge bases such as ConceptNet or Visual Genome, and the structure will be typically sparse, i.e., only a small proportion of node pairs will be connected. The trainable GNN parameters, including the input and readout connections that connect the GNN layer to the main neural model architecture (blue arrows in <xref ref-type="fig" rid="F9">Figure 9</xref>), can be trained as part of the integrated architecture. This requires a dataset, in which the model function can benefit from the GNN, such as commonsense reasoning (Talmor et al., <xref ref-type="bibr" rid="B64">2019</xref>). In our integrated model, the graph neural network would first need relevant nodes activated, which correspond to items from the visual input or to words from the language input. Such a mapping could be established by supervised pretraining. The GNN converts commonsense knowledge regarding object-to-object relations, which is encoded in its structure, to be used by the distributed representations of our neural model. Thereby the external knowledge gets fused with representations obtained from the instruction and visual signals in order to enhance spatial reasoning (Yang J. et al., <xref ref-type="bibr" rid="B73">2019</xref>).</p>
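<p>A single message-passing step of such a GNN can be sketched as follows (a minimal NumPy sketch; relation-specific weight sharing is taken from the text, while the summation aggregation and the residual ReLU update are our illustrative assumptions):</p>

```python
import numpy as np

def gnn_step(node_vecs, edges, relation_weights):
    # One round of message passing: every edge (src, rel, dst) sends a
    # message transformed by the weight matrix shared by all edges of
    # the same relation type; nodes aggregate messages by summation.
    messages = {n: np.zeros_like(v) for n, v in node_vecs.items()}
    for src, rel, dst in edges:
        messages[dst] += relation_weights[rel] @ node_vecs[src]
    # Residual update with a ReLU nonlinearity.
    return {n: np.maximum(node_vecs[n] + messages[n], 0.0)
            for n in node_vecs}
```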
<p>While our models are trained in a supervised way from pairs of input and output vectors, interacting with the environment means that actions are performed iteratively by an embodied learning agent (<xref ref-type="fig" rid="F9">Figure 9</xref> shows the environment in the loop). There are many approaches to training the action policy of an agent, including supervised learning (Shah et al., <xref ref-type="bibr" rid="B58">2021</xref>), imitation learning (Chevalier-Boisvert et al., <xref ref-type="bibr" rid="B11">2019</xref>; Chaplot et al., <xref ref-type="bibr" rid="B7">2020</xref>; Shridhar et al., <xref ref-type="bibr" rid="B61">2021</xref>), and reinforcement learning (Hermann et al., <xref ref-type="bibr" rid="B27">2017</xref>; Chaplot et al., <xref ref-type="bibr" rid="B8">2018</xref>; Li et al., <xref ref-type="bibr" rid="B41">2021</xref>). Among these approaches, reinforcement learning is the most versatile because it does not require human-labeled data for all situations; instead, the agent can learn its action policy by interacting with the environment and only occasionally receiving rewards (purple dashed arrow in <xref ref-type="fig" rid="F9">Figure 9</xref>). The reward function is typically designed manually based on the domain knowledge of the target task, or it can be an intrinsic reward function (Pathak et al., <xref ref-type="bibr" rid="B52">2017</xref>).</p>
</sec>
<sec sec-type="discussion" id="s7">
<title>7. Discussion</title>
<sec>
<title>7.1. Integrating reinforcement learning</title>
<p>Our bidirectional model is trained in a supervised fashion to perform physical actions in a continuous 3D space (Section 3). However, small deviations from a teacher trajectory could lead to failure, for example, in grasping an object. Reinforcement learning (RL), in contrast, is sensitive to the narrow regions in action space that distinguish successful from non-successful actions. In <xref ref-type="fig" rid="F9">Figure 9</xref>, we therefore suggest using RL as a superior method for the physical actions.</p>
<p>Goal-conditioned RL is advisable for cases where the agent&#x00027;s goal not only depends on the state of the environment, but is also conditioned on further input, such as the agent&#x00027;s internal state (Dickinson and Balleine, <xref ref-type="bibr" rid="B16">1994</xref>), or on language input as in our model. Goal-conditioned RL furthermore underlies hierarchical RL, where a higher-level module dynamically sets goals for a lower-level module, and hindsight experience replay (HER), where a future state in any trajectory is set as a goal in hindsight. With abundant high-quality trajectories available, Lynch and Sermanet (<xref ref-type="bibr" rid="B46">2021</xref>) use an imitation learning approach in which the agent applies HER to learn from crowd-sourced trajectories, pairing the goal representation with language input in order to realize a flexible language-to-action mapping.</p>
<p>While RL is established for learning physical actions and suitable for general use (Silver et al., <xref ref-type="bibr" rid="B62">2021</xref>), its use for language learning is still emergent (R&#x000F6;der et al., <xref ref-type="bibr" rid="B57">2021</xref>; Uc-Cetina et al., <xref ref-type="bibr" rid="B68">2021</xref>). Viewing language as a sequence production problem, our language decoder could benefit from the availability of high-quality forward models, such as the Transformer language model. Such a language model could be used as a forward model in a model-based RL algorithm, as done by the decision transformer (Chen et al., <xref ref-type="bibr" rid="B9">2021</xref>) and the trajectory transformer (Janner et al., <xref ref-type="bibr" rid="B29">2021</xref>). However, such open-domain language models are difficult to use in a visual context to achieve specified goals. In order to define terminal goals in RL for language learning in specific domains, simple visual guessing game scenarios were devised (Das et al., <xref ref-type="bibr" rid="B14">2017</xref>; Zhao et al., <xref ref-type="bibr" rid="B79">2021</xref>). The generated language can be further improved toward high-quality dialogue by rewarding certain properties like informativity, coherence, and ease of answering (Li et al., <xref ref-type="bibr" rid="B40">2016</xref>), which works in open domains, or by other scores (e.g., BLEU or ROUGE) that compare generated text to human-generated text (Keneshloo et al., <xref ref-type="bibr" rid="B33">2020</xref>). The contrast between domain-specific scenarios, which allow RL-based language learning to be guided <italic>via</italic> rewards, and sophisticated open-domain language models reflects the contrast between embodied and simulated learning, which affords control over spatial relations, and the use of knowledge bases with their open-domain information.</p>
<p>A challenge for deep RL is that its many parameters are trained from sparse and often binary reward feedback. Therefore, unsupervised or supervised pretraining of model components, such as sensory preprocessing modules, or of end-to-end components as described in Sections 3 and 4, can make deep RL training efficient.</p>
</sec>
<sec>
<title>7.2. Curriculum learning</title>
<p>For machine learning tasks that span multiple levels of difficulty, curriculum learning has been shown to be effective for a variety of models (Elman, <xref ref-type="bibr" rid="B18">1993</xref>; Bengio et al., <xref ref-type="bibr" rid="B5">2009</xref>).</p>
<p>In one of our experiments on the QDRL dataset (cf. Section 4) we observed that pretraining the FiLM model first on scenes in an <italic>intrinsic</italic> frame of reference with two objects and then fine-tuning it on scenes in a <italic>relative</italic> frame of reference with three objects helped the model to achieve about 0.9 accuracy on the compositional validation set, instead of 0.745 accuracy without the pretraining (cf. <xref ref-type="table" rid="T2">Table 2</xref>). This large increase in performance can be attributed to the fact that scenes in an intrinsic frame of reference are easier to learn, as the relations involve only two objects, while at the same time helping the model to learn the concept of orientation. Fine-tuning on scenes in a relative frame of reference thus requires only modifying the concept of orientation, i.e., orientation is determined by two objects instead of intrinsically (cf. <xref ref-type="fig" rid="F7">Figure 7</xref>).</p>
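<p>The FiLM conditioning underlying this experiment admits a compact sketch: the language input is mapped to per-channel scale and shift parameters that modulate the visual feature map. Below is a minimal NumPy version with random placeholder weights, not the trained model:</p>

```python
import numpy as np

def film_layer(features, question_emb, W_gamma, W_beta):
    """FiLM conditioning (Perez et al., 2018): map the question embedding to
    per-channel scale (gamma) and shift (beta) parameters and apply them as
    an affine modulation of the visual features.
    features: (channels, height, width); question_emb: (d,)"""
    gamma = W_gamma @ question_emb  # (channels,)
    beta = W_beta @ question_emb    # (channels,)
    # Broadcast the per-channel affine transform over all spatial positions.
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, d = 4, 3, 3, 8
x = rng.standard_normal((C, H, W))    # visual feature map
q = rng.standard_normal(d)            # language embedding
W_gamma = rng.standard_normal((C, d))
W_beta = rng.standard_normal((C, d))
out = film_layer(x, q, W_gamma, W_beta)
```

<p>Pretraining on two-object scenes shapes these conditioning weights before fine-tuning adjusts them to the relative frame of reference.</p>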
<p>A specific spatial relation between objects arises when one object occludes another, i.e., when one object is behind another from the observer&#x00027;s point of view. In the task of robotic object existence prediction by occlusion reasoning (Li et al., <xref ref-type="bibr" rid="B41">2021</xref>), a robot needs to reason whether a target object is possibly occluded by a visible object. Curriculum learning proved essential for successfully training the proposed model: we found that training it from scratch on data containing all types of scenes is hard. In the curriculum training strategy, the model is trained sequentially on four types of scenes of increasing difficulty. First, the model is trained on scenes with only one object. Then the model is trained on scenes with two objects, all of which are visible. Next, the model is trained on scenes with two objects with occlusion. Finally, the model is trained jointly on all possible scenes. After curriculum learning, the resulting model is able to handle all types of scenes well. Curriculum learning has also proven useful in other works on embodied learning (Wu et al., <xref ref-type="bibr" rid="B71">2019</xref>; Yang W. et al., <xref ref-type="bibr" rid="B75">2019</xref>).</p>
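<p>The four-stage strategy above amounts to a sequential training loop that carries the parameters over between stages. A schematic sketch, where the stage names follow the text but the scalar "model" and its update rule are purely illustrative:</p>

```python
def curriculum_train(model_step, stages):
    """Sequential curriculum: train on each stage's data in order of
    increasing difficulty, carrying the parameters over between stages.
    `model_step(params, batch)` returns the updated parameters."""
    params, completed = 0.0, []
    for name, batches in stages:
        for batch in batches:
            params = model_step(params, batch)
        completed.append(name)
    return params, completed

# Toy "training": each batch nudges a scalar parameter toward its value.
step = lambda p, b: p + 0.5 * (b - p)

stages = [
    ("one object, visible", [1.0, 1.0]),
    ("two objects, visible", [2.0, 2.0]),
    ("two objects, occlusion", [3.0, 3.0]),
    ("all scene types", [1.0, 2.0, 3.0]),
]
params, order = curriculum_train(step, stages)
```

<p>The point is structural: each stage starts from the parameters the previous stage produced, rather than from scratch.</p>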
<p>Models that use knowledge graphs can also benefit from gradually increasing levels of difficulty. For example, to decompose the prediction of a complex scene graph, Mao et al. (<xref ref-type="bibr" rid="B47">2019</xref>) propose to first predict easy relations that models are confident about, and then better infer difficult relations based on the easy ones. Zhang et al. (<xref ref-type="bibr" rid="B78">2021</xref>) leverage relation hierarchies in knowledge bases and propose to first learn coarse-grained relations that are distant in the hierarchy, and then distinguish fine-grained relations that are nearby.</p>
</sec>
<sec>
<title>7.3. Architecture extension possibilities</title>
<p>The combined model concept uses recurrent networks such as LSTMs as the action and language encoders and decoders, following the bidirectional embodied model architecture presented in this paper. Pretrained Transformer-based language models like BERT (Devlin et al., <xref ref-type="bibr" rid="B15">2019</xref>) do not have language grounded in the environment because they are trained exclusively on textual data: they are unimodal, with no visual or sensorimotor information considered. However, spatial reasoning requires visual and/or sensorimotor perception to make sense of whether an object is to the left or right of another. Therefore, in order to make use of a pretrained language model <italic>via</italic> transfer learning, we leave adopting a BERT model as a language encoder/decoder and fine-tuning it for future work. Integrating a language model in this manner should endow our combined model with commonsense knowledge without sacrificing its spatial reasoning capabilities.</p>
<p>Learning spatial relations requires reasoning about the frame of reference. In Section 4, the task was to learn spatial relations when frames of reference are given. A more challenging scenario arises when frames of reference are not given explicitly but need to be inferred. We often encounter this scenario in real-world conversations: some people tend to take the perspective of others, whereas others use the egocentric perspective. This gives rise to ambiguities, which need to be resolved in a dialogue through questions and answers.</p>
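<p>To make the ambiguity concrete: once a frame of reference fixes a forward direction, a left/right judgment reduces to the sign of a 2D cross product, and the same scene can yield opposite answers under intrinsic and relative frames. An illustrative sketch:</p>

```python
def is_left_of(target, relatum, forward):
    """Return True if `target` lies to the left of `relatum`, given the
    frame's 2D forward direction: the relatum's own heading (intrinsic
    frame) or the observer's viewing direction (relative frame)."""
    dx, dy = target[0] - relatum[0], target[1] - relatum[1]
    fx, fy = forward
    # Sign of the 2D cross product: positive means left of the forward axis.
    return fx * dy - fy * dx > 0

relatum, target = (0.0, 0.0), (-1.0, 0.0)

# Intrinsic frame: the relatum itself faces north (0, 1); the target to the
# west is then on its left.
intrinsic = is_left_of(target, relatum, (0.0, 1.0))

# Relative frame: an observer north of the relatum looks south (0, -1);
# from that viewpoint the very same target is on the right.
relative = is_left_of(target, relatum, (0.0, -1.0))
```

<p>Resolving which frame a speaker intends is exactly what a clarification dialogue would have to establish.</p>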
<p>Existing works have demonstrated that commonsense knowledge graphs can effectively facilitate visual relation learning. However, knowledge graphs are typically introduced to train a relation predictor that produces scene graphs for downstream tasks. To leverage the symbol-based scene graphs in downstream tasks, graph embedding models are usually needed, which makes the overall procedure expensive and cumbersome. In the future, knowledge graphs could be directly integrated into the representations of pretrained vision-language models during pretraining, helping the models to better learn objects and their relations. The knowledge in pretrained vision-language models could then be readily used to serve downstream tasks through simple fine-tuning.</p>
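<p>Graph embedding models of the kind mentioned above typically score triples by vector arithmetic. A minimal TransE-style sketch (Bordes et al., 2013), with toy hand-picked embeddings chosen purely for illustration:</p>

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: a relation is modeled as a translation in
    embedding space, so for a true triple h + r should be close to t
    (a less negative score means a more plausible triple)."""
    return -np.linalg.norm(h + r - t)

# Toy 2-D entity embeddings (illustrative values only).
emb = {
    "cup":   np.array([1.0, 0.0]),
    "table": np.array([1.0, -1.0]),
    "floor": np.array([5.0, 5.0]),
}
rel_on = np.array([0.0, -1.0])  # "is on top of" as a translation vector

plausible = transe_score(emb["cup"], rel_on, emb["table"])
implausible = transe_score(emb["cup"], rel_on, emb["floor"])
```

<p>Integrating such relational structure directly during vision-language pretraining would avoid a separate embedding stage at downstream time.</p>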
</sec>
</sec>
<sec sec-type="conclusions" id="s8">
<title>8. Conclusion</title>
<p>In this paper, we have investigated multiple approaches for spatial relation learning. We have shown that an embodied bidirectional model can generate physical actions from language descriptions and vice versa, involving simple left/right relations. We have then shown on a new simple visual dataset that recent visual reasoning models can learn spatial relations in multiple reference frames, with the MAC model outperforming the FiLM model. Since it is unrealistic for a robot to learn exhaustive world knowledge through interaction, or through simple visual datasets, we have considered using the relations from knowledge bases to infer likely spatial relations in a current scene. A practical limitation that has become apparent in our study is that different datasets are needed to learn complementary aspects of spatial reasoning, which hampers the development of a single joint model. This limitation may be overcome by developing more comprehensive datasets, or by devising integrated modular architectures. Finally, we have presented a concept of such an integrated architecture for combining the different models and tasks, which still requires implementation and validation in the future. We furthermore discussed possibilities for extending this architecture, which can serve as a basis for intelligent robots solving real-world tasks that require spatial relation learning and reasoning.</p>
</sec>
<sec sec-type="data-availability" id="s9">
<title>Data availability statement</title>
<p>The code for reproducing the results in Section 4 can be downloaded from <ext-link ext-link-type="uri" xlink:href="https://github.com/knowledgetechnologyuhh/QDRL">https://github.com/knowledgetechnologyuhh/QDRL</ext-link>.</p>
</sec>
<sec id="s10">
<title>Author contributions</title>
<p>JL, O&#x000D6;, YY, and ML developed, implemented, and evaluated the models. CW, ZL, and SW helped in writing and revising the paper. JL and YY collected the data. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s11">
<title>Funding</title>
<p>This work was jointly funded by the Natural Science Foundation of China (NSFC) and the German Research Foundation (DFG) in Project Crossmodal Learning, NSFC 62061136001/DFG TRR-169.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s12">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Andreas</surname> <given-names>J.</given-names></name> <name><surname>Rohrbach</surname> <given-names>M.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name> <name><surname>Klein</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>Neural module networks,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>39</fpage>&#x02013;<lpage>48</lpage>.</citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Antol</surname> <given-names>S.</given-names></name> <name><surname>Agrawal</surname> <given-names>A.</given-names></name> <name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Mitchell</surname> <given-names>M.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name> <name><surname>Lawrence Zitnick</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>VQA: visual question answering,</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2425</fpage>&#x02013;<lpage>2433</lpage>.<pub-id pub-id-type="pmid">30418897</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Arbib</surname> <given-names>M. A.</given-names></name> <name><surname>Arbib</surname> <given-names>M. A.</given-names></name> <name><surname>Hesse</surname> <given-names>M. B.</given-names></name></person-group> (<year>1986</year>). <source>The Construction of Reality</source>. <publisher-name>Cambridge University Press</publisher-name>. <pub-id pub-id-type="doi">10.1017/CBO9780511527234</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bahdanau</surname> <given-names>D.</given-names></name> <name><surname>Murty</surname> <given-names>S.</given-names></name> <name><surname>Noukhovitch</surname> <given-names>M.</given-names></name> <name><surname>Nguyen</surname> <given-names>T. H.</given-names></name> <name><surname>de Vries</surname> <given-names>H.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Systematic generalization: what is required and can it be learned?,</article-title> in <source>International Conference on Learning Representations</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Louradour</surname> <given-names>J.</given-names></name> <name><surname>Collobert</surname> <given-names>R.</given-names></name> <name><surname>Weston</surname> <given-names>J.</given-names></name></person-group> (<year>2009</year>). <article-title>Curriculum learning,</article-title> in <source>Proceedings of the 26th Annual International Conference on Machine Learning, ICML 09</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>), <fpage>41</fpage>&#x02013;<lpage>48</lpage>. <pub-id pub-id-type="doi">10.1145/1553374.1553380</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bisk</surname> <given-names>Y.</given-names></name> <name><surname>Holtzman</surname> <given-names>A.</given-names></name> <name><surname>Thomason</surname> <given-names>J.</given-names></name> <name><surname>Andreas</surname> <given-names>J.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Chai</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Experience grounds language</article-title>. <source>arXiv:2004.10151 [cs]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2004.10151</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Chaplot</surname> <given-names>D. S.</given-names></name> <name><surname>Gandhi</surname> <given-names>D.</given-names></name> <name><surname>Gupta</surname> <given-names>S.</given-names></name> <name><surname>Gupta</surname> <given-names>A.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R.</given-names></name></person-group> (<year>2020</year>). <article-title>Learning to explore using active neural SLAM,</article-title> in <source>8th International Conference on Learning Representations, ICLR 2020</source> (<publisher-loc>Addis Ababa</publisher-loc>). Available online at: <ext-link ext-link-type="uri" xlink:href="http://openreview.net/">openreview.net/</ext-link></citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chaplot</surname> <given-names>D. S.</given-names></name> <name><surname>Sathyendra</surname> <given-names>K. M.</given-names></name> <name><surname>Pasumarthi</surname> <given-names>R. K.</given-names></name> <name><surname>Rajagopal</surname> <given-names>D.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>Gated-attention architectures for task-oriented language grounding,</article-title> in <source>Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence</source> (<publisher-loc>New Orleans, LA</publisher-loc>), <fpage>2819</fpage>&#x02013;<lpage>2826</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v32i1.11832</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>L.</given-names></name> <name><surname>Lu</surname> <given-names>K.</given-names></name> <name><surname>Rajeswaran</surname> <given-names>A.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Grover</surname> <given-names>A.</given-names></name> <name><surname>Laskin</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Decision transformer: reinforcement learning via sequence modeling</article-title>. <source>arXiv preprint arXiv:2106.01345</source>.</citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>V. S.</given-names></name> <name><surname>Varma</surname> <given-names>P.</given-names></name> <name><surname>Krishna</surname> <given-names>R.</given-names></name> <name><surname>Bernstein</surname> <given-names>M.</given-names></name> <name><surname>Re</surname> <given-names>C.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2019</year>). <article-title>Scene graph prediction with limited labels,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2580</fpage>&#x02013;<lpage>2590</lpage>. <pub-id pub-id-type="doi">10.1109/ICCVW.2019.00220</pub-id><pub-id pub-id-type="pmid">32218709</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chevalier-Boisvert</surname> <given-names>M.</given-names></name> <name><surname>Bahdanau</surname> <given-names>D.</given-names></name> <name><surname>Lahlou</surname> <given-names>S.</given-names></name> <name><surname>Willems</surname> <given-names>L.</given-names></name> <name><surname>Saharia</surname> <given-names>C.</given-names></name> <name><surname>Nguyen</surname> <given-names>T. H.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>BabyAI: first steps towards grounded language learning with a human in the loop,</article-title> in <source>International Conference on Learning Representations</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Collell</surname> <given-names>G.</given-names></name> <name><surname>Gool</surname> <given-names>L. V.</given-names></name> <name><surname>Moens</surname> <given-names>M.-F.</given-names></name></person-group> (<year>2018</year>). <article-title>Acquiring common sense spatial knowledge through implicit spatial templates,</article-title> in <source>Thirty-Second AAAI Conference on Artificial Intelligence</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>AAAI</publisher-name>)</citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Collell</surname> <given-names>G.</given-names></name> <name><surname>Moens</surname> <given-names>M.-F.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning representations specialized in spatial knowledge: leveraging language and vision</article-title>. <source>Trans. Assoc. Comput. Linguist.</source> <volume>6</volume>, <fpage>133</fpage>&#x02013;<lpage>144</lpage>. <pub-id pub-id-type="doi">10.1162/tacl_a_00010</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Das</surname> <given-names>A.</given-names></name> <name><surname>Kottur</surname> <given-names>S.</given-names></name> <name><surname>Moura</surname> <given-names>J. M. F.</given-names></name> <name><surname>Lee</surname> <given-names>S.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>Learning cooperative visual dialog agents with deep reinforcement learning,</article-title> in <source>2017 IEEE International Conference on Computer Vision</source> (<publisher-loc>Venice</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2970</fpage>&#x02013;<lpage>2979</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.321</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Devlin</surname> <given-names>J.</given-names></name> <name><surname>Chang</surname> <given-names>M.-W.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Toutanova</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>BERT: pre-training of deep bidirectional transformers for language understanding,</article-title> in <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1</source> (<publisher-loc>Minneapolis, MN</publisher-loc>: <publisher-name>ACM</publisher-name>).<pub-id pub-id-type="pmid">35689168</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dickinson</surname> <given-names>A.</given-names></name> <name><surname>Balleine</surname> <given-names>B.</given-names></name></person-group> (<year>1994</year>). <article-title>Motivational control of goal-directed action</article-title>. <source>Anim. Learn. Behav</source>. <volume>22</volume>, <fpage>1</fpage>&#x02013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.3758/BF03199951</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dzifcak</surname> <given-names>J.</given-names></name> <name><surname>Scheutz</surname> <given-names>M.</given-names></name> <name><surname>Baral</surname> <given-names>C.</given-names></name> <name><surname>Schermerhorn</surname> <given-names>P.</given-names></name></person-group> (<year>2009</year>). <article-title>What to do and how to do it: translating natural language directives into temporal and dynamic logic representation for goal management and action execution,</article-title> in <source>2009 IEEE International Conference on Robotics and Automation</source>, <fpage>4163</fpage>&#x02013;<lpage>4168</lpage>. <pub-id pub-id-type="doi">10.1109/ROBOT.2009.5152776</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elman</surname> <given-names>J. L.</given-names></name></person-group> (<year>1993</year>). <article-title>Learning and development in neural networks: the importance of starting small</article-title>. <source>Cognition</source> <fpage>71</fpage>&#x02013;<lpage>99</lpage>. <pub-id pub-id-type="doi">10.1016/0010-0277(93)90058-4</pub-id><pub-id pub-id-type="pmid">8403835</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Feldman</surname> <given-names>J.</given-names></name> <name><surname>Lakoff</surname> <given-names>G.</given-names></name> <name><surname>Bailey</surname> <given-names>D.</given-names></name> <name><surname>Narayanan</surname> <given-names>S.</given-names></name> <name><surname>Regier</surname> <given-names>T.</given-names></name> <name><surname>Stolcke</surname> <given-names>A.</given-names></name></person-group> (<year>1996</year>). <article-title>L0&#x02014;the first five years of an automated language acquisition project,</article-title> in <source>Integration of Natural Language and Vision Processing: Theory and Grounding Representations Volume III</source>, ed <person-group person-group-type="editor"><name><surname>Mc Kevitt</surname> <given-names>P.</given-names></name></person-group> (<publisher-loc>Dordrecht</publisher-loc>: <publisher-name>Springer Netherlands</publisher-name>), <fpage>205</fpage>&#x02013;<lpage>231</lpage>. <pub-id pub-id-type="doi">10.1007/978-94-009-1639-515</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fellbaum</surname> <given-names>C.</given-names></name></person-group> (<year>1998</year>). <source>WordNet: An Electronic Lexical Database</source>. <publisher-name>Bradford Books</publisher-name>. <pub-id pub-id-type="doi">10.7551/mitpress/7287.001.0001</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Freksa</surname> <given-names>C.</given-names></name></person-group> (<year>2004</year>). <article-title>Spatial cognition an AI perspective,</article-title> in <source>Proceedings of the 16th European Conference on Artificial Intelligence, ECAI 04</source> (<publisher-loc>Valencia</publisher-loc>: <publisher-name>IOS Press</publisher-name>), <fpage>1122</fpage>&#x02013;<lpage>1128</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Gori</surname> <given-names>M.</given-names></name> <name><surname>Monfardini</surname> <given-names>G.</given-names></name> <name><surname>Scarselli</surname> <given-names>F.</given-names></name></person-group> (<year>2005</year>). <article-title>A new model for learning in graph domains,</article-title> in <source>Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005, Vol. 2</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>729</fpage>&#x02013;<lpage>734</lpage>. <pub-id pub-id-type="doi">10.1109/IJCNN.2005.1555942</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goyal</surname> <given-names>Y.</given-names></name> <name><surname>Khot</surname> <given-names>T.</given-names></name> <name><surname>Summers-Stay</surname> <given-names>D.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name> <name><surname>Parikh</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>Making the v in VQA matter: elevating the role of image understanding in visual question answering,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6904</fpage>&#x02013;<lpage>6913</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hatori</surname> <given-names>J.</given-names></name> <name><surname>Kikuchi</surname> <given-names>Y.</given-names></name> <name><surname>Kobayashi</surname> <given-names>S.</given-names></name> <name><surname>Takahashi</surname> <given-names>K.</given-names></name> <name><surname>Tsuboi</surname> <given-names>Y.</given-names></name> <name><surname>Unno</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Interactively picking real-world objects with unconstrained spoken language instructions</article-title>. <source>CoRR, abs/1710.06280</source>. <pub-id pub-id-type="doi">10.1109/ICRA.2018.8460699</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep residual learning for image recognition,</article-title> in <source>2016 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.90</pub-id><pub-id pub-id-type="pmid">32166560</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Heinrich</surname> <given-names>S.</given-names></name> <name><surname>Yao</surname> <given-names>Y.</given-names></name> <name><surname>Hinz</surname> <given-names>T.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Hummel</surname> <given-names>T.</given-names></name> <name><surname>Kerzel</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Crossmodal language grounding in an embodied neurocognitive model</article-title>. <source>Front. Neurorobot</source>. <volume>14</volume>:<fpage>52</fpage>. <pub-id pub-id-type="doi">10.3389/fnbot.2020.00052</pub-id><pub-id pub-id-type="pmid">33154720</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hermann</surname> <given-names>K. M.</given-names></name> <name><surname>Hill</surname> <given-names>F.</given-names></name> <name><surname>Green</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>F.</given-names></name> <name><surname>Faulkner</surname> <given-names>R.</given-names></name> <name><surname>Soyer</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Grounded language learning in a simulated 3D world</article-title>. <source>arXiv preprint arXiv:1706.06551</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1706.06551</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hudson</surname> <given-names>D. A.</given-names></name> <name><surname>Manning</surname> <given-names>C. D.</given-names></name></person-group> (<year>2018</year>). <article-title>Compositional attention networks for machine reasoning</article-title>. <source>arXiv:1803.03067 [cs]</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1803.03067</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Janner</surname> <given-names>M.</given-names></name> <name><surname>Li</surname> <given-names>Q.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Reinforcement learning as one big sequence modeling problem</article-title>. <source>arXiv preprint arXiv:2106.02039</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2106.02039</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ji</surname> <given-names>S.</given-names></name> <name><surname>Pan</surname> <given-names>S.</given-names></name> <name><surname>Cambria</surname> <given-names>E.</given-names></name> <name><surname>Marttinen</surname> <given-names>P.</given-names></name> <name><surname>Yu</surname> <given-names>P. S.</given-names></name></person-group> (<year>2021</year>). <article-title>A survey on knowledge graphs: Representation, acquisition, and applications</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>33</volume>, <fpage>1</fpage>&#x02013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2021.3070843</pub-id><pub-id pub-id-type="pmid">33900922</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>J.</given-names></name> <name><surname>Hariharan</surname> <given-names>B.</given-names></name> <name><surname>van der Maaten</surname> <given-names>L.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name> <name><surname>Lawrence Zitnick</surname> <given-names>C.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning,</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2901</fpage>&#x02013;<lpage>2910</lpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaiser</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>One model to learn them all</article-title>. <source>arXiv preprint arXiv:1706.05137</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1706.05137</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keneshloo</surname> <given-names>Y.</given-names></name> <name><surname>Shi</surname> <given-names>T.</given-names></name> <name><surname>Ramakrishnan</surname> <given-names>N.</given-names></name> <name><surname>Reddy</surname> <given-names>C. K.</given-names></name></person-group> (<year>2020</year>). <article-title>Deep reinforcement learning for sequence-to-sequence models</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>31</volume>, <fpage>2469</fpage>&#x02013;<lpage>2489</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2019.2929141</pub-id><pub-id pub-id-type="pmid">31425057</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kerzel</surname> <given-names>M.</given-names></name> <name><surname>Strahl</surname> <given-names>E.</given-names></name> <name><surname>Magg</surname> <given-names>S.</given-names></name> <name><surname>Navarro-Guerrero</surname> <given-names>N.</given-names></name> <name><surname>Heinrich</surname> <given-names>S.</given-names></name> <name><surname>Wermter</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>NICO&#x02014;Neuro-Inspired COmpanion: a developmental humanoid robot platform for multimodal interaction,</article-title> in <source>26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN)</source> (<publisher-name>IEEE</publisher-name>), <fpage>113</fpage>&#x02013;<lpage>120</lpage>. <pub-id pub-id-type="doi">10.1109/ROMAN.2017.8172289</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kollar</surname> <given-names>T.</given-names></name> <name><surname>Tellex</surname> <given-names>S.</given-names></name> <name><surname>Roy</surname> <given-names>D.</given-names></name> <name><surname>Roy</surname> <given-names>N.</given-names></name></person-group> (<year>2010</year>). <article-title>Toward understanding natural language directions,</article-title> in <source>2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (Osaka)</source>, <fpage>259</fpage>&#x02013;<lpage>266</lpage>. <pub-id pub-id-type="doi">10.1109/HRI.2010.5453186</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krishna</surname> <given-names>R.</given-names></name> <name><surname>Zhu</surname> <given-names>Y.</given-names></name> <name><surname>Groth</surname> <given-names>O.</given-names></name> <name><surname>Johnson</surname> <given-names>J.</given-names></name> <name><surname>Hata</surname> <given-names>K.</given-names></name> <name><surname>Kravitz</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Visual genome: connecting language and vision using crowdsourced dense image annotations</article-title>. <source>Int. J. Comput. Vis.</source> <volume>123</volume>, <fpage>32</fpage>&#x02013;<lpage>73</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-016-0981-7</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kuhnle</surname> <given-names>A.</given-names></name> <name><surname>Copestake</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>ShapeWorld&#x02014;a new test methodology for multimodal language understanding</article-title>. <source>arXiv preprint arXiv:1704.04517</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1704.04517</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning</article-title>. <source>Nature</source> <volume>521</volume>, <fpage>436</fpage>&#x02013;<lpage>444</lpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id><pub-id pub-id-type="pmid">26017442</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Levinson</surname> <given-names>S.</given-names></name></person-group> (<year>1996</year>). <article-title>Frames of reference and Molyneux&#x00027;s question: crosslinguistic evidence,</article-title> in <source>Language and Space</source> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>A Bradford Book</publisher-name>), <fpage>109</fpage>&#x02013;<lpage>170</lpage>.</citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Monroe</surname> <given-names>W.</given-names></name> <name><surname>Ritter</surname> <given-names>A.</given-names></name> <name><surname>Jurafsky</surname> <given-names>D.</given-names></name> <name><surname>Galley</surname> <given-names>M.</given-names></name> <name><surname>Gao</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep reinforcement learning for dialogue generation,</article-title> in <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source> (<publisher-loc>Austin, TX</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>1192</fpage>&#x02013;<lpage>1202</lpage>. <pub-id pub-id-type="doi">10.18653/v1/D16-1127</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>M.</given-names></name> <name><surname>Weber</surname> <given-names>C.</given-names></name> <name><surname>Kerzel</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>J. H.</given-names></name> <name><surname>Zeng</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Wermter</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Robotic occlusion reasoning for efficient object existence prediction,</article-title> in <source>Proceedings of The International Conference on Intelligent Robots and Systems</source> (<publisher-loc>Prague</publisher-loc>: <publisher-name>IROS</publisher-name>). <pub-id pub-id-type="doi">10.1109/IROS51168.2021.9635947</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Zhou</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Introduction to graph neural networks,</article-title> in <source>Synthesis Lectures on Artificial Intelligence and Machine Learning</source> <fpage>1</fpage>&#x02013;<lpage>127</lpage>. <pub-id pub-id-type="doi">10.2200/S00980ED1V01Y202001AIM045</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>C.</given-names></name> <name><surname>Krishna</surname> <given-names>R.</given-names></name> <name><surname>Bernstein</surname> <given-names>M.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2016</year>). <article-title>Visual relationship detection with language priors,</article-title> in <source>Computer Vision&#x02014;ECCV 2016</source>, Lecture Notes in Computer Science, eds <person-group person-group-type="editor"><name><surname>Leibe</surname> <given-names>B.</given-names></name> <name><surname>Matas</surname> <given-names>J.</given-names></name> <name><surname>Sebe</surname> <given-names>N.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>852</fpage>&#x02013;<lpage>869</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46448-0_51</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name> <name><surname>Parikh</surname> <given-names>D.</given-names></name> <name><surname>Lee</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,</article-title> in <source>Advances in Neural Information Processing Systems 32</source> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>13</fpage>&#x02013;<lpage>23</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Goswami</surname> <given-names>V.</given-names></name> <name><surname>Rohrbach</surname> <given-names>M.</given-names></name> <name><surname>Parikh</surname> <given-names>D.</given-names></name> <name><surname>Lee</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>12-in-1: multi-task vision and language representation learning,</article-title> in <source>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-name>IEEE</publisher-name>), <fpage>10434</fpage>&#x02013;<lpage>10443</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.01045</pub-id></citation>
</ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lynch</surname> <given-names>C.</given-names></name> <name><surname>Sermanet</surname> <given-names>P.</given-names></name></person-group> (<year>2021</year>). <article-title>Language conditioned imitation learning over unstructured data,</article-title> in <source>Robotics: Science and Systems (RSS 2021)</source>. <pub-id pub-id-type="doi">10.15607/RSS.2021.XVII.047</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mao</surname> <given-names>J.</given-names></name> <name><surname>Yao</surname> <given-names>Y.</given-names></name> <name><surname>Heinrich</surname> <given-names>S.</given-names></name> <name><surname>Hinz</surname> <given-names>T.</given-names></name> <name><surname>Weber</surname> <given-names>C.</given-names></name> <name><surname>Wermter</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Bootstrapping knowledge graphs from images and text</article-title>. <source>Front. Neurorobot</source>. <volume>13</volume>:<fpage>93</fpage>. <pub-id pub-id-type="doi">10.3389/fnbot.2019.00093</pub-id><pub-id pub-id-type="pmid">31798437</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Matuszek</surname> <given-names>C.</given-names></name> <name><surname>FitzGerald</surname> <given-names>N.</given-names></name> <name><surname>Zettlemoyer</surname> <given-names>L.</given-names></name> <name><surname>Bo</surname> <given-names>L.</given-names></name> <name><surname>Fox</surname> <given-names>D.</given-names></name></person-group> (<year>2012</year>). <article-title>A joint model of language and perception for grounded attribute learning</article-title>. <source>arXiv preprint arXiv:1206.6423</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1206.6423</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Matuszek</surname> <given-names>C.</given-names></name> <name><surname>Herbst</surname> <given-names>E.</given-names></name> <name><surname>Zettlemoyer</surname> <given-names>L.</given-names></name> <name><surname>Fox</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). <article-title>Learning to parse natural language commands to a robot control system,</article-title> in <source>Experimental Robotics</source>, eds <person-group person-group-type="editor"><name><surname>Desai</surname> <given-names>J.</given-names></name> <name><surname>Dudek</surname> <given-names>G.</given-names></name> <name><surname>Khatib</surname> <given-names>O.</given-names></name> <name><surname>Kumar</surname> <given-names>V.</given-names></name></person-group> (<publisher-loc>Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>403</fpage>&#x02013;<lpage>415</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-00065-7_28</pub-id></citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nayak</surname> <given-names>T.</given-names></name> <name><surname>Majumder</surname> <given-names>N.</given-names></name> <name><surname>Goyal</surname> <given-names>P.</given-names></name> <name><surname>Poria</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Deep neural approaches to relation triplets extraction: a comprehensive survey</article-title>. <source>Cogn. Comput</source>. <volume>13</volume>, <fpage>1215</fpage>&#x02013;<lpage>1232</lpage>. <pub-id pub-id-type="doi">10.1007/s12559-021-09917-7</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>&#x000D6;zdemir</surname> <given-names>O.</given-names></name> <name><surname>Kerzel</surname> <given-names>M.</given-names></name> <name><surname>Wermter</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Embodied language learning with paired variational autoencoders,</article-title> in <source>IEEE International Conference on Development and Learning (ICDL)</source>. <pub-id pub-id-type="doi">10.1109/ICDL49984.2021.9515668</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pathak</surname> <given-names>D.</given-names></name> <name><surname>Agrawal</surname> <given-names>P.</given-names></name> <name><surname>Efros</surname> <given-names>A. A.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <article-title>Curiosity-driven exploration by self-supervised prediction,</article-title> in <source>Proceedings of the 34th International Conference on Machine Learning, ICML 2017</source>, eds <person-group person-group-type="editor"><name><surname>Precup</surname> <given-names>D.</given-names></name> <name><surname>Teh</surname> <given-names>Y. W.</given-names></name></person-group> (<publisher-loc>Sydney, NSW</publisher-loc>), <fpage>2778</fpage>&#x02013;<lpage>2787</lpage>. <pub-id pub-id-type="doi">10.1109/CVPRW.2017.70</pub-id></citation>
</ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Perez</surname> <given-names>E.</given-names></name> <name><surname>Strub</surname> <given-names>F.</given-names></name> <name><surname>de Vries</surname> <given-names>H.</given-names></name> <name><surname>Dumoulin</surname> <given-names>V.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>FiLM: visual reasoning with a general conditioning layer,</article-title> in <source>Thirty-Second AAAI Conference on Artificial Intelligence</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation>
</ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Peyre</surname> <given-names>J.</given-names></name> <name><surname>Sivic</surname> <given-names>J.</given-names></name> <name><surname>Laptev</surname> <given-names>I.</given-names></name> <name><surname>Schmid</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Weakly-supervised learning of visual relations,</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Venice</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5179</fpage>&#x02013;<lpage>5188</lpage>.</citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pramanik</surname> <given-names>S.</given-names></name> <name><surname>Agrawal</surname> <given-names>P.</given-names></name> <name><surname>Hussain</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>OmniNet: a unified architecture for multi-modal multi-task learning</article-title>. <source>arXiv preprint arXiv:1907.07804</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1907.07804</pub-id></citation>
</ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Regier</surname> <given-names>T. P.</given-names></name></person-group> (<year>1992</year>). <source>The Acquisition of Lexical Semantics for Spatial Terms: A Connectionist Model of Perceptual Categorization.</source> Technical Report 62, <publisher-name>University of California</publisher-name>, <publisher-loc>Berkeley, CA</publisher-loc>.</citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>R&#x000F6;der</surname> <given-names>F.</given-names></name> <name><surname>&#x000C3;-zdemir</surname> <given-names>O.</given-names></name> <name><surname>Nguyen</surname> <given-names>D. P.</given-names></name> <name><surname>Wermter</surname> <given-names>S.</given-names></name> <name><surname>Eppe</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>The embodied crossmodal self-forms language and interaction: a computational cognitive review</article-title>. <source>Front. Psychol</source>. <volume>12</volume>:<fpage>3374</fpage>. <pub-id pub-id-type="doi">10.3389/fpsyg.2021.716671</pub-id><pub-id pub-id-type="pmid">34484079</pub-id></citation></ref>
<ref id="B58">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shah</surname> <given-names>D.</given-names></name> <name><surname>Eysenbach</surname> <given-names>B.</given-names></name> <name><surname>Kahn</surname> <given-names>G.</given-names></name> <name><surname>Rhinehart</surname> <given-names>N.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>ViNG: learning open-world navigation with visual goals,</article-title> in <source>IEEE International Conference on Robotics and Automation, ICRA 2021</source> (<publisher-loc>Xi&#x00027;an</publisher-loc>), <fpage>13215</fpage>&#x02013;<lpage>13222</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA48506.2021.9561936</pub-id></citation></ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shao</surname> <given-names>L.</given-names></name> <name><surname>Migimatsu</surname> <given-names>T.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Yang</surname> <given-names>K.</given-names></name> <name><surname>Bohg</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Concept2Robot: learning manipulation concepts from instructions and human demonstrations,</article-title> in <source>Proceedings of Robotics: Science and Systems (RSS)</source> (Virtual). <pub-id pub-id-type="doi">10.15607/RSS.2020.XVI.082</pub-id></citation>
</ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shridhar</surname> <given-names>M.</given-names></name> <name><surname>Hsu</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>Interactive visual grounding of referring expressions for human-robot interaction</article-title>. <source>arXiv preprint arXiv:1806.03831</source>. <pub-id pub-id-type="doi">10.15607/RSS.2018.XIV.028</pub-id><pub-id pub-id-type="pmid">32670046</pub-id></citation></ref>
<ref id="B61">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shridhar</surname> <given-names>M.</given-names></name> <name><surname>Manuelli</surname> <given-names>L.</given-names></name> <name><surname>Fox</surname> <given-names>D.</given-names></name></person-group> (<year>2021</year>). <article-title>CLIPort: what and where pathways for robotic manipulation,</article-title> in <source>Proceedings of the 5th Conference on Robot Learning</source> (<publisher-loc>London</publisher-loc>).</citation>
</ref>
<ref id="B62">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Singh</surname> <given-names>S.</given-names></name> <name><surname>Precup</surname> <given-names>D.</given-names></name> <name><surname>Sutton</surname> <given-names>R. S.</given-names></name></person-group> (<year>2021</year>). <article-title>Reward is enough</article-title>. <source>Artif. Intell</source>. <volume>299</volume>:<fpage>103535</fpage>. <pub-id pub-id-type="doi">10.1016/j.artint.2021.103535</pub-id></citation>
</ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Speer</surname> <given-names>R.</given-names></name> <name><surname>Chin</surname> <given-names>J.</given-names></name> <name><surname>Havasi</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>ConceptNet 5.5: an open multilingual graph of general knowledge,</article-title> in <source>Proceedings of AAAI</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>AAAI</publisher-name>). <pub-id pub-id-type="doi">10.1609/aaai.v31i1.11164</pub-id></citation>
</ref>
<ref id="B64">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Talmor</surname> <given-names>A.</given-names></name> <name><surname>Herzig</surname> <given-names>J.</given-names></name> <name><surname>Lourie</surname> <given-names>N.</given-names></name> <name><surname>Berant</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>CommonsenseQA: a question answering challenge targeting commonsense knowledge,</article-title> in <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source> (<publisher-loc>Minneapolis, MN</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>4149</fpage>&#x02013;<lpage>4158</lpage>.</citation>
</ref>
<ref id="B65">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tan</surname> <given-names>H.</given-names></name> <name><surname>Bansal</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>LXMERT: learning cross-modality encoder representations from transformers,</article-title> in <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source> (<publisher-loc>Hong Kong</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>5100</fpage>&#x02013;<lpage>5111</lpage>. <pub-id pub-id-type="doi">10.18653/v1/D19-1514</pub-id></citation>
</ref>
<ref id="B66">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tellex</surname> <given-names>S.</given-names></name> <name><surname>Gopalan</surname> <given-names>N.</given-names></name> <name><surname>Kress-Gazit</surname> <given-names>H.</given-names></name> <name><surname>Matuszek</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). <article-title>Robots that use language</article-title>. <source>Annu. Rev. Control Robot. Auton. Syst</source>. <volume>3</volume>, <fpage>25</fpage>&#x02013;<lpage>55</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-control-101119-071628</pub-id></citation>
</ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tenbrink</surname> <given-names>T.</given-names></name> <name><surname>Fischer</surname> <given-names>K.</given-names></name> <name><surname>Moratz</surname> <given-names>R.</given-names></name></person-group> (<year>2002</year>). <article-title>Spatial strategies in human-robot communication</article-title>. <source>K&#x000FC;nstl. Intell.</source> <volume>16</volume>, <fpage>19</fpage>&#x02013;<lpage>23</lpage>.</citation></ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Uc-Cetina</surname> <given-names>V.</given-names></name> <name><surname>Navarro-Guerrero</surname> <given-names>N.</given-names></name> <name><surname>Martin-Gonzalez</surname> <given-names>A.</given-names></name> <name><surname>Weber</surname> <given-names>C.</given-names></name> <name><surname>Wermter</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>Survey on reinforcement learning for language processing</article-title>. <source>arXiv preprint arXiv:2104.05565</source>. <pub-id pub-id-type="doi">10.1007/s10462-022-10205-5</pub-id></citation>
</ref>
<ref id="B69">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Varela</surname> <given-names>F. J.</given-names></name> <name><surname>Thompson</surname> <given-names>E.</given-names></name> <name><surname>Rosch</surname> <given-names>E.</given-names></name></person-group> (<year>2017</year>). <source>The Embodied Mind, Revised Edition: Cognitive Science and Human Experience</source>. <publisher-loc>Cambridge, MA; London</publisher-loc>: <publisher-name>MIT Press</publisher-name>. <pub-id pub-id-type="doi">10.7551/mitpress/9780262529365.001.0001</pub-id></citation>
</ref>
<ref id="B70">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>Q.</given-names></name> <name><surname>Teney</surname> <given-names>D.</given-names></name> <name><surname>Wang</surname> <given-names>P.</given-names></name> <name><surname>Shen</surname> <given-names>C.</given-names></name> <name><surname>Dick</surname> <given-names>A.</given-names></name> <name><surname>van den Hengel</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Visual question answering: a survey of methods and datasets</article-title>. <source>Comput. Vis. Image Understand.</source> <volume>163</volume>, <fpage>21</fpage>&#x02013;<lpage>40</lpage>. <pub-id pub-id-type="doi">10.1016/j.cviu.2017.05.001</pub-id></citation>
</ref>
<ref id="B71">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Tamar</surname> <given-names>A.</given-names></name> <name><surname>Russell</surname> <given-names>S.</given-names></name> <name><surname>Gkioxari</surname> <given-names>G.</given-names></name> <name><surname>Tian</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>Bayesian relational memory for semantic visual navigation,</article-title> in <source>Proceedings of the 2019 IEEE International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/ICCV.2019.00286</pub-id></citation>
</ref>
<ref id="B72">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamada</surname> <given-names>T.</given-names></name> <name><surname>Matsunaga</surname> <given-names>H.</given-names></name> <name><surname>Ogata</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Paired recurrent autoencoders for bidirectional translation between robot actions and linguistic descriptions</article-title>. <source>IEEE Robot. Autom. Lett</source>. <volume>3</volume>, <fpage>3441</fpage>&#x02013;<lpage>3448</lpage>. <pub-id pub-id-type="doi">10.1109/LRA.2018.2852838</pub-id></citation>
</ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>J.</given-names></name> <name><surname>Ren</surname> <given-names>Z.</given-names></name> <name><surname>Xu</surname> <given-names>M.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Crandall</surname> <given-names>D.</given-names></name> <name><surname>Parikh</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Embodied amodal recognition: learning to move to perceive objects,</article-title> in <source>2019 IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2040</fpage>&#x02013;<lpage>2050</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2019.00213</pub-id></citation>
</ref>
<ref id="B74">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>K.</given-names></name> <name><surname>Russakovsky</surname> <given-names>O.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>SpatialSense: an adversarially crowdsourced benchmark for spatial relation recognition,</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, <fpage>2051</fpage>&#x02013;<lpage>2060</lpage>.</citation>
</ref>
<ref id="B75">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>W.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name> <name><surname>Gupta</surname> <given-names>A.</given-names></name> <name><surname>Mottaghi</surname> <given-names>R.</given-names></name></person-group> (<year>2019</year>). <article-title>Visual semantic navigation using scene priors,</article-title> in <source>Proceedings of the 7th International Conference on Learning Representations (ICLR)</source>.</citation>
</ref>
<ref id="B76">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yao</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>A.</given-names></name> <name><surname>Han</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>M.</given-names></name> <name><surname>Weber</surname> <given-names>C.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Visual distant supervision for scene graph generation,</article-title> in <source>2021 IEEE International Conference on Computer Vision</source> (<publisher-loc>Virtual</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.01552</pub-id></citation>
</ref>
<ref id="B77">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yi</surname> <given-names>K.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Gan</surname> <given-names>C.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name> <name><surname>Kohli</surname> <given-names>P.</given-names></name> <name><surname>Tenenbaum</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Neural-symbolic VQA: disentangling reasoning from vision and language understanding,</article-title> in <source>Advances in Neural Information Processing Systems 31</source>, eds <person-group person-group-type="editor"><name><surname>Bengio</surname> <given-names>S.</given-names></name> <name><surname>Wallach</surname> <given-names>H.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Grauman</surname> <given-names>K.</given-names></name> <name><surname>Cesa-Bianchi</surname> <given-names>N.</given-names></name> <name><surname>Garnett</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>1031</fpage>&#x02013;<lpage>1042</lpage>.</citation>
</ref>
<ref id="B78">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>K.</given-names></name> <name><surname>Yao</surname> <given-names>Y.</given-names></name> <name><surname>Xie</surname> <given-names>R.</given-names></name> <name><surname>Han</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Open hierarchical relation extraction,</article-title> in <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source> (<publisher-name>Online</publisher-name>) <fpage>5682</fpage>&#x02013;<lpage>5693</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2021.naacl-main.452</pub-id><pub-id pub-id-type="pmid">34604711</pub-id></citation></ref>
<ref id="B79">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>L.</given-names></name> <name><surname>Lyu</surname> <given-names>X.</given-names></name> <name><surname>Song</surname> <given-names>J.</given-names></name> <name><surname>Gao</surname> <given-names>L.</given-names></name></person-group> (<year>2021</year>). <article-title>Guess which? Visual dialog with attentive memory network</article-title>. <source>Pattern Recogn</source>. <volume>114</volume>:<fpage>107823</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2021.107823</pub-id></citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>For example, &#x0201C;the cup <monospace>holds</monospace> a drink&#x0201D; has the implicit spatial meaning that the drink is <monospace>inside</monospace> the cup (cf. Collell et al., <xref ref-type="bibr" rid="B12">2018</xref>).</p></fn>
<fn id="fn0002"><p><sup>2</sup>According to Collell and Moens (<xref ref-type="bibr" rid="B13">2018</xref>), the spatial relations <monospace>left</monospace> and <monospace>right</monospace> make up only &#x0003C;0.1% of the well-known Visual Genome dataset (Krishna et al., <xref ref-type="bibr" rid="B36">2017</xref>).</p></fn>
<fn id="fn0003"><p><sup>3</sup>The use of <monospace>left</monospace> and <monospace>right</monospace> can involve reference objects that are not explicitly mentioned. For example, &#x0201C;slide <monospace>right</monospace> yellow slowly&#x0201D; implies that there is a reference object in the scene (e.g., the pink cube in <xref ref-type="fig" rid="F3">Figure 3</xref>) and the yellow object is&#x02014;as seen by the agent&#x02014;to the <monospace>right</monospace> of the reference object.</p></fn>
<fn id="fn0004"><p><sup>4</sup>For more details on the hyperparameters and the dataset, please refer to &#x000D6;zdemir et al. (<xref ref-type="bibr" rid="B51">2021</xref>).</p></fn>
<fn id="fn0005"><p><sup>5</sup>The code for reproducing the results in this section can be downloaded from <ext-link ext-link-type="uri" xlink:href="https://github.com/knowledgetechnologyuhh/QDRL">https://github.com/knowledgetechnologyuhh/QDRL</ext-link>.</p></fn>
</fn-group>

</back>
</article> 