<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="editorial">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2023.1234920</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Editorial</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Editorial: Multimodal communication and multimodal computing</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Mehler</surname> <given-names>Alexander</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/881802/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>L&#x000FC;cking</surname> <given-names>Andy</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/875334/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Dong</surname> <given-names>Tiansi</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1721939/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Text Technology Lab, Goethe-University Frankfurt</institution>, <addr-line>Frankfurt</addr-line>, <country>Germany</country></aff>
<aff id="aff2"><sup>2</sup><institution>Laboratoire de Linguistique Formelle (LLF), Universit&#x000E9; Paris Cit&#x000E9;</institution>, <addr-line>Paris</addr-line>, <country>France</country></aff>
<aff id="aff3"><sup>3</sup><institution>Neurosymbolic Representation Learning Group, Fraunhofer IAIS</institution>, <addr-line>Sankt Augustin</addr-line>, <country>Germany</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited and reviewed by: Shlomo Engelson Argamon, Illinois Institute of Technology, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Alexander Mehler <email>mehler&#x00040;em.uni-frankfurt.de</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>27</day>
<month>06</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>6</volume>
<elocation-id>1234920</elocation-id>
<history>
<date date-type="received">
<day>05</day>
<month>06</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>09</day>
<month>06</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Mehler, L&#x000FC;cking and Dong.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Mehler, L&#x000FC;cking and Dong</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<related-article id="RA1" related-article-type="commentary-article" xlink:href="https://www.frontiersin.org/research-topics/34588/multimodal-communication-and-multimodal-computing" ext-link-type="uri">Editorial on the Research Topic <article-title>Multimodal communication and multimodal computing</article-title></related-article>
<kwd-group>
<kwd>human-object interactions (HOIs)</kwd>
<kwd>multimodal learning and analytics</kwd>
<kwd>visual-linguistic interaction</kwd>
<kwd>text-image analysis</kwd>
<kwd>unified methodology</kwd>
</kwd-group>
<counts>
<fig-count count="0"/>
<table-count count="0"/>
<equation-count count="0"/>
<ref-count count="39"/>
<page-count count="5"/>
<word-count count="4014"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Language and Computation</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<p>After a successful and text-centered period, AI, computational linguistics, and natural language engineering need to face the &#x0201C;ecological niche&#x0201D; (Holler and Levinson, <xref ref-type="bibr" rid="B18">2019</xref>) of natural language use: <italic>face-to-face interaction</italic>. A particular challenge of human processing in face-to-face interaction is that it is fed by information from the various sense modalities: it is <italic>multimodal</italic>. When talking to each other, we constantly observe and produce information on several channels, such as speech, facial expressions, hand-and-arm gestures, and head movements. Similarly, to learn to drive, we first learn the theory of traffic rules in driving school. After passing the examinations, we practice on the streets, accompanied by an expert sitting beside us. We ask questions and follow instant instructions from this expert. These symbolic traffic rules and instant instructions must be quickly and precisely grounded in the perceived scenes, on the basis of which the learner updates and predicts other cars&#x00027; behaviors and then determines her/his own driving action to avoid potential dangers. As a consequence, multimodal communication needs to be <italic>integrated</italic> (in perception) or <italic>distributed</italic> (in production). This, however, characterizes multimodal computing in general (but see also Parcalabescu et al., <xref ref-type="bibr" rid="B28">2021</xref>). Hence, AI, computational linguistics and natural language engineering that address multimodal communication in face-to-face interaction have to involve multimodal computing, giving rise to the next grand research challenge of those and related fields. This challenge applies to all computational areas which look beyond sentences and texts, ranging from interacting with virtual agents to the creation and exploitation of multimodal datasets for machine learning, as exemplified by the contributions in this Research Topic.</p>
<p>From this perspective, we face several interwoven challenges: On the one hand, AI approaches need to be informed about the principles of multimodal computing to avoid simply transferring the principles of Large Language Models to multimodal computing. On the other hand, it is important that more linguistically motivated approaches do not underestimate the computational reconstructability of multimodal representations. They might otherwise repeat the experience of parts of computational linguistics, which the success of models such as OpenAI&#x00027;s ChatGPT (cf. Wolfram, <xref ref-type="bibr" rid="B39">2023</xref>) confronted with the realization that even higher-order linguistic annotations could be taken over by digital assistants and consequently render the corresponding linguistic modeling work obsolete. Again, the scientific focus on face-to-face communication seems to point to a middle ground. This is because we are dealing with the processing of highly contextualized data whose semantics require recourse to semantic or psycholinguistic concepts such as utterance situation (<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1067125">Sch&#x000FC;z et al.</ext-link>), situation models or mental models (Johnson-Laird, <xref ref-type="bibr" rid="B19">2010</xref>; Ragni and Knauff, <xref ref-type="bibr" rid="B33">2013</xref>; Alfred et al., <xref ref-type="bibr" rid="B1">2020</xref>) or reference to concepts such as grounding (Harnad, <xref ref-type="bibr" rid="B17">1990</xref>), for the automatic reconstruction of which there are not yet adequate computer-based approaches, certainly not on the basis of scenarios such as one-shot or few-shot learning, since the corresponding experiential content is not available as (annotated) mass data. 
The particular moment in which one finds oneself information-theoretically at this point can be formulated as follows: large domains of linguistic and multimodal interactions, if they provide a sufficient number of patterns for association learning, are well manageable with methods based on current neural networks. However, as soon as we go beyond such associative regularities and arrive at a kind of meaning constitution that includes the <italic>about</italic> of communicative interaction, that is, when we are dealing, so to speak, with the alignment of immediate objects and interpretants in the sense of Peirce (<xref ref-type="bibr" rid="B30">1934</xref>) (cf. Gomes et al., <xref ref-type="bibr" rid="B15">2007</xref> for a reference to Peirce in AI), we reach the limits of such models, which have by no means already been explored and which we believe we can identify once again in the area of face-to-face communication. It is obvious that AI models need to complement bottom-up approaches with top-down approaches that start from multimodal situation models grounded in face-to-face communication, or at least from the notion of discourse as put forward by <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1048874">Alikhani et al.</ext-link>, an approach that finds its obvious extension in an approach more oriented to terms of social science (see, for example, <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1125533">Cheema et al.</ext-link>).</p>
<p>From another angle, AI applications are increasingly appearing in complex communication situations or action contexts as quasi-agentive fourth-generation interfaces (Floridi, <xref ref-type="bibr" rid="B11">2014</xref>), which raises the question of their status with respect to the distinction between simulation, emulation, and realization (Pattee, <xref ref-type="bibr" rid="B29">1989</xref>). Looking again at the driving example, the issue here is that AI applications are increasingly applied in real-world contexts, where their use is contextualized each time by corresponding multimodal real-world data, representing a potential grounding-relevant resource that can be re-used for fine-tuning such models or even grounding them. One could object that such an AI agent is nothing more than a simulation, which in principle cannot know anything about its own status. However, such simulations perform under real conditions in interaction with more and more humans in no longer simulatively closed systems [of agent(s) and environment(s)], and this can drive a technological development of these systems in terms of life-long learning, which can ultimately make them appear as <italic>realizations</italic> of <italic>interaction partners</italic>. But here, too, one can ask what the limits of this interaction are, even if it is multimodal. For it is something fundamentally different to process multimodally generated data than to experience it through independent production, of which the notion of telic affordance provides a vivid example, since it is based on people&#x00027;s habits of use, a kind of use that AI systems are mostly incapable of at present. Is it this kind of difference, such as being able to identify a telic affordance either through one&#x00027;s own use or merely by observing data left by uses of human agents, that constitutes one of the limitations implied above? 
Be that as it may, in their paper <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1084740">Henlein et al.</ext-link> explore the question of the learnability of affordances using vision-based AI models, an approach that we argue could also be interpreted as an example of measuring the implied limit(s).</p>
<p>The counter-scenario to agents interacting with us as artificial interactors in real-world environments is a completely virtualized scenario in which both human and artificial agents interact as avatars (see Chalmers, <xref ref-type="bibr" rid="B4">2022</xref>). Here, conversely, it is the human who enters the sphere of simulation, so to speak, rather than the simulation that we encounter as a putative realization. The key research advantage of such settings is that the resulting multimodal data becomes largely amenable to direct digitization and thus automatic analysis. This concerns areas as diverse as speech data, data regarding interaction with objects, lip movement data, facial expression data, eye movement data, head movement data, manual gesture data, body movement data, and (social) space-related behavioral data, as well as (social) distance behavioral data (see Mehler et al., <xref ref-type="bibr" rid="B26">2023</xref> for a corresponding formal data model in the context of VR). Evidently, virtual worlds provide an excellent experimental environment for the study of artificial interaction. This is addressed in the work of <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fcomm.2023.1029157">Nunnemann et al.</ext-link>. It can be seen as an example of the study of grounding issues that directly affect the actors involved and thus relate to the issue of grounding interactions. This raises the broader question of how to advance semantic theories that can be experimentally falsified, as VR systems seem to fit into the paradigm of an Experimental Semiotics (Galantucci and Garrod, <xref ref-type="bibr" rid="B12">2011</xref>) in an exemplary way, a fit that could not have been foreseen even just a few years earlier. 
In other words: in VR, the research strands of face-to-face communication, dialogic communication (<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2022.1029340">Galland et al.</ext-link>), multimodal information processing, grounding in interaction environments that may be equipped with artifacts of a wide variety of affordances, and 4th-order artificial interaction (Floridi, <xref ref-type="bibr" rid="B11">2014</xref>) seem to come together in exemplary fashion, suggesting much further research in this direction in the future. The time is ripe for a fundamental expansion of the empirical base of linguistics and communication studies research that knows how to utilize the possibilities of AI-based systems experimentally for its research purposes, and conversely, for the acquisition of ever more extensive multimodal data for the situation-specific grounding of AI systems, which will ideally no longer rely solely on text windows and wordpiece or subword analogies (Song et al., <xref ref-type="bibr" rid="B35">2021</xref>) (cf. the <italic>Bag-of-Visual Words</italic> approach of Bruni et al., <xref ref-type="bibr" rid="B3">2014</xref>) to infer the putative underlying semantics from the associations shadowed in the character strings observable by means of these windows. At present, it is unclear how far this line of research has developed, or to what extent alternatives to the current greedy segmentation models or tokenizers are already emerging that can also identify multimodal ensembles as recurrent data units. Nevertheless, as in the case of transformers (Devlin et al., <xref ref-type="bibr" rid="B6">2019</xref>), this line of research can point to a worthwhile direction for development.</p>
<p>A crucial part of the multimodal challenge is to address the question of how to assemble, let alone parse, multimodal representations. A successful multimodal system should unify representations from different channels. The fundamental challenge is to merge the two complementary modals, namely, the neural modal and the symbolic modal, and to be capable of solving problems from both perspectives (Dinsmore, <xref ref-type="bibr" rid="B7">1992</xref>). Geometrical structure has been advocated as a potential cognitive representation besides symbols and neural networks (G&#x000E4;rdenfors, <xref ref-type="bibr" rid="B13">2000</xref>). A recent geometric approach precisely unified large symbolic tree structures with pre-trained vector embeddings (Dong, <xref ref-type="bibr" rid="B8">2021</xref>), which opens a new door for symbolic structures to have precise neural representations and may potentially close the gap between the neural modal and the symbolic modal (Bechtel and Abrahamsen, <xref ref-type="bibr" rid="B2">2002</xref>; Dong et al., <xref ref-type="bibr" rid="B9">2022</xref>; Sun, <xref ref-type="bibr" rid="B36">2023</xref>).</p>
<p>Multimodal representations can be compared to musical scores where the different &#x0201C;voices&#x0201D; co-occur and may (or may not) be tied together by relevance (L&#x000FC;cking and Ginzburg, <xref ref-type="bibr" rid="B24">2023</xref>) (see Mehler and L&#x000FC;cking, <xref ref-type="bibr" rid="B27">2009</xref> for an example and a formalization of such kinds of representations). In this respect, McNeill (<xref ref-type="bibr" rid="B25">1992</xref>) and Kendon (<xref ref-type="bibr" rid="B20">2004</xref>) have shown in seminal works that manual gesture and speech form unified messages, but without specifying systematic, computational means for analyzing multimodal utterances. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1048874">Alikhani et al.</ext-link> argue in their contribution &#x0201C;<italic>Image&#x02013;text coherence and its implications for multimodal AI</italic>&#x0201D; that the appropriate level for processing multimodal representations in AI is the level of <italic>discourse</italic>. Using the example of image&#x02013;text pairs, they apply <italic>coherence theory</italic> to capture the structural, logical and purposeful relationships between images and their captions. Using a dataset of image&#x02013;text coherence relations, the authors examine whether simple coherence markers are accounted for in two pre-trained multimodal language models, CLIP (Radford et al., <xref ref-type="bibr" rid="B32">2021</xref>) and ViLBERT (Lu et al., <xref ref-type="bibr" rid="B23">2019</xref>). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1048874">Alikhani et al.</ext-link> move on to use these results to critique and improve the architecture of machine learning models, and to develop coherence-based evaluations of multimodal AI systems.</p>
<p>Image&#x02013;text relations are also investigated by <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1125533">Cheema et al.</ext-link>. The authors focus on the relation between images and texts in the setting of news. They propose directions for multimodal learning and analytics in social sciences. Taking a largely semiotic perspective, the authors bring together news value analysis of news media from both a production and reception perspective, and the multimodality of news articles in terms of image&#x02013;text relations which, in a way related to the coherence-driven approach by <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1048874">Alikhani et al.</ext-link>, go beyond mere captions. The framework is applied to a couple of examples and is intended to shape larger-scale machine learning applications in the context of multimodal media analysis, as exemplified by means of a number of potential use cases.</p>
<p>Turning from two-dimensional pictures to objects within virtual reality, <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1084740">Henlein et al.</ext-link> present their research on Human-Object Interaction (HOI) and augment the HICO-DET dataset (Chao et al., <xref ref-type="bibr" rid="B5">2018</xref>) to distinguish Gibsonian (Gibson, <xref ref-type="bibr" rid="B14">1979</xref>, Chap. 8) affordances (actions to which objects &#x0201C;invite&#x0201D;) and telic affordances (objects&#x00027; conventionalized purposes) (Pustejovsky, <xref ref-type="bibr" rid="B31">2013</xref>). They successfully train the computational model AffordanceUPT on their extended resource and show that it is able to distinguish intentional use from Gibsonian exploitation, even for new objects. Hence, <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1084740">Henlein et al.</ext-link> contribute to a better understanding of clustering of objects according to their action potentials, in particular a clustering between perceptual features and intention recognition.</p>
<p>(Virtual) Objects and characters are potential referents in human&#x02013;human and human&#x02013;computer interaction. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fcomm.2023.1029157">Nunnemann et al.</ext-link> investigate &#x0201C;<italic>The effects of referential gaze in spoken language comprehension: human speaker vs. virtual agent listener gaze</italic>&#x0201D;. Hence, they address multimodal computing at the interface of human and artificial communication: On the one hand, people are known to respond to virtual agent gaze (Ruhland et al., <xref ref-type="bibr" rid="B34">2015</xref>). On the other hand, during referential processing, eye movements to objects in joint visual scenes are closely time-locked to the referring words used to describe those scenes (Eberhard et al., <xref ref-type="bibr" rid="B10">1995</xref>). Using eye-tracking methods, <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fcomm.2023.1029157">Nunnemann et al.</ext-link> compared the influence of human speaker gaze to that of virtual agent listener gaze in sentence verification tasks. While they could replicate findings that participants draw on human speaker gaze, they found that participants do not rely on the gaze of the virtual agent. Thus, the study hints at important directions in the creation of and interaction with virtual agents, pointing out the influence of the communicative role of virtual agents (i.e., speaker vs. hearer) and potentially the need for a Theory of Mind (Kr&#x000E4;mer, <xref ref-type="bibr" rid="B21">2005</xref>).</p>
<p>While gaze can be used for establishing reference (in particular in dangerous situations, see Hadjikhani et al., <xref ref-type="bibr" rid="B16">2008</xref>), the most important linguistic devices for referring are verbal referring expressions. The form of these referring expressions is adapted to the utterance situation: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2023.1067125">Sch&#x000FC;z et al.</ext-link> discuss the representation problem in the sub-field of Referring Expression Generation (REG), where expressions depend on context. They provide a systematic review of a variety of visual contexts and approaches to REG, and strongly argue for an integrated or unified perspective or methodology. The focus is on different input modalities and how they shape the information that is needed for successful reference (i.e., enabling the addressee to single out the intended object), thereby complementing and going beyond established research on multimodal deictic output (e.g., Kranstedt et al., <xref ref-type="bibr" rid="B22">2006</xref>; van der Sluis and Krahmer, <xref ref-type="bibr" rid="B38">2007</xref>).</p>
<p>In conversation, interlocutors exhibit conversational strategies or styles (Tannen, <xref ref-type="bibr" rid="B37">1981</xref>). <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frai.2022.1029340">Galland et al.</ext-link> explore communicative preferences in the context of human-computer interaction in terms of task-oriented and socially-oriented dialog acts. By utilizing reinforcement learning, they train an artificial agent to adapt its strategy to meet the preference of a human user by combining task-oriented and socially-oriented dialog acts. This is achieved by combining four components: an engagement estimator (mainly based on the user&#x00027;s non-verbal behavior), a topic manager (keeping track of the user&#x00027;s favorite topics), a conversational preferences estimator (estimating the user&#x00027;s task/social preference at each turn), and a dialog manager (selecting the most appropriate turn according to the artificial agent&#x00027;s user model). Subjective experiments involving over 100 participants show a cross-modal influence: adapting to a user&#x00027;s preferred conversational strategy or style affects the human&#x00027;s perception and increases user engagement.</p>
<p>The Research Topic <italic>Multimodal communication and multimodal computing</italic> comprises six different contributions that highlight different areas and challenges of the interplay between communication and computing, as they have emerged not only due to the recent rapid development of AI methods. What unites these contributions is their common focus on multimodality, which, however, they treat from very different perspectives: be it in terms of text-image relations, the affordances detectable through images, the interaction between humans and artificial agents, or the specific status of referring expressions in spoken language comprehension. From a methodological perspective, these approaches are interesting because they redirect the AI focus from Big Data to Small or even Tiny Data, massively emphasizing the situatedness of communication in its multiple multimodal manifestations. What we ultimately lack, however, is an approach that integrates these heterogeneous research directions and their underlying distributed data resources to ground a more comprehensive multimodal semantics in a final joint research effort by linguistics, computational linguistics, and computer science&#x02014;before this will all be taken over by AI agents.</p>
<sec sec-type="author-contributions" id="s1">
<title>Author contributions</title>
<p>This Research Topic on <italic>Multimodal communication and multimodal computing</italic> was proposed by AM, AL, and TD. The editors worked collaboratively to decide which potential authors to invite and which papers were accepted or rejected. The individual manuscripts were subject to review by the corresponding handling editor as well as peer reviewers. This editorial was drafted by AM, AL, and TD. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec sec-type="funding-information" id="s2">
<title>Funding</title>
<p>The contribution of AL and AM was partly supported by the German Research Foundation (DFG, project number 502018965), and the contribution of TD was partly supported by the Federal Ministry of Education and Research of Germany as part of the Competence Center for Machine Learning ML2R (01IS18038C).</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s3">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alfred</surname> <given-names>K. L.</given-names></name> <name><surname>Connolly</surname> <given-names>A. C.</given-names></name> <name><surname>Cetron</surname> <given-names>J. S.</given-names></name> <name><surname>Kraemer</surname> <given-names>D. J. M.</given-names></name></person-group> (<year>2020</year>). <article-title>Mental models use common neural spatial structure for spatial and abstract content</article-title>. <source>Commun. Biol</source>. 3, 17. <pub-id pub-id-type="doi">10.1038/s42003-019-0740-8</pub-id><pub-id pub-id-type="pmid">31925291</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bechtel</surname> <given-names>W.</given-names></name> <name><surname>Abrahamsen</surname> <given-names>A.</given-names></name></person-group> (<year>2002</year>). <source>Connectionism and the Mind: Parallel Processing, Dynamics, and Evolution in Networks</source>. <publisher-loc>Hong Kong</publisher-loc>: <publisher-name>Graphicraft Ltd</publisher-name>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bruni</surname> <given-names>E.</given-names></name> <name><surname>Tran</surname> <given-names>N.-K.</given-names></name> <name><surname>Baroni</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Multimodal distributional semantics</article-title>. <source>J. Artif. Intell. Res</source>. <volume>49</volume>, <fpage>1</fpage>&#x02013;<lpage>47</lpage>. <pub-id pub-id-type="doi">10.1613/jair.4135</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chalmers</surname> <given-names>D. J.</given-names></name></person-group> (<year>2022</year>). <source>Reality&#x0002B;: Virtual Worlds and the Problems of Philosophy</source>. <publisher-name>Allen Lane</publisher-name>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chao</surname> <given-names>Y.-W.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Zeng</surname> <given-names>H.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Learning to detect human-object interactions,&#x0201D;</article-title> in <source>2018 IEEE Winter Conference on Applications of Computer Vision, WACV</source> 381&#x02013;389. <pub-id pub-id-type="doi">10.1109/WACV.2018.00048</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Devlin</surname> <given-names>J.</given-names></name> <name><surname>Chang</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>K.</given-names></name> <name><surname>Toutanova</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;BERT: pre-training of deep bidirectional transformers for language understanding,&#x0201D;</article-title> in <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</source> (<publisher-loc>Minneapolis, MN, USA</publisher-loc>) <fpage>4171</fpage>&#x02013;<lpage>4186</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dinsmore</surname> <given-names>J.</given-names></name></person-group> (<year>1992</year>). <article-title>&#x0201C;Thunder in the gap,&#x0201D;</article-title> in <source>The Symbolic and Connectionist Paradigms: Closing the Gap</source> (<publisher-loc>Erlbaum</publisher-loc>) <fpage>1</fpage>&#x02013;<lpage>23</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>T.</given-names></name></person-group> (<year>2021</year>). <source>A Geometric Approach to the Unification of Symbolic Structures and Neural Networks</source>. <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer-Nature.</publisher-name> <pub-id pub-id-type="doi">10.1007/978-3-030-56275-5</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>T.</given-names></name> <name><surname>Rettinger</surname> <given-names>A.</given-names></name> <name><surname>Tang</surname> <given-names>J.</given-names></name> <name><surname>Tversky</surname> <given-names>B.</given-names></name> <name><surname>van Harmelen</surname> <given-names>F.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Structure and Learning (Dagstuhl Seminar 21362),&#x0201D;</article-title> in <source>Dagstuhl Reports</source> (<publisher-loc>Schloss Dagstuhl</publisher-loc>: <publisher-name>Leibniz-Zentrum f&#x000FC;r Informatik</publisher-name>) <fpage>11</fpage>&#x02013;<lpage>34</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eberhard</surname> <given-names>K. M.</given-names></name> <name><surname>Spivey-Knowlton</surname> <given-names>M. J.</given-names></name> <name><surname>Sedivy</surname> <given-names>J. C.</given-names></name> <name><surname>Tanenhaus</surname> <given-names>M. K.</given-names></name></person-group> (<year>1995</year>). <article-title>Eye movements as a window into real-time spoken language comprehension in natural contexts</article-title>. <source>J. Psycholinguist. Res</source>. <volume>24</volume>, <fpage>409</fpage>&#x02013;<lpage>436</lpage>. <pub-id pub-id-type="doi">10.1007/BF02143160</pub-id><pub-id pub-id-type="pmid">8531168</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Floridi</surname> <given-names>L.</given-names></name></person-group> (<year>2014</year>). <source>The Fourth Revolution. How the Infosphere is Reshaping Human Reality</source>. <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Galantucci</surname> <given-names>B.</given-names></name> <name><surname>Garrod</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>Experimental semiotics: a review</article-title>. <source>Front. Human Neurosci</source>. <volume>5</volume>, <fpage>1</fpage>&#x02013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.3389/fnhum.2011.00011</pub-id><pub-id pub-id-type="pmid">21369364</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>G&#x000E4;rdenfors</surname> <given-names>P.</given-names></name></person-group> (<year>2000</year>). <source>Conceptual Spaces: The Geometry of Thought</source>. <publisher-loc>Cambridge, Massachusetts, USA</publisher-loc>: <publisher-name>MIT Press</publisher-name>. <pub-id pub-id-type="doi">10.7551/mitpress/2076.001.0001</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gibson</surname> <given-names>J. J.</given-names></name></person-group> (<year>1979</year>). <source>The Ecological Approach to Visual Perception</source>. <publisher-loc>Boston</publisher-loc>: <publisher-name>Houghton Mifflin</publisher-name>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gomes</surname> <given-names>A.</given-names></name> <name><surname>Gudwin</surname> <given-names>R.</given-names></name> <name><surname>Ni&#x000F1;o El-Hani</surname> <given-names>C.</given-names></name> <name><surname>Queiroz</surname> <given-names>J.</given-names></name></person-group> (<year>2007</year>). <article-title>Towards the emergence of meaning processes in computers from Peircean semiotics</article-title>. <source>Mind Soc</source>. <volume>6</volume>, <fpage>173</fpage>&#x02013;<lpage>187</lpage>. <pub-id pub-id-type="doi">10.1007/s11299-007-0031-9</pub-id><pub-id pub-id-type="pmid">28939326</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hadjikhani</surname> <given-names>N.</given-names></name> <name><surname>Hoge</surname> <given-names>R.</given-names></name> <name><surname>Snyder</surname> <given-names>J.</given-names></name> <name><surname>de Gelder</surname> <given-names>B.</given-names></name></person-group> (<year>2008</year>). <article-title>Pointing with the eyes: The role of gaze in communicating danger</article-title>. <source>Brain Cogn</source>. <volume>68</volume>, <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1016/j.bandc.2008.01.008</pub-id><pub-id pub-id-type="pmid">18586370</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harnad</surname> <given-names>S.</given-names></name></person-group> (<year>1990</year>). <article-title>The symbol grounding problem</article-title>. <source>Physica D</source>. <volume>42</volume>, <fpage>335</fpage>&#x02013;<lpage>346</lpage>. <pub-id pub-id-type="doi">10.1016/0167-2789(90)90087-6</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Holler</surname> <given-names>J.</given-names></name> <name><surname>Levinson</surname> <given-names>S. C.</given-names></name></person-group> (<year>2019</year>). <article-title>Multimodal language processing in human communication</article-title>. <source>Trends Cogn. Sci</source>. <volume>23</volume>, <fpage>639</fpage>&#x02013;<lpage>652</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2019.05.006</pub-id><pub-id pub-id-type="pmid">31235320</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Johnson-Laird</surname> <given-names>P. N.</given-names></name></person-group> (<year>2010</year>). <article-title>Mental models and human reasoning</article-title>. <source>PNAS</source> <volume>107</volume>, <fpage>18243</fpage>&#x02013;<lpage>18250</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1012933107</pub-id><pub-id pub-id-type="pmid">20956326</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kendon</surname> <given-names>A.</given-names></name></person-group> (<year>2004</year>). <source>Gesture: Visible Action as Utterance</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>. <pub-id pub-id-type="doi">10.1017/CBO9780511807572</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kr&#x000E4;mer</surname> <given-names>N. C.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x0201C;Theory of mind as a theoretical prerequisite to model communication with virtual humans,&#x0201D;</article-title> in <source>Modeling Communication with Robots and Virtual Humans</source>, eds. I., Wachsmuth, and G., Knoblich (<publisher-loc>Berlin and Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>) <fpage>222</fpage>&#x02013;<lpage>240</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-540-79037-2_12</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kranstedt</surname> <given-names>A.</given-names></name> <name><surname>L&#x000FC;cking</surname> <given-names>A.</given-names></name> <name><surname>Pfeiffer</surname> <given-names>T.</given-names></name> <name><surname>Rieser</surname> <given-names>H.</given-names></name> <name><surname>Wachsmuth</surname> <given-names>I.</given-names></name></person-group> (<year>2006</year>). <article-title>&#x0201C;Deictic object reference in task-oriented dialogue,&#x0201D;</article-title> in <source>Situated Communication</source>, eds. G., Rickheit, and I., Wachsmuth (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Mouton de Gruyter</publisher-name>) <fpage>155</fpage>&#x02013;<lpage>207</lpage>. <pub-id pub-id-type="doi">10.1515/9783110197747.155</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Batra</surname> <given-names>D.</given-names></name> <name><surname>Parikh</surname> <given-names>D.</given-names></name> <name><surname>Lee</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, eds. H., Wallach, H., Larochelle, A., Beygelzimer, F., d&#x00027;Alch&#x000E9;-Buc, E., Fox, and R., Garnett (<publisher-loc>Red Hook, NY</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>).</citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>L&#x000FC;cking</surname> <given-names>A.</given-names></name> <name><surname>Ginzburg</surname> <given-names>J.</given-names></name></person-group> (<year>2023</year>). <article-title>Leading voices: Dialogue semantics, cognitive science, and the polyphonic structure of multimodal interaction</article-title>. <source>Langu. Cogn</source>. <volume>15</volume>, <fpage>148</fpage>&#x02013;<lpage>172</lpage>. <pub-id pub-id-type="doi">10.1017/langcog.2022.30</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McNeill</surname> <given-names>D.</given-names></name></person-group> (<year>1992</year>). <source>Hand and Mind: What Gestures Reveal about Thought</source>. <publisher-loc>Chicago</publisher-loc>: <publisher-name>University of Chicago Press</publisher-name>.</citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mehler</surname> <given-names>A.</given-names></name> <name><surname>Bagci</surname> <given-names>M.</given-names></name> <name><surname>Henlein</surname> <given-names>A.</given-names></name> <name><surname>Abrami</surname> <given-names>G.</given-names></name> <name><surname>Spiekermann</surname> <given-names>C.</given-names></name> <name><surname>Schrottenbacher</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>&#x0201C;A multimodal data model for simulation-based learning with Va.Si.Li-Lab,&#x0201D;</article-title> in <source>Proceedings of HCI International 2023, Lecture Notes in Computer Science</source> (<publisher-loc>Springer</publisher-loc>).</citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mehler</surname> <given-names>A.</given-names></name> <name><surname>L&#x000FC;cking</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;A structural model of semiotic alignment: The classification of multimodal ensembles as a novel machine learning task,&#x0201D;</article-title> in <source>AFRICON 2009</source> (<publisher-loc>IEEE</publisher-loc>) <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/AFRCON.2009.5308098</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Parcalabescu</surname> <given-names>L.</given-names></name> <name><surname>Trost</surname> <given-names>N.</given-names></name> <name><surname>Frank</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;What is multimodality?&#x0201D;</article-title> in <source>Proceedings of the 1st Workshop on Multimodal Semantic Representations, MMSR</source> (<publisher-loc>Groningen, Netherlands</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>).</citation>
</ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pattee</surname> <given-names>H. H.</given-names></name></person-group> (<year>1989</year>). <article-title>&#x0201C;Simulations, realizations, and theories of life,&#x0201D;</article-title> in <source>Artificial Life. SFI Studies in the Sciences of Complexity</source>, ed. C. G., Langton (<publisher-loc>Boston</publisher-loc>: <publisher-name>Addison-Wesley</publisher-name>) <fpage>63</fpage>&#x02013;<lpage>77</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Peirce</surname> <given-names>C. S.</given-names></name></person-group> (<year>1934</year>). <source>Collected Papers: Pragmatism and Pragmaticism, volume 5</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>Harvard University Press</publisher-name>.</citation>
</ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pustejovsky</surname> <given-names>J.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;Dynamic event structure and habitat theory,&#x0201D;</article-title> in <source>Proceedings of the 6th International Conference on Generative Approaches to the Lexicon, GL2013</source> (<publisher-loc>Pisa, Italy</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>) <fpage>1</fpage>&#x02013;<lpage>10</lpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Radford</surname> <given-names>A.</given-names></name> <name><surname>Kim</surname> <given-names>J. W.</given-names></name> <name><surname>Hallacy</surname> <given-names>C.</given-names></name> <name><surname>Ramesh</surname> <given-names>A.</given-names></name> <name><surname>Goh</surname> <given-names>G.</given-names></name> <name><surname>Agarwal</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Learning transferable visual models from natural language supervision,&#x0201D;</article-title> in <source>Proceedings of the 38th International Conference on Machine Learning</source>, eds. M., Meila, and T., Zhang (PMLR) <fpage>8748</fpage>&#x02013;<lpage>8763</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ragni</surname> <given-names>M.</given-names></name> <name><surname>Knauff</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>A theory and a computational model of spatial reasoning with preferred mental models</article-title>. <source>Psychol. Rev</source>. <volume>120</volume>, <fpage>561</fpage>&#x02013;<lpage>588</lpage>. <pub-id pub-id-type="doi">10.1037/a0032460</pub-id><pub-id pub-id-type="pmid">23750832</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ruhland</surname> <given-names>K.</given-names></name> <name><surname>Peters</surname> <given-names>C. E.</given-names></name> <name><surname>Andrist</surname> <given-names>S.</given-names></name> <name><surname>Badler</surname> <given-names>J. B.</given-names></name> <name><surname>Badler</surname> <given-names>N. I.</given-names></name> <name><surname>Gleicher</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>A review of eye gaze in virtual agents, social robotics and HCI: Behaviour generation, user interaction and perception</article-title>. <source>Comput. Graph. Forum</source> <volume>34</volume>, <fpage>299</fpage>&#x02013;<lpage>326</lpage>. <pub-id pub-id-type="doi">10.1111/cgf.12603</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>X.</given-names></name> <name><surname>Salcianu</surname> <given-names>A.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Dopson</surname> <given-names>D.</given-names></name> <name><surname>Zhou</surname> <given-names>D.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Fast WordPiece tokenization,&#x0201D;</article-title> in <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source> (<publisher-loc>Punta Cana, Dominican Republic</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>) <fpage>2089</fpage>&#x02013;<lpage>2103</lpage>. <pub-id pub-id-type="doi">10.18653/v1/2021.emnlp-main.160</pub-id><pub-id pub-id-type="pmid">36568019</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>R.</given-names></name></person-group> (<year>2023</year>). <source>The Cambridge Handbook of Computational Cognitive Sciences</source>. 2nd edition. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>. <pub-id pub-id-type="doi">10.1017/9781108755610</pub-id><pub-id pub-id-type="pmid">33441542</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tannen</surname> <given-names>D.</given-names></name></person-group> (<year>1981</year>). <article-title>Indirectness in discourse: Ethnicity as conversational style</article-title>. <source>Disc. Process</source>. <volume>4</volume>, <fpage>221</fpage>&#x02013;<lpage>238</lpage>. <pub-id pub-id-type="doi">10.1080/01638538109544517</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>van der Sluis</surname> <given-names>I.</given-names></name> <name><surname>Krahmer</surname> <given-names>E.</given-names></name></person-group> (<year>2007</year>). <article-title>Generating multimodal references</article-title>. <source>Disc. Process</source>. <volume>44</volume>, <fpage>145</fpage>&#x02013;<lpage>174</lpage>. <pub-id pub-id-type="doi">10.1080/01638530701600755</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wolfram</surname> <given-names>S.</given-names></name></person-group> (<year>2023</year>). <source>What Is ChatGPT Doing &#x02026; and Why Does It Work?</source> <publisher-loc>Champaign, IL</publisher-loc>: <publisher-name>Wolfram Media, Inc.</publisher-name></citation>
</ref>
</ref-list> 
</back>
</article> 