<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2021.741148</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Multimodal User Feedback During Adaptive Robot-Human Presentations</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Axelsson</surname> <given-names>Agnes</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1255727/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Skantze</surname> <given-names>Gabriel</given-names></name>
<xref ref-type="corresp" rid="c002"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/747537/overview"/>
</contrib>
</contrib-group>
<aff><institution>Department of Speech, Music and Hearing (TMH), KTH Royal Institute of Technology</institution>, <addr-line>Stockholm</addr-line>, <country>Sweden</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Hung-Hsuan Huang, The University of Fukuchiyama, Japan</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Raja Jamilah Raja Yusof, University of Malaya, Malaysia; Paul Adam Bremner, University of the West of England, United Kingdom</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Agnes Axelsson <email>agnaxe&#x00040;kth.se</email></corresp>
<corresp id="c002">Gabriel Skantze <email>skantze&#x00040;kth.se</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Human-Media Interaction, a section of the journal Frontiers in Computer Science</p></fn></author-notes>
<pub-date pub-type="epub">
<day>03</day>
<month>01</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>3</volume>
<elocation-id>741148</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>07</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>12</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Axelsson and Skantze.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Axelsson and Skantze</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Feedback is an essential part of all communication, and agents communicating with humans must be able to both give and receive feedback in order to ensure mutual understanding. In this paper, we analyse multimodal feedback given by humans towards a robot that is presenting a piece of art in a shared environment, similar to a museum setting. The data analysed contains both video and audio recordings of 28 participants, and the data has been richly annotated both in terms of multimodal cues (speech, gaze, head gestures, facial expressions, and body pose) and in terms of the polarity of any feedback (negative, positive, or neutral). We train statistical and machine learning models on the dataset, and find that random forest models and multinomial regression models perform well on predicting the polarity of the participants&#x00027; reactions. An analysis of the different modalities shows that most information is found in the participants&#x00027; speech and head gestures, while much less information is found in their facial expressions, body pose and gaze. An analysis of the timing of the feedback shows that most feedback is given when the robot pauses (and thereby invites feedback), but that the more exact timing of the feedback does not affect its meaning.</p></abstract>
<kwd-group>
<kwd>feedback</kwd>
<kwd>presentation</kwd>
<kwd>agent</kwd>
<kwd>robot</kwd>
<kwd>grounding</kwd>
<kwd>polarity</kwd>
<kwd>backchannel</kwd>
<kwd>multimodal</kwd>
</kwd-group>
<contract-sponsor id="cn001">Stiftelsen f&#x000F6;r&#x000A0;Strategisk Forskning<named-content content-type="fundref-id">10.13039/501100001729</named-content></contract-sponsor>
<counts>
<fig-count count="8"/>
<table-count count="4"/>
<equation-count count="0"/>
<ref-count count="100"/>
<page-count count="22"/>
<word-count count="17911"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Agents communicating with humans must be able to both give and receive communicative feedback in order to ensure mutual understanding (Clark, <xref ref-type="bibr" rid="B20">1996</xref>). While there has been a lot of work on how conversational agents should be able to provide feedback at appropriate places (Ward and Tsukahara, <xref ref-type="bibr" rid="B91">2000</xref>; Poppe et al., <xref ref-type="bibr" rid="B70">2010</xref>; Gravano and Hirschberg, <xref ref-type="bibr" rid="B30">2011</xref>), less work has been done on how to pick up and interpret feedback coming from the user. To investigate this, we have in previous work explored the scenario of a robot presenting a piece of art in a shared environment, similar to a museum setting (Axelsson and Skantze, <xref ref-type="bibr" rid="B5">2019</xref>, <xref ref-type="bibr" rid="B6">2020</xref>). In such settings, the presenter can be expected to have the turn the majority of the time, while the listener (the audience) provides positive and negative feedback to the presenter. In our previous work, we have shown how the agent can use such feedback to adapt the presentation in real time, using behaviour trees and knowledge graphs to model the grounding status of the information presented (Axelsson and Skantze, <xref ref-type="bibr" rid="B6">2020</xref>), and that an agent that adapts its presentation according to the feedback it receives is preferred by users (Axelsson and Skantze, <xref ref-type="bibr" rid="B5">2019</xref>). However, since we have so far only evaluated the system using either a Wizard of Oz setup (Axelsson and Skantze, <xref ref-type="bibr" rid="B5">2019</xref>) or with simulated users (Axelsson and Skantze, <xref ref-type="bibr" rid="B6">2020</xref>), we have not yet addressed the question of how feedback in this setting could be automatically identified and classified. Identifying feedback is an important first step in creating an intelligent virtual agent for presentation. If a listener-aware system can identify and classify signals used by its audience as positive or negative, it opens up the possibility of using that classified feedback to create a highly adaptive, engaging agent.</p>
<p>A face-to-face setting, such as the one explored here, provides a wide range of different ways of expressing feedback across different modalities, including speech, head nods, facial expressions, gaze and body pose. Some of these modalities are harder to pick up and process by the agent than others, and some modalities might be complementary or redundant in terms of which information they carry. This calls for a more thorough analysis of how feedback is actually expressed in a robot-human presentation scenario, and what modalities are most important to process. In this article, we analyse a dataset of human-robot interactions recorded using a Wizard of Oz setup, to find out how humans spontaneously produce feedback toward a robot in these different modalities. Whereas much research on social signal processing is based on automatic analysis of the audio and video signals, we base this analysis on manually annotated features in the recorded data. This way, we can make sure that the findings are not dependent on the accuracy of any automatic detection of feedback signals.</p>
<p>Throughout this paper, we will look for answers to these questions:</p>
<list list-type="order">
<list-item><p>What modalities are most commonly used to convey negative and positive feedback?</p></list-item>
<list-item><p>Are any modalities redundant or complementary when it comes to expressing positive and negative feedback?</p></list-item>
<list-item><p>Does the interpretation of feedback as positive or negative change based on its timing relative to other feedback and to the statement being reacted to?</p></list-item>
<list-item><p>Are there individual differences in the use of modalities to communicate different polarities of feedback?</p></list-item>
</list>
<p>This article is structured as follows. In section 2, we describe recent and foundational work in the field of human-to-agent feedback, and feedback between humans in general. In section 3, we describe how we collected and annotated the dataset used in this paper, and introduce the statistical models we use. Section 4 describes statistical patterns we found in the data as well as work on statistical models for approximating positivity and negativity based on multimodal signals. Section 5 is a discussion where we try to answer the questions listed above, and in section 6 we conclude and summarise the findings. The data we used for our study and analysis have been published; see section Data Availability Statement at the end of this paper.</p>
</sec>
<sec id="s2">
<title>2. Related Work</title>
<sec>
<title>2.1. Presentation Agents</title>
<p>Presentation agents (i.e., agents that present information to humans) can be seen as a sub-domain of conversational agents. However, whereas the initiative in general conversation can be mixed, the agent doing the presentation is expected to have the initiative most of the time, while paying attention to the listener&#x00027;s level of understanding or engagement. Kuno et al. (<xref ref-type="bibr" rid="B48">2007</xref>) found that mutual gaze and co-occurring nods were important indicators of an audience&#x00027;s engagement with a robot&#x00027;s presentation in a museum scenario. Recently, Velentza et al. (<xref ref-type="bibr" rid="B88">2020</xref>) found that a pair of presenting robots were more engaging than a single robot, but the embodiment of their robots&#x02014;an Android tablet on top of a tripod&#x02014;may make their results hard to apply to other, more embodied, scenarios. Iio et al. (<xref ref-type="bibr" rid="B38">2020</xref>) have shown that it is technologically feasible to have a robot walking around an enclosed exhibition at a museum. They present a robot that can identify individual visitors and use this identity information to adapt what the robot says, but the system does not attend to feedback from the users, beyond letting them walk away if they are not interested in the presentation.</p>
<p>Another space where embodied presentation agents are used is the field of robot teachers. An argument in favour of robot teachers, proposed by Werfel (<xref ref-type="bibr" rid="B92">2014</xref>), is that they can be adaptive to the individual student while being as appealing as a human teacher to interact with. A typical teaching agent is the RoboThespian robot by Verner et al. (<xref ref-type="bibr" rid="B89">2016</xref>), which adapts to student responses purely in terms of which answer they choose on multiple-choice questions; this is not an adaptive social agent, but rather a type of branching dialogue management. Tozadore and Romero (<xref ref-type="bibr" rid="B87">2020</xref>) presented a framework for how a virtual teaching agent can choose which questions to ask of a student depending on multimodal features&#x02014;on a high level, the same types of right-or-wrong evaluations as Verner et al. (<xref ref-type="bibr" rid="B89">2016</xref>), but also more low-level features like facial features and attention estimated from gaze direction. Multimodal approaches for sensing student feedback in a school scenario do exist, even if they are not directly connected to a teaching robot; Goldberg et al. (<xref ref-type="bibr" rid="B28">2021</xref>) have presented proposals for multimodal machine-learning approaches for estimating individual students&#x00027; engagement in classroom scenarios, but the approach may not extend beyond the specific setting.</p>
<p>A large body of work in connection to presentation agents relates to their ability to position themselves relative to their audience in an actual museum scenario. While this is an important part of implementing an actual agent in the field, this track of research does not address the grounding of presented content toward the audience (Nourbakhsh et al., <xref ref-type="bibr" rid="B61">2003</xref>; Kuzuoka et al., <xref ref-type="bibr" rid="B49">2010</xref>; Yamaoka et al., <xref ref-type="bibr" rid="B94">2010</xref>; Yousuf et al., <xref ref-type="bibr" rid="B96">2012</xref>).</p>
</sec>
<sec>
<title>2.2. Feedback and Backchanneling in Communication</title>
<p>In the view of Yngve (<xref ref-type="bibr" rid="B95">1970</xref>), communication happens over a <bold>main channel</bold>, which carries the main message, as well as a <bold>back channel</bold>. Signals on the back channel&#x02014;which have come to be called simply <bold>backchannels</bold> themselves&#x02014;constitute feedback from the listener to the speaker. Feedback can be considered to be <bold>positive</bold> or <bold>negative</bold> (Rajan et al., <xref ref-type="bibr" rid="B72">2001</xref>). This can be referred to as the <bold>polarity</bold> of the feedback (Allwood et al., <xref ref-type="bibr" rid="B4">1992</xref>; Buschmeier and Kopp, <xref ref-type="bibr" rid="B18">2018</xref>). On a wider scope, the polarity of entire utterances or statements is often referred to as <italic>sentiment</italic> (Wilson et al., <xref ref-type="bibr" rid="B93">2009</xref>). Peters et al. (<xref ref-type="bibr" rid="B68">2005</xref>) gave the following example of multimodal feedback with a distinctly negative polarity:</p>
<disp-quote><p>&#x0201C;For instance, to show you don&#x00027;t trust what is being said, a negative backchannel of believability, you can incline your head while staring obliquely and frowning to the Sender: two gaze signals combined with a head signal.&#x0201D; (Peters et al., <xref ref-type="bibr" rid="B68">2005</xref>)</p></disp-quote>
<p>A complementary view to that of Yngve (<xref ref-type="bibr" rid="B95">1970</xref>) was presented by Clark (<xref ref-type="bibr" rid="B20">1996</xref>), who instead split communication into <italic>track 1</italic> and <italic>track 2</italic>. The first track contains main contributions to the discourse, and the second track contains comments on content on the first track. Notably, this is not connected to turn-taking, unlike Yngve&#x00027;s main and back channel model: if the listener takes the turn and says &#x0201C;Wasn&#x00027;t his father dead by then?&#x0201D;, the utterance is not part of the back channel, but since it is purely a comment on a previous utterance, it is still part of <italic>track 2</italic>.</p>
<p>Bavelas et al. (<xref ref-type="bibr" rid="B10">2000</xref>) proposed the difference between <italic>specific</italic> backchannels and <italic>generic</italic> backchannels. In this view, <italic>specific backchannels</italic> are direct comments on the context (e.g., frowning, &#x0201C;wow!&#x0201D;) and <italic>generic backchannels</italic> are less specific signals whose main function is indicating that the listener is paying attention to the speaker&#x00027;s speech. Bavelas et al. (<xref ref-type="bibr" rid="B11">2002</xref>) also showed that gaze cues were an important signal from the speaker to invite backchannel activity from the listener. The view of backchannels as a tool by which the listener can shape the story of the speaker has been supported by a corpus study by Tolins and Fox Tree (<xref ref-type="bibr" rid="B86">2014</xref>). Their study showed that generic backchannels were likely to lead to the speaker continuing their story, while specific backchannels would lead to an elaboration or repair.</p>
<p>Clark (<xref ref-type="bibr" rid="B20">1996</xref>) described communication as a <bold>joint project</bold>, where both the speaker and listener give and receive feedback on several levels to fulfil the task of making the message come across. Clark and Krych (<xref ref-type="bibr" rid="B22">2004</xref>) showed that participants in an experiment were able to build a LEGO model more quickly when coordinating the building activity with an instructor, through speech and other modalities. This illustrates the parallel between a cooperative task and the communication used to facilitate the cooperative task. In the view of Clark (<xref ref-type="bibr" rid="B19">1994</xref>), communication is coordinated on four levels:</p>
<list list-type="order">
<list-item><p>Vocalisation and attention</p></list-item>
<list-item><p>Presentation and identification</p></list-item>
<list-item><p>Meaning and understanding</p></list-item>
<list-item><p>Proposal and uptake</p></list-item>
</list>
<p>Each of these four levels forms a pair of actions on behalf of the speaker and the listener. Both actions must be performed at the same time for either one to be meaningful. The goal of an interaction is to reach mutual acceptance, where both participants believe that proposal and uptake have been achieved (Clark, <xref ref-type="bibr" rid="B20">1996</xref>). Allwood et al. (<xref ref-type="bibr" rid="B4">1992</xref>) presented a similar model, where feedback serves primarily the four functions of indicating <italic>contact, perception, understanding</italic> and <italic>attitudinal reactions</italic>. In the model by Clark (<xref ref-type="bibr" rid="B20">1996</xref>), attitudinal reactions implicitly fall under <italic>acceptance</italic>, to the extent that they are covered by this model.</p>
<p>Feedback ladders like those presented by Clark (<xref ref-type="bibr" rid="B20">1996</xref>) create a system where polarised feedback on one level can imply feedback of the same or another polarity on another level. Clark (<xref ref-type="bibr" rid="B20">1996</xref>) defines two rules for how this process functions with the four levels mentioned above. <italic>Upward completion</italic> means that negative feedback on a low level of feedback implies negative feedback on all higher levels&#x02014;negative <italic>attention</italic> implies negative <italic>identification, understanding</italic> and <italic>uptake</italic> since one can not identify what one is not paying attention to, can not understand what one has not identified, and can&#x00027;t accept what one has not understood (Clark, <xref ref-type="bibr" rid="B20">1996</xref>). The inverse rule of <italic>downward evidence</italic> instead states that any feedback on a level&#x02014;positively or negatively polarised&#x02014;implies positive feedback on all lower levels. If one provides evidence of positive understanding, then this implies that the listener must also have identified and attended to the message (Clark, <xref ref-type="bibr" rid="B20">1996</xref>). Notably, these two rules mean that <italic>positive</italic> feedback on a level says nothing about the levels <italic>above</italic> it&#x02014;when the listener provides evidence of having positively understood an utterance, this neither says that they have accepted, nor that they have not accepted, that same utterance.</p>
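<p>To make these two rules concrete, the following minimal Python sketch (our own illustration, not a formulation taken from Clark, 1996) shows how polarised evidence on one level could be propagated to the other levels, assuming the four levels are ordered from attention (lowest) to uptake (highest):</p>
<preformat># Levels ordered from lowest to highest (our own labels, for illustration).
LEVELS = ["attention", "identification", "understanding", "uptake"]

def propagate(level, polarity):
    """Apply upward completion and downward evidence to one piece of
    evidence, returning the polarity implied for every level (None where
    nothing can be concluded)."""
    i = LEVELS.index(level)
    implied = {lvl: None for lvl in LEVELS}
    implied[level] = polarity
    # Downward evidence: any feedback on a level implies positive
    # feedback on all lower levels.
    for lvl in LEVELS[:i]:
        implied[lvl] = "positive"
    # Upward completion: negative feedback implies negative feedback on
    # all higher levels; positive feedback says nothing about them.
    if polarity == "negative":
        for lvl in LEVELS[i + 1:]:
            implied[lvl] = "negative"
    return implied

# Positive understanding implies positive identification and attention,
# but leaves uptake undetermined.
print(propagate("understanding", "positive"))</preformat>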
<p>The <bold>grounding criterion</bold> is the level of feedback, among the four listed above, upon which feedback must be given for both the speaker and the listener to believe that communication works at any given point in time (Paek and Horvitz, <xref ref-type="bibr" rid="B66">2000</xref>). For example, sometimes it might be enough that the listener shows continued attention, but sometimes the speaker might want to make sure that the listener has actually understood what has been said. If the speaker does not get enough feedback from the listener, the speaker might <italic>elicit</italic> feedback. The grounding criterion can change over the course of a conversation, or over the course of an individual utterance, as the speaker signals appropriate points in time for the listener to deploy backchannel signals, or elicits feedback on a higher level. Clark and Brennan (<xref ref-type="bibr" rid="B21">1991</xref>) also argued that the grounding criterion depends on the channels of communication being used for the discourse, with more limited methods of communication (phone, mail) requiring more explicit positive feedback than more multimodal methods of communication (face-to-face).</p>
<p>In the context of a presentation agent, the model of joint projects, joint problems and joint remedies presented by Clark (<xref ref-type="bibr" rid="B20">1996</xref>) can be a useful model for disambiguation between different types of feedback to the system, and a way to choose what strategies to use to repair problems in communication (Baker et al., <xref ref-type="bibr" rid="B7">1999</xref>; Buschmeier and Kopp, <xref ref-type="bibr" rid="B17">2013</xref>; Axelsson and Skantze, <xref ref-type="bibr" rid="B6">2020</xref>). Buschmeier and Kopp (<xref ref-type="bibr" rid="B16">2011</xref>) argued that a Bayesian model, taking into account the previously estimated state of the user as well as the feedback as it is delivered, interpreted incrementally, is an appropriate method for a conversational agent to estimate the polarity and grounding level of the user&#x00027;s feedback at any given point in time. The advantage of such a model is that an absence of feedback can be represented as the user not having provided evidence of any polarity, which opens up the possibility of elicitation strategies.</p>
<p>If a presentation agent can identify and classify feedback given by the user, there are several ways in which the agent can use it to adapt the presentation. In Axelsson and Skantze (<xref ref-type="bibr" rid="B6">2020</xref>), we used a knowledge graph to keep track of the grounding status of specific facts presented to the user, creating a direct link between grounding and what statements are possible to present, as well as how the robot can refer to entities in the presentation. An alternative is the approach presented by Pichl et al. (<xref ref-type="bibr" rid="B69">2020</xref>), where edges containing information about the user&#x00027;s attitude toward and understanding of concepts were inserted into a knowledge graph. Alternative ways to adapt to identified and classified feedback have been presented by Buschmeier and Kopp (<xref ref-type="bibr" rid="B16">2011</xref>).</p>
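<p>As a rough illustration of the general idea of tracking a grounding status per fact, consider the following hypothetical Python sketch; the facts, labels and update rule are placeholders and do not reflect the actual implementation in Axelsson and Skantze (2020):</p>
<preformat># Each fact (edge) in the knowledge graph carries a grounding status that is
# updated when classified feedback arrives for the fact just presented.
# Hypothetical structure, for illustration only.
facts = {
    ("The Starry Night", "painted_by", "Vincent van Gogh"): "ungrounded",
    ("The Starry Night", "painted_in", "1889"): "ungrounded",
}

def update_grounding(fact, feedback_polarity):
    if feedback_polarity == "positive":
        facts[fact] = "grounded"          # safe to build on this fact
    elif feedback_polarity == "negative":
        facts[fact] = "needs_repair"      # repeat, rephrase or elaborate
    # neutral feedback leaves the status unchanged, which can later
    # trigger an elicitation strategy

update_grounding(("The Starry Night", "painted_by", "Vincent van Gogh"),
                 "positive")
print(facts)</preformat>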
</sec>
<sec>
<title>2.3. Feedback in Different Modalities</title>
<p>Feedback from the listener toward the speaker can be expressed in different modalities. Vocal feedback uses the auditory channel, and has both a verbal/linguistic component (the words being spoken), as well as non-verbal components, such as prosody (Stocksmeier et al., <xref ref-type="bibr" rid="B82">2007</xref>; Malisz et al., <xref ref-type="bibr" rid="B53">2012</xref>; Romero-Trillo, <xref ref-type="bibr" rid="B74">2019</xref>). Non-vocal, non-verbal feedback is expressed in the visual channel (Jokinen, <xref ref-type="bibr" rid="B41">2009</xref>; Nakatsukasa and Loewen, <xref ref-type="bibr" rid="B59">2020</xref>), and can take the forms of gestures (Krauss et al., <xref ref-type="bibr" rid="B47">1996</xref>), gaze (Kleinke, <xref ref-type="bibr" rid="B43">1986</xref>; Thepsoonthorn et al., <xref ref-type="bibr" rid="B85">2016</xref>), facial expressions (Buck, <xref ref-type="bibr" rid="B15">1980</xref>; Krauss et al., <xref ref-type="bibr" rid="B47">1996</xref>) and pose (Edinger and Patterson, <xref ref-type="bibr" rid="B25">1983</xref>). In this section, we will provide a more thorough discussion on previous research related to feedback in those modalities.</p>
<sec>
<title>2.3.1. Speech</title>
<p>A common form of vocal feedback is the <italic>backchannel</italic>, like &#x0201C;uh-huh&#x0201D; (Yngve, <xref ref-type="bibr" rid="B95">1970</xref>). There is also a span of vocal feedback that takes place somewhere between the main channel and the back channel. A specific form of such feedback is the <bold>clarification request</bold>, defined by Purver (<xref ref-type="bibr" rid="B71">2004</xref>) as a &#x0201C;dialogue device allowing a user to ask about some feature (e.g., the meaning or form) of an utterance, or part thereof.&#x0201D; A similar notion is that of <bold>echoic responses</bold> (e.g., &#x0201C;tomorrow?&#x0201D;), where the listener repeats part of the speaker&#x00027;s utterance as a backchannel. These may serve as either an acknowledgement that the listener has heard that specific part of the speaker&#x00027;s utterance, or a repair request, where the original speaker must make an effort to clarify the previous utterance. Whether these should be considered negative or positive depends on whether they are interpreted as questions (i.e., a request for clarification) or not, and this difference can (to some extent) be signalled through prosody. The most commonly described tonal characteristic for questions is high final pitch and overall higher pitch (Hirst and Di Cristo, <xref ref-type="bibr" rid="B33">1998</xref>), and this is especially true when the word order cannot signal the difference on its own. Several studies of fragmentary clarification requests (i.e., which signal negative feedback) have shown that they are associated with a rising final pitch, in both Swedish (Edlund et al., <xref ref-type="bibr" rid="B26">2005</xref>) and German (Rodr&#x000ED;guez and Schlangen, <xref ref-type="bibr" rid="B73">2004</xref>).</p>
<p>The distinction between positive and negative feedback is also similar to the notion of <bold>go on</bold> and <bold>go back</bold> signals in dialogue, as proposed by Krahmer et al. (<xref ref-type="bibr" rid="B46">2002</xref>). In an analysis of a human-machine dialogue corpus, they found that <italic>go back</italic> signals were longer, lacked new information and often contained corrections or repetitions. They also found that there is a strong connection between the prosody and timing of the listener&#x00027;s response and whether the response is interpreted as <italic>go on</italic> or <italic>go back</italic>. Additionally, Krahmer et al. (<xref ref-type="bibr" rid="B46">2002</xref>) pointed out that the classification as <italic>go on</italic> and <italic>go back</italic> was dependent on the dialogue context; if the system says &#x0201C;Should I repeat that?&#x0201D; or &#x0201C;Did you understand?&#x0201D;, the meaning of answers like &#x0201C;yes&#x0201D; or &#x0201C;no&#x0201D; can depend entirely on prosody and timing. Even when extended to more classes than <italic>go on</italic> and <italic>go back</italic>, like in the feedback classification schemes by Clark (<xref ref-type="bibr" rid="B20">1996</xref>) or Allwood et al. (<xref ref-type="bibr" rid="B4">1992</xref>) presented in section 2.2, the argument that context and multimodal signals (beyond pure linguistic content) can help distinguish between minimal pairs still holds.</p>
<p>Negative feedback can also be linked to the notion of &#x0201C;uncertainty,&#x0201D; i.e., signs of uncertainty can also be regarded as negative feedback on some of the levels of understanding discussed in section 2.2. Skantze et al. (<xref ref-type="bibr" rid="B79">2014</xref>) explored user feedback in the context of a human-robot map task scenario, where the robot was instructing the users on how to draw a route. They showed that participants signalled uncertainty in their feedback through both prosody and word choice (lexical information). Uncertain utterances were shown to have a flatter pitch contour than certain utterances, and were also longer and had a lower intensity. Hough and Schlangen (<xref ref-type="bibr" rid="B36">2017</xref>) presented a grounding model for human-robot interaction where the robot could signal its uncertainty about what the user was referring to. The scenario was a pentamino block game, where the robot&#x00027;s goal was to identify the piece that the human referred to. Hough and Schlangen (<xref ref-type="bibr" rid="B36">2017</xref>) showed that users picked up the robot&#x00027;s uncertainty especially when it communicated it by moving more slowly toward the piece it thought the user was referring to. Hough and Schlangen (<xref ref-type="bibr" rid="B36">2017</xref>) framed uncertainty as a measure of the agent&#x00027;s certainty and understanding of its actions, as opposed to its knowledge&#x02014;this is an extension of grounding.</p>
</sec>
<sec>
<title>2.3.2. Gaze</title>
<p>Gaze can be used for feedback from a listener toward a speaker, but this typically happens in combination with other signals and modalities. Mehlmann et al. (<xref ref-type="bibr" rid="B57">2014</xref>) showed that gaze is used by listeners to show that they understand which object the speaker is referring to, and that this co-occurs with the physiological process of actually finding the object. This also serves as a signal of <italic>joint attention</italic> from the listener toward the referred object. Gaze is also a way to improve the user&#x00027;s perception of the human-ness and social competence of the system (Zhang et al., <xref ref-type="bibr" rid="B99">2017</xref>; Kontogiorgos et al., <xref ref-type="bibr" rid="B45">2021</xref>; Laban et al., <xref ref-type="bibr" rid="B50">2021</xref>).</p>
<p>Gaze is sometimes considered a gestural backchannel (Bertrand et al., <xref ref-type="bibr" rid="B12">2007</xref>), although it is more commonly considered to be a turn-taking indication or turn-taking cue (Skantze, <xref ref-type="bibr" rid="B78">2021</xref>). The uses of gaze as a turn-taking cue are not directly relevant to this paper, as our scenario has a strict turn-taking protocol, where the robot is the main speaker and the user mostly provides brief feedback.</p>
<p>On a low feedback level, mutual gaze can be used as a sign that a user wants to interact with the system, a signal called <italic>engagement</italic> by Bohus and Rudnicky (<xref ref-type="bibr" rid="B14">2006</xref>). Kuno et al. (<xref ref-type="bibr" rid="B48">2007</xref>) found that mutual gaze and co-occurring nods were important indicators of an audience&#x00027;s engagement with a robot&#x00027;s presentation in a museum scenario. Nakano and Ishii (<xref ref-type="bibr" rid="B58">2010</xref>) presented a model of gaze as a sign of mutual engagement. An agent that used a more sophisticated gaze sensing model was preferred by test participants.</p>
</sec>
<sec>
<title>2.3.3. Head Movements</title>
<p>Similarly to the results related to gaze by Bavelas et al. (<xref ref-type="bibr" rid="B11">2002</xref>) and the results related to prosody by Ward and Tsukahara (<xref ref-type="bibr" rid="B91">2000</xref>), McClave (<xref ref-type="bibr" rid="B55">2000</xref>) has shown that head-nods are a viable listener response to a backchannel-inviting cue from the speaker, especially if that cue is also a nod. Stivers (<xref ref-type="bibr" rid="B81">2008</xref>) presented the view that nods are a stronger signal than conventional backchannels like &#x0201C;uh-huh&#x0201D; and &#x0201C;OK,&#x0201D; arguing that they present evidence that the recipient (listener) is able to visualise being part of the event being told by the teller (speaker).</p>
<p>Heylen (<xref ref-type="bibr" rid="B31">2005</xref>) presented a list of head movements together with the communicative functions they often serve, both for speakers and listeners. Heylen (<xref ref-type="bibr" rid="B31">2005</xref>) argued that head movements can be a communicative signal on both <italic>track 1</italic> and <italic>track 2</italic> as defined by Clark (<xref ref-type="bibr" rid="B20">1996</xref>) (see section 2.2). However, the examples presented by Heylen all related to head movements produced by the speaker, and were co-ordinated with speech or utterances that are unambiguously part of <italic>track 1</italic>&#x02014; such as when the speaker produces an <italic>inclusive</italic> sweeping hand movement while saying the word &#x0201C;everything,&#x0201D; indicating that the scope of the word is wide. Other gestures and functions listed by Heylen, like nodding to signal agreement, were more unambiguously part of <italic>track 2</italic>.</p>
<p>In a multimodal study of the <italic>ALICO</italic> corpus, Malisz et al. (<xref ref-type="bibr" rid="B54">2016</xref>) found that listeners used head movements twice as often as speech in response to being told a story by a speaker. Additionally, nods are by far the most common head movement feature, and multiple nods are twice as common as single nods. Similarly, head shakes occur much more often in multiples than one-by-one, but the opposite holds for head tilts and head jerks, which are significantly more likely to occur one-by-one than in multiples. Singh et al. (<xref ref-type="bibr" rid="B76">2018</xref>) arrived at different rates of usage of modalities when annotating a corpus of children reacting to each other&#x00027;s stories, finding that gazing at the speaker, smiling, leaning toward the speaker, raising one&#x00027;s brow, and responding verbally are all more common behaviours than nodding. While their results may not extend to adults, Singh et al. (<xref ref-type="bibr" rid="B76">2018</xref>) also found that adult evaluators considered certain signals to be indicative of positive or negative attention based on context&#x02014;nodding was correlated with positive attention if the nod was long, but with negative attention if the nod was fast. Oertel et al. (<xref ref-type="bibr" rid="B64">2016</xref>) showed that head-nods are perceived as less indicative of attention the less pronounced they are, the slower they are, and the shorter they are, and concluded that head-nods are not merely a signal of positive attention, but rather reflect various degrees of attentiveness. The apparent difference between these results and those of Singh et al. (<xref ref-type="bibr" rid="B76">2018</xref>) could be argued to be because Singh et al. studied children, whose gestural behaviour is known to be different from that of adults (Colletta et al., <xref ref-type="bibr" rid="B23">2010</xref>).</p>
<p>Navarretta et al. (<xref ref-type="bibr" rid="B60">2012</xref>) found that multiple nods are more common than single nods in Swedish and Danish experiment participants, while Finnish participants used single nods more often than multiple nods. This highlights how signals can be used differently even in cultures that are closely related to each other.</p>
<p>Novick (<xref ref-type="bibr" rid="B62">2012</xref>) showed that head-nods in dyadic conversations between humans were significantly cued by gaze&#x02014;i.e., the listener would nod when gazed at by the speaker. This allows the listener to use feedback when it is the most likely to be picked up by the speaker, in a modality fitting for this. Novick and Gris (<xref ref-type="bibr" rid="B63">2013</xref>) showed that nods were less frequent in multi-party conversations, where there was more than one listener, and not necessarily cued by gaze.</p>
<p>Sidner et al. (<xref ref-type="bibr" rid="B75">2006</xref>) showed that users of a system produced significantly more head nods <italic>after</italic> they figured out that the system could recognise such signals. These results are in line with the findings by Kontogiorgos et al. (<xref ref-type="bibr" rid="B45">2021</xref>) and Laban et al. (<xref ref-type="bibr" rid="B50">2021</xref>), which show that this applies to head movements and speech, respectively.</p>
</sec>
<sec>
<title>2.3.4. Facial Expressions</title>
<p>Facial expressions are typically viewed as a sign of the listener&#x00027;s emotional state (Mehlmann et al., <xref ref-type="bibr" rid="B56">2016</xref>): for example, eyebrow movements are known to signal interest or disbelief from a listener toward a speaker (Ekman, <xref ref-type="bibr" rid="B27">2004</xref>). This is grounding on a high level&#x02014;as mentioned in section 2.2, Allwood et al. (<xref ref-type="bibr" rid="B4">1992</xref>) placed attitudinal reactions as the highest level of the feedback scale, while Clark (<xref ref-type="bibr" rid="B20">1996</xref>) would classify it as a variant of showing acceptance.</p>
<p>Jokinen and Majaranta (<xref ref-type="bibr" rid="B42">2013</xref>) argued that facial expressions are closely tied to gaze signals; when interacting with a human, or with an embodied agent, listeners tend to gaze at the speaker&#x00027;s eyes and upper face region to be prepared to catch subtle facial expressions.</p>
</sec>
<sec>
<title>2.3.5. Body Pose</title>
<p>Body pose is typically only used as a unimodal indication of whether the sensed individual wants to engage with the system or not, as explored by Bohus and Rudnicky (<xref ref-type="bibr" rid="B14">2006</xref>). This is also how body pose was used in the model for estimating classroom engagement presented by Goldberg et al. (<xref ref-type="bibr" rid="B28">2021</xref>). Engagement corresponds to the lowest level of Clark&#x00027;s four levels of feedback described in section 2.2: <italic>attention</italic>. An exception to this is <italic>shrugging</italic>, which Goldberg et al. (<xref ref-type="bibr" rid="B28">2021</xref>) found to be used by children toward a reading partner agent. Shrugging was found by Goldberg et al. (<xref ref-type="bibr" rid="B28">2021</xref>) to signal that the child does not know the answer to a question. We would argue that this implies positive <italic>identification</italic> but negative or ambiguous <italic>understanding</italic> by the scheme detailed in section 2.2. Battersby (<xref ref-type="bibr" rid="B9">2011</xref>) showed that speakers were significantly more likely to use hand gestures than listeners.</p>
<p>Oppenheim et al. (<xref ref-type="bibr" rid="B65">2021</xref>) recently showed that leaning was used as an extended gaze cue by participants in an experiment where they had to learn how to build an object from another participant. In this experiment, the learners, corresponding to our listeners, coordinated gaze cues by leaning around 40% of the time when being taught how to build a smaller Lego model, and 26% of the time when being taught how to build a larger pipe structure. While the responses appeared to correspond to the physical properties of the object, suggesting that learners leaned because they physically wanted to see the object being referred to, the gaze and lean signals happened in response to inviting cues by the teacher, indicating that the signals simultaneously served both a grounding purpose and a practical purpose.</p>
<p>Park et al. (<xref ref-type="bibr" rid="B67">2019</xref>) found that there was a strong connection between the body pose of children&#x02014; specifically the gesture of leaning forwards&#x02014;and their intent to engage with a system. Zaletelj and Ko&#x00161;ir (<xref ref-type="bibr" rid="B98">2017</xref>) have presented models for predicting classroom engagement from body pose in this sense. Body pose can also be used to predict the emotional state of a user: Sun et al. (<xref ref-type="bibr" rid="B83">2019</xref>) used body pose as an input to estimate subjects&#x00027; emotional state, training a neural network on a dataset labelled with the speaker&#x00027;s intended emotional state and the listener&#x00027;s interpretation. This was then used in a robot that was consistently evaluated as more emotionally aware than a baseline.</p>
</sec>
<sec>
<title>2.3.6. Combining Modalities</title>
<p>It is important not only to study feedback functions of individual modalities, but also to consider their combined effects. Clark and Krych (<xref ref-type="bibr" rid="B22">2004</xref>) showed how multimodal grounding worked in an instructor-instructee scenario where a LEGO model was constructed by one participant. The data showed many multimodal patterns in how participants coordinated their speech with other modalities to ensure grounding (often specifically establishing which LEGO piece the speaker was referring to). Crucially, participants used visual modalities (for example, holding up a LEGO piece) when that modality resulted in an easier and faster reference than speech.</p>
<p>In a corpus study focusing on physiological indications of attention, Goswami et al. (<xref ref-type="bibr" rid="B29">2020</xref>) found that the rate of blinking, pupil dilation, head movement speed and acceleration, as well as prosodic features and facial landmarks can create a good model for predicting children&#x00027;s engagement with a task as well as when they are going to deploy backchannels. The authors&#x00027; random forest model prioritises gaze as the most important feature for measuring whether the children were engaged with the task.</p>
<p>Visser et al. (<xref ref-type="bibr" rid="B90">2014</xref>) presented a model for how a conversational system could show grounding to a speaking user. This is the inverse of the scenario analysed by us, but the approach, where specific backchannels were assigned to specific states of hierarchical subcomponents of the system, is interesting. For example, their agent nodded if the language understanding component of the system reported a high confidence, and frowned if there was a pause longer than 200 ms and the language component was not confident that it had understood the last thing the speaker said. Kontogiorgos et al. (<xref ref-type="bibr" rid="B44">2019</xref>) presented a model of estimating uncertainty of a test participant by combining gaze and pointing modalities, but the authors conclude that it is uncertain how the results extend to domains outside of the specific test scenario.</p>
<p>Oppenheim et al. (<xref ref-type="bibr" rid="B65">2021</xref>) showed that the modalities that a listener used depended on the context of the cue used by the speaker. The scenario was a teacher/student scenario where the participants took turns teaching each other how to build an object. Depending on if the teacher looked at the student to <italic>supplement, highlight</italic> or <italic>converse</italic>, the student&#x00027;s response modalities significantly changed. Nod responses were significantly more common than speech responses if the teacher&#x00027;s act was <italic>supplement</italic>, nods and speech were approximately as common when responding to <italic>highlight</italic> actions, and speech was significantly more common in response to <italic>converse</italic> acts.</p>
<p>Hsieh et al. (<xref ref-type="bibr" rid="B37">2019</xref>) used multimodal features, specifically speech and head movements, to estimate the certainty of users of their virtual agent. They used <italic>certainty</italic> and <italic>uncertainty</italic> as a term for feedback that can be mapped to several of the levels we described in section 2.2. While the authors showed that applying statistical models to the data is a viable way to estimate self-reported certainty in the answers the users gave to the robot&#x00027;s questions, the study was limited by the small number of participants.</p>
<p>To conclude this review, a large body of work exists that investigates how individual modalities can be sensed and interpreted in terms of feedback. There is less work that investigates the combined effect of several modalities as feedback to achieve some task-specific goal between an agent and a user. More specifically, we have not found any previous systematic analyses of how human listeners express feedback in various modalities, as a response to a presentation agent.</p>
</sec>
</sec>
</sec>
<sec id="s3">
<title>3. Method</title>
<sec>
<title>3.1. Data Collection</title>
<p>For our analysis, we set up an experiment where participants interacted with a robot presenting a piece of art, as seen in <xref ref-type="fig" rid="F1">Figure 1</xref>, similar to that used in Axelsson and Skantze (<xref ref-type="bibr" rid="B5">2019</xref>). As a robot platform, the Furhat robot head was used, which has a back-projected animated face and a mechanical neck (Al Moubayed et al., <xref ref-type="bibr" rid="B2">2012</xref>). The robot presented two paintings<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> to each participant. Although the agent communicated with the participants in English, participants were allowed to respond in Swedish or English if they wished. A Wizard of Oz setup was used, and the Wizard sat behind a separating wall on the other side of the room. Participants were led to believe that they were interacting with a fully autonomous system, and that the Wizard&#x00027;s role was only to make sure that data were being successfully recorded. The presentation was automated to a large extent, but the Wizard controlled whether the system would repeat utterances, move on, or clarify. To trigger more negative feedback (for the sake of the analysis), the system would misspeak with a certain probability (also controlled by the Wizard). Misspeaking was implemented by replacing a key part of the utterance with something else or by muting parts of the synthesised speech.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>A snapshot of an experiment participant, shown from all three angles from which we recorded them. The participant&#x00027;s face has been blurred for privacy. Top: the participant is shown from behind, illustrating the set-up of our experiment, with the presenting Furhat head visible.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-741148-g0001.tif"/>
</fig>
<p>In total, 33 test participants were recruited. For technical reasons, data from 5 participants had to be discarded, leaving 28 usable participants. The participants were recorded on video from multiple angles, and were equipped with a microphone which recorded their speech. A snapshot from the video recording can be seen in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
</sec>
<sec>
<title>3.2. Manual Data Annotation</title>
<p>Video and audio data recorded from the experiment were synchronised and annotated in multiple ways. We cut the data into <bold>clips</bold> representing the time between the robot beginning an utterance and the video frame before it started its next utterance; this left us with between 59 and 125 clips per painting presentation. For the rest of this paper, <bold>clip</bold> will be used to refer to a video recording of the robot saying something, and the full reaction of the test participant to that utterance before the robot starts saying its next line.</p>
<sec>
<title>3.2.1. Feedback Polarity of Each Clip</title>
<p>Each clip where the participant&#x00027;s turn was at least five seconds long was annotated based on what type of feedback it contained, using Amazon Mechanical Turk. Our original idea was to classify the feedback as being positive or negative acceptance, understanding, hearing or attention, as defined by Clark (<xref ref-type="bibr" rid="B20">1996</xref>) (discussed in section 2.2 above). However, initial results from annotating our data using these labels gave agreement scores that we considered too low to annotate the entire dataset this way. We believe that the scientific definitions of acceptance, understanding, hearing and attention were too nuanced and theoretical for laypeople to apply consistently, and that our annotators were mostly annotating clips as positive or negative attention, hearing, understanding or acceptance based on their immediate understanding of those words. This would have made the annotations questionable even if we were able to get a higher agreement score.</p>
<p>As a result of this, we decided to instead use the labels <bold>positive</bold>, <bold>negative</bold>, and <bold>neutral</bold>, and ignore the grounding level with which the feedback could be associated. The <italic>positive</italic> and <italic>negative</italic> labels were described to the annotators equivalently to the <italic>go on</italic> and <italic>go back</italic> labels used by Krahmer et al. (<xref ref-type="bibr" rid="B46">2002</xref>), with the <italic>neutral</italic> label representing cases where the annotator did not think that the participant&#x00027;s response was strong enough to classify it as either of the others:</p>
<list list-type="bullet">
<list-item><p>&#x0201C;Positive&#x0201D; was described as &#x0201C;Pick this if you think the person is showing that the presentation can continue the way it is currently going, without having to stop to repair something that went wrong. If the person shows understanding of what the robot said, or asks a followup question, then this may be the right option.&#x0201D;</p></list-item>
<list-item><p>&#x0201C;Negative&#x0201D; was described as &#x0201C;This option is right if you think the person is showing that the robot needs to stop and repeat something, or that it needs to explain something that the person didn&#x00027;t understand, hear or see.&#x0201D;</p></list-item>
<list-item><p>&#x0201C;Neutral&#x0201D; was described as &#x0201C;If the person doesn&#x00027;t really react or show any clear signals, or hasn&#x00027;t really had time to react by the end of the video, you should pick this.&#x0201D;</p></list-item>
</list>
<p>Each clip was annotated by three crowdworkers. In effect, this classified each clip by the <italic>polarity</italic> of the feedback contained in it, as defined in section 2.2.</p>
<p><xref ref-type="table" rid="T1">Table 1</xref> shows how often each combination of positive, negative, and neutral evaluations appear. If the three evaluations are viewed as votes for a class, then the most common class is three votes for positivity (49%), followed by three votes for negativity (10%) and three votes for neutrality (9%). The final label for each clip in our dataset is determined based on the majority vote. 74 clips receive one vote for each class, and these are assigned the neutral label.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>The distribution of output features across all clips.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Positivity</bold></th>
<th valign="top" align="center"><bold>Negativity</bold></th>
<th valign="top" align="center"><bold>Neutrality</bold></th>
<th valign="top" align="center"><bold>Count</bold></th>
<th valign="top" align="left"><bold>Argmax classification</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">0</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">190</td>
<td valign="top" align="left">Neutral</td>
</tr>
<tr>
<td valign="top" align="left">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">109</td>
<td valign="top" align="left">Neutral</td>
</tr>
<tr>
<td valign="top" align="left">0</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">51</td>
<td valign="top" align="left">Negative</td>
</tr>
<tr>
<td valign="top" align="left">0</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">202</td>
<td valign="top" align="left">Negative</td>
</tr>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">146</td>
<td valign="top" align="left">Neutral</td>
</tr>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">74</td>
<td valign="top" align="left">Neutral</td>
</tr>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">74</td>
<td valign="top" align="left">Negative</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">123</td>
<td valign="top" align="left">Positive</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">120</td>
<td valign="top" align="left">Positive</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1,034</td>
<td valign="top" align="left">Positive</td>
</tr>
</tbody>
</table>
</table-wrap>
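<p>The mapping from the three crowdworker votes to the final label in the rightmost column of <xref ref-type="table" rid="T1">Table 1</xref> can be expressed as a simple majority (argmax) rule, with three-way ties assigned to the neutral class; a minimal sketch in Python:</p>
<preformat>from collections import Counter

def clip_label(votes):
    """votes is a list of three annotations, e.g.
    ['positive', 'positive', 'neutral'].  Returns the majority label,
    or 'neutral' for a three-way tie."""
    label, top = Counter(votes).most_common(1)[0]
    if top == 1:                   # one vote for each class
        return "neutral"
    return label

print(clip_label(["positive", "negative", "negative"]))   # negative
print(clip_label(["positive", "negative", "neutral"]))    # neutral</preformat>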
<p>The distributions seen in <xref ref-type="table" rid="T1">Table 1</xref> give a Fleiss&#x00027; &#x003BA; value of &#x003BA; &#x02248; 0.582, which is <italic>moderate agreement</italic> on the scale by Landis and Koch (<xref ref-type="bibr" rid="B51">1977</xref>). For a classification or annotation task, higher &#x003BA; values than this could be beneficial, but we accept this value since there are likely grey zones between the classes, and it is not obvious that clips with very few signals belong to any of the three classes without knowing the context of the clip. One annotator may assume that no signals are a positive signal, presuming that the context before the clip started is such that the listener has established a low grounding criterion for continuing the dialogue. Another annotator may have seen other clips of the robot eliciting feedback from users, and assume that the robot always wants the listener to react in some way, and thus evaluates the same clip as negative. For comparison, Malisz et al. (<xref ref-type="bibr" rid="B54">2016</xref>), who classified dialogues using four feedback levels similar to the schemes described in section 2.2, achieved a &#x003BA; score of around 0.3.</p>
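<p>The reported Fleiss&#x00027; &#x003BA; can be recomputed directly from the vote distribution in <xref ref-type="table" rid="T1">Table 1</xref> (three raters per clip, three categories); the following sketch reproduces &#x003BA; &#x02248; 0.58:</p>
<preformat>import numpy as np

# (positive, negative, neutral) vote counts per clip pattern, repeated as
# often as that pattern occurs in Table 1.
patterns = [((0, 0, 3), 190), ((0, 1, 2), 109), ((0, 2, 1), 51),
            ((0, 3, 0), 202), ((1, 0, 2), 146), ((1, 1, 1), 74),
            ((1, 2, 0), 74), ((2, 0, 1), 123), ((2, 1, 0), 120),
            ((3, 0, 0), 1034)]
table = np.array([row for row, count in patterns for _ in range(count)])

n = table.sum(axis=1)[0]                    # raters per clip (3)
p_j = table.sum(axis=0) / table.sum()       # marginal category proportions
P_i = (np.square(table).sum(axis=1) - n) / (n * (n - 1))  # per-clip agreement
P_e = np.square(p_j).sum()                  # chance agreement
kappa = (P_i.mean() - P_e) / (1 - P_e)
print(round(float(kappa), 3))               # approximately 0.582</preformat>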
</sec>
<sec>
<title>3.2.2. Multimodal Signals</title>
<p>Separately from the Mechanical Turk polarity classification, we also annotated each clip with what multimodal signals were used by the participant over time. This was done by a small number of annotators employed at KTH. We based our annotation system on the MUMIN standard by Allwood et al. (<xref ref-type="bibr" rid="B3">2007</xref>). To stay consistent with the clip format that had been used on Mechanical Turk, the multimodal feature annotation was also done clip-by-clip, rather than for the entire recording of a participant&#x00027;s interaction with the robot. This had the downside of making signals that were cut off by the beginning or end of a clip impossible to annotate properly, and the advantage of ensuring that only signals that could be fully seen by the Mechanical Turk evaluators were annotated.</p>
<p>In the MUMIN standard (Allwood et al., <xref ref-type="bibr" rid="B3">2007</xref>), multimodal expressions are only annotated if they have a communicative function, and ignored if they are incidental to the communication (for example, if a participant blinks because their eyes are dry). MUMIN standardises how signals should be annotated for their feedback function (contact and perception, and optionally understanding), but we did not use this part of the standard, since we assumed that such information was captured in the polarity annotation by the Mechanical Turk crowdworkers. Finally, MUMIN is intended to be used for annotation of face-to-face interactions between two participants where the signals used by both participants are relevant and where turn-taking is important. However, since our scenario was very restricted in terms of turn-taking, and since we knew which behaviours were used by the speaking robot, we restricted our annotation to only use MUMIN constructs for annotating the signals used by the listening user, ignoring their meaning.</p>
</sec>
</sec>
<sec>
<title>3.3. Data Processing</title>
<p>We post-processed and merged the annotated data from Mechanical Turk and our internal MUMIN-like annotation. Since the goal of this analysis is to find out what feedback could be detected in an online interaction, the post-processing involved making the multimodal data closer to what could have been generated in real time. For example, two of the features annotated by our multimodal annotation were <italic>multiple nods</italic> and <italic>single nod</italic>&#x02014;but a real-time system cannot know whether a nod is the first of many or a single nod in isolation at the time that the head gesture starts. To address this for head gestures and speech, our post-processing procedure replaced all head gesture features with a generic <italic>ongoing head gesture</italic> feature. The feature describing which head gesture it had been is only delivered on the last frame of <italic>ongoing head gesture</italic>. The same was done for speech (<italic>ongoing speech</italic>).</p>
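<p>As an illustration, the sketch below (hypothetical helper code; only the rule itself is taken from the description above) collapses frame-by-frame head gesture labels into the generic <italic>ongoing head gesture</italic> feature, emitting the specific gesture label only on its final frame.</p>
<preformat>
# Sketch of the head-gesture post-processing described above. Frame labels
# such as "multiple nods" are replaced by "ongoing head gesture"; the
# specific label is only emitted on the last frame of the gesture.
def collapse_head_gestures(frames):
    """frames: list of head-gesture labels per time-frame (None if no gesture)."""
    out = []
    for i, label in enumerate(frames):
        if label is None:
            out.append({})
        else:
            is_last = (i + 1 == len(frames)) or (frames[i + 1] != label)
            features = {"ongoing head gesture": 1}
            if is_last:
                features["head gesture was " + label] = 1
            out.append(features)
    return out

collapse_head_gestures([None, "multiple nods", "multiple nods", None])
</preformat>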
<p>As a separate post-processing step, we converted the transcribed text of the participants&#x00027; speech to binary features representing the contents of the speech: <italic>can&#x00027;t hear, can&#x00027;t see, no</italic> and <italic>yes</italic>. These features were based on whether the transcribed text contained some variation on those phrases in either Swedish or English. The prosody of the speech segment was also converted into <italic>rising F0</italic> or <italic>falling F0</italic> through Praat (Boersma and van Heuven, <xref ref-type="bibr" rid="B13">2001</xref>) by comparing the average <italic>F0</italic> for the first half of the speech segment to the <italic>F0</italic> in the second half, in the cases where Praat was able to extract these values. As mentioned in the previous paragraph, these classifications of speech contents or prosody are only delivered on the last frame of the <italic>ongoing speech</italic> annotation.</p>
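<p>The comparison of first-half and second-half <italic>F0</italic> can be sketched as follows, here using the praat-parselmouth Python bindings to Praat (the specific bindings are an assumption for illustration; the study only reports using Praat).</p>
<preformat>
# Sketch (assumed tooling) of the rising/falling F0 classification: compare
# the mean F0 of the first half of a speech segment to that of the second half.
import numpy as np
import parselmouth

def f0_trend(wav_path):
    pitch = parselmouth.Sound(wav_path).to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                      # drop unvoiced frames (F0 == 0)
    if f0.size >= 2:
        first, second = np.array_split(f0, 2)
        return "rising F0" if second.mean() > first.mean() else "falling F0"
    return None                          # Praat could not extract usable F0
</preformat>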
<p>Our annotators also labelled speech as backchannels or non-backchannels. This distinction is not universal, and our labelling corresponds to what Duncan and Fiske (<xref ref-type="bibr" rid="B24">1977</xref>) would instead call <italic>short back-channels</italic> and <italic>long back-channels</italic>. Additionally, each speech segment was transcribed (if intelligible). Our annotators disagreed on whether short speech like <italic>yeah</italic> or <italic>OK</italic> should be considered backchannels or not, with a tendency to annotate them as non-backchannels. We retroactively went through the data and changed any speech segment that had been transcribed as just the single word <italic>yes, yeah, OK, okay, yep</italic>, and the corresponding phrases in Swedish, to backchannels if they had been annotated as non-backchannels. Longer speech segments containing the words in question (e.g., &#x0201C;OK, that makes sense.&#x0201D;) were left as non-backchannels.</p>
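<p>The relabelling rule can be summarised as in the sketch below; the token list, in particular the Swedish items, is an assumption for illustration.</p>
<preformat>
# Sketch of the retroactive backchannel relabelling: speech segments whose
# full transcript is a single acknowledgement token are treated as
# backchannels. The Swedish tokens listed here are assumptions.
ACK_TOKENS = {"yes", "yeah", "ok", "okay", "yep", "ja", "okej", "japp"}

def relabel_backchannel(transcript, annotated_as_backchannel):
    words = transcript.lower().strip(" .,!").split()
    if len(words) == 1 and words[0] in ACK_TOKENS:
        return True                      # force the backchannel label
    return annotated_as_backchannel      # otherwise keep the annotation
</preformat>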
</sec>
<sec>
<title>3.4. Statistical Models</title>
<p>Apart from analysing how the multimodal signals correlate with the three feedback labels (positive, negative, and neutral), we also apply four statistical models to our dataset, in order to analyse to what extent it is possible to predict these labels from the signals. We here provide a brief overview of these models.</p>
<sec>
<title>3.4.1. Random Forest</title>
<p>Random forest models are a variant of decision tree models where a number of trees classify the data. If used for classification, as in our case, the majority vote determines the forest&#x00027;s classification. A detailed account of random forests for classification tasks is given by Svetnik et al. (<xref ref-type="bibr" rid="B84">2003</xref>). They have previously performed well on feedback analysis tasks, such as the recent work by Jain et al. (<xref ref-type="bibr" rid="B39">2021</xref>), who successfully used random forests to identify multimodal feedback in clips of test participants, or the work by Soldner et al. (<xref ref-type="bibr" rid="B80">2019</xref>), who successfully used random forests to classify whether participants in a study were lying based on multimodal cues. Yu and Tapus (<xref ref-type="bibr" rid="B97">2019</xref>) used random forests to classify emotions based on the combined modalities of thermal vision and body pose, finding that the random forest model successfully combined the modalities to achieve better performance than on either modality in isolation.</p>
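<p>A minimal sketch of such a classifier over static clip-level features, using scikit-learn with dummy data (an assumed toolchain, for illustration only), is shown below.</p>
<preformat>
# Minimal sketch (not necessarily the authors' toolchain) of a random forest
# classifier over static multimodal features, using scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10)).astype(float)     # dummy binary features
y = rng.choice(["positive", "negative", "neutral"], size=200)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))    # each prediction is a majority vote over trees
</preformat>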
</sec>
<sec>
<title>3.4.2. RPART Tree</title>
<p>RPART, short for <italic>recursive partitioning</italic>, is an algorithm for how to split the data when generating a decision tree. The trees are thus simply decision trees, and RPART is a name for the algorithm used to generate them (Hothorn et al., <xref ref-type="bibr" rid="B35">2006</xref>).</p>
<p>RPART trees were included in our analysis not because we expected them to outperform random forest models, but because they are easy to visualise and generate human-understandable patterns for classification. A brief analysis of the generated RPART trees can be found in section 4.2.3.</p>
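<p>As a rough Python stand-in for an RPART tree (an assumption for illustration; RPART itself is typically used through R), a single decision tree can be fitted and printed as human-readable rules as follows.</p>
<preformat>
# Sketch of a single recursively partitioned decision tree, together with a
# human-readable printout of its splits (dummy data and feature names).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3)).astype(float)      # dummy features
y = np.where(X[:, 0] == 1, "positive", "neutral")         # dummy labelling rule

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["head gesture", "speech", "frown"]))
</preformat>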
</sec>
<sec>
<title>3.4.3. Multinomial Regression</title>
<p>Multinomial regression is an extension of logistic regression that allows for multi-class classification by linking the input signals to probabilities for each class. Multinomial regression and variants of logistic regression have been used successfully for dialogue state tracking (Bohus and Rudnicky, <xref ref-type="bibr" rid="B14">2006</xref>) and multimodal signal sensing (Jimenez-Molina et al., <xref ref-type="bibr" rid="B40">2018</xref>; Hsieh et al., <xref ref-type="bibr" rid="B37">2019</xref>).</p>
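<p>A minimal sketch of multinomial (softmax) regression over static clip features, using scikit-learn with dummy data (an assumed toolchain, for illustration only), is shown below.</p>
<preformat>
# Sketch of multinomial logistic regression linking input signals to
# per-class probabilities for the three feedback labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random(size=(200, 10))                            # dummy features
y = rng.choice(["positive", "negative", "neutral"], 200)  # dummy labels

model = LogisticRegression(multi_class="multinomial", max_iter=1000).fit(X, y)
probabilities = model.predict_proba(X[:5])                # one probability per class
</preformat>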
</sec>
<sec>
<title>3.4.4. LSTM Model</title>
<p>The LSTM neural network model, short for <italic>long short-term memory</italic>, was proposed by Hochreiter and Schmidhuber (<xref ref-type="bibr" rid="B34">1997</xref>). Neural networks utilising LSTMs have been used to model a large space of tasks since their introduction, including dialog state tracking (Zilka and Jurcicek, <xref ref-type="bibr" rid="B100">2016</xref>; Pichl et al., <xref ref-type="bibr" rid="B69">2020</xref>) and turn-taking (Skantze, <xref ref-type="bibr" rid="B77">2017</xref>). Within the multimodal feedback space, Agarwal et al. (<xref ref-type="bibr" rid="B1">2019</xref>) have shown that LSTM models can perform incremental sentiment analysis, and Ma et al. (<xref ref-type="bibr" rid="B52">2019</xref>) have proposed model structures that make use of multimodal signals to classify emotions in subjects.</p>
<p>Our LSTM model starts with an embedding layer between the input features and the LSTM layer. The LSTM layer has 64 nodes and feeds into a three-wide output layer, followed by a softmax which gives the outputs as classification probabilities. Categorical cross-entropy is used as the loss function, and categorical accuracy as the accuracy function. For each fold, the model was trained for 100 epochs. The accuracy on the final time-step of each clip was calculated, and this accuracy value was used to choose the most accurate epoch. Larger and deeper models were tried out, but did not achieve better accuracy overall.</p>
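<p>A minimal sketch of this architecture is given below; the feature count, the width of the input embedding, and the optimiser are assumptions, and this is not the exact implementation used in the study.</p>
<preformat>
# Minimal sketch of the LSTM architecture described above, with one binary
# multimodal feature vector per 100 ms time-frame and three output classes.
from tensorflow.keras import layers, models

NUM_FEATURES = 40   # assumed number of input features per time-frame
EMBED_DIM = 16      # assumed width of the input embedding layer

model = models.Sequential([
    layers.Input(shape=(None, NUM_FEATURES)),
    layers.Dense(EMBED_DIM),                 # embedding of the input features
    layers.LSTM(64, return_sequences=True),  # 64-node LSTM layer
    layers.Dense(3, activation="softmax"),   # three-wide layer, class probabilities
])
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["categorical_accuracy"])
</preformat>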
</sec>
<sec>
<title>3.4.5. Data Formatting</title>
<p>For the non-timing-aware random forest, RPART tree, and multinomial regression models, the multimodal feature set of each clip is converted into static features in four different ways. If the features are <italic>split</italic>, then the signals used during the robot&#x00027;s speech are separated from the features used during the participant&#x00027;s response. The alternative to this is <italic>non-split</italic>, where each feature represents the usage of a signal for the entire clip. If the features are formatted using <italic>binary</italic> formatting, then the value of the feature is 1 if it is present at any point in the clip, and 0 otherwise; the alternative to this is <italic>fractional</italic> formatting, where a value between 0 and 1 (inclusive) is used, representing for how much of the clip the feature is present. This post-processing is required because only the LSTM model can be fed data by time-frame.</p>
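<p>The four static formats can be sketched as follows (hypothetical data layout: one set of active feature names per time-frame, with a flag marking the frames in which the robot speaks).</p>
<preformat>
# Sketch of converting frame-level features into the static formats described
# above. `frames` is a list of sets of active feature names, one per time-frame,
# and `robot_speaking` marks the frames belonging to the robot's speech.
def make_static(frames, robot_speaking, feature_names, split, fractional):
    def fraction(selected, name):
        hits = sum(1 for f in selected if name in f)
        if not selected:
            return 0.0
        # fractional: share of frames with the signal; binary: 1 if ever present
        return hits / len(selected) if fractional else float(hits > 0)

    if split:
        robot = [f for f, r in zip(frames, robot_speaking) if r]
        user = [f for f, r in zip(frames, robot_speaking) if not r]
        return ([fraction(robot, n) for n in feature_names]
                + [fraction(user, n) for n in feature_names])
    return [fraction(frames, n) for n in feature_names]
</preformat>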
<p>For the LSTM model, data is instead segmented into 100 ms time-frames, which allows it to make decisions that depend on the timing and ordering of the signals. In this data, a feature has the binary value of 1 if it is present at some point in the time-frame, and 0 otherwise, with the exceptions for specific head gestures and speech classification mentioned in section 3.3&#x02014;speech was turned into <italic>ongoing speech</italic> until its final time-frame, and head gestures were turned into <italic>ongoing head gesture</italic>.</p>
</sec>
</sec>
</sec>
<sec sec-type="results" id="s4">
<title>4. Results</title>
<p>We split our results into two parts. In section 4.1, the data that we collected and annotated is analysed for statistical patterns. In section 4.2, we investigate to what extent it is possible to predict a clip&#x00027;s polarity from its multimodal signals using various statistical models.</p>
<sec>
<title>4.1. Data Analysis</title>
<p>A summary of the signals that were annotated can be seen in <xref ref-type="table" rid="T2">Table 2</xref>. This table groups annotated signals by modality as <bold>pose</bold>, <bold>facial expressions</bold>, <bold>gaze direction</bold>, <bold>head gestures</bold>, and <bold>speech</bold>. The duration of the signal is not taken into account here. The <italic>positive</italic> class has a clear correlation with head nods, and the <italic>negative</italic> class has a clear correlation with the <italic>speech was no</italic> class, but the <italic>neutral</italic> class is mostly characterised by a lack of signals. It is never the class with the highest proportion of a signal, and when a signal appears more often in neutral clips than in positive <italic>or</italic> negative clips, it is usually a signal that we would assume to be ambiguous, like <italic>arms misc</italic> or <italic>head gesture was single tilt</italic>.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>How common each signal detailed in section 3.3 is, clip-by-clip, for all clips, positive clips, negative clips and neutral clips.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Modality</bold></th>
<th valign="top" align="left"><bold>Signal</bold></th>
<th valign="top" align="center"><bold>All (%)</bold></th>
<th valign="top" align="center"><bold>Positive (%)</bold></th>
<th valign="top" align="center"><bold>Negative (%)</bold></th>
<th valign="top" align="center"><bold>Neutral (%)</bold></th>
<th valign="top" align="center"><bold>Sign. 3<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref></bold></th>
<th valign="top" align="center"><bold>Sign. &#x000B1;<xref ref-type="table-fn" rid="TN2"><sup>&#x02021;</sup></xref></bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="middle" align="left" rowspan="4">Pose</td>
<td valign="top" align="left">Cross arms</td>
<td valign="top" align="center">9.2</td>
<td valign="top" align="center">9.1</td>
<td valign="top" align="center">11.9</td>
<td valign="top" align="center">7.9</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Arms behind the back</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center">0.1</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Arms misc</td>
<td valign="top" align="center">31.8</td>
<td valign="top" align="center">33.1</td>
<td valign="top" align="center">28.2</td>
<td valign="top" align="center">31.1</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Shrug</td>
<td valign="top" align="center">0.4</td>
<td valign="top" align="center">0.4</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
 <tr style="border-top: thin solid #000000;">
<td valign="middle" align="left" rowspan="6">Face</td>
<td valign="top" align="left">Eyebrow raise</td>
<td valign="top" align="center">9.8</td>
<td valign="top" align="center">11.6</td>
<td valign="top" align="center">12.2</td>
<td valign="top" align="center">4.3</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN5"><sup>&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Frown</td>
<td valign="top" align="center">12.9</td>
<td valign="top" align="center">8.5</td>
<td valign="top" align="center"><bold>32.4</bold></td>
<td valign="top" align="center">11.2</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Facial laughter</td>
<td valign="top" align="center">8.3</td>
<td valign="top" align="center">8.8</td>
<td valign="top" align="center">10.9</td>
<td valign="top" align="center">5.5</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Lip pout</td>
<td valign="top" align="center">4.8</td>
<td valign="top" align="center">6.0</td>
<td valign="top" align="center">5.1</td>
<td valign="top" align="center">1.8</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN3"><sup>&#x0002A;</sup></xref></td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Mouth miscellaneous</td>
<td valign="top" align="center">18.4</td>
<td valign="top" align="center">20.6</td>
<td valign="top" align="center">24.4</td>
<td valign="top" align="center">9.4</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Smile</td>
<td valign="top" align="center">25.5</td>
<td valign="top" align="center">29.7</td>
<td valign="top" align="center">34.0</td>
<td valign="top" align="center">10.4</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center">ns</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="middle" align="left" rowspan="3">Gaze</td>
<td valign="top" align="left">Gaze on miscellaneous</td>
<td valign="top" align="center">3.0</td>
<td valign="top" align="center">3.4</td>
<td valign="top" align="center">2.2</td>
<td valign="top" align="center">2.6</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Gaze on poster</td>
<td valign="top" align="center">98.6</td>
<td valign="top" align="center">98.8</td>
<td valign="top" align="center">98.1</td>
<td valign="top" align="center">98.2</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Gaze on robot</td>
<td valign="top" align="center">75.0</td>
<td valign="top" align="center">74.7</td>
<td valign="top" align="center">84.9</td>
<td valign="top" align="center">69.7</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="middle" align="left" rowspan="9">Head gestures</td>
<td valign="top" align="left">Head gesture</td>
<td valign="top" align="center">68.1</td>
<td valign="top" align="center"><bold>88.6</bold></td>
<td valign="top" align="center">49.7</td>
<td valign="top" align="center">31.5</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Jerk backwards</td>
<td valign="top" align="center">2.6</td>
<td valign="top" align="center">2.7</td>
<td valign="top" align="center">4.2</td>
<td valign="top" align="center">1.6</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Jerk forwards</td>
<td valign="top" align="center">4.5</td>
<td valign="top" align="center">3.4</td>
<td valign="top" align="center"><bold>9.0</bold></td>
<td valign="top" align="center">4.5</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN4"><sup>&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN5"><sup>&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Multiple head shakes</td>
<td valign="top" align="center">1.5</td>
<td valign="top" align="center">0.6</td>
<td valign="top" align="center"><bold>7.4</bold></td>
<td valign="top" align="center">0.2</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Multiple nods</td>
<td valign="top" align="center">40.3</td>
<td valign="top" align="center"><bold>62.8</bold></td>
<td valign="top" align="center">6.7</td>
<td valign="top" align="center">8.1</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Multiple tilts</td>
<td valign="top" align="center">1.2</td>
<td valign="top" align="center">1.0</td>
<td valign="top" align="center">2.2</td>
<td valign="top" align="center">1.2</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Single head shake</td>
<td valign="top" align="center">3.4</td>
<td valign="top" align="center">3.0</td>
<td valign="top" align="center">6.7</td>
<td valign="top" align="center">2.4</td>
<td valign="top" align="center">ns</td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Single nod</td>
<td valign="top" align="center">20.4</td>
<td valign="top" align="center"><bold>28.2</bold></td>
<td valign="top" align="center">8.3</td>
<td valign="top" align="center">9.6</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Single tilt</td>
<td valign="top" align="center">9.3</td>
<td valign="top" align="center">7.0</td>
<td valign="top" align="center"><bold>16.0</bold></td>
<td valign="top" align="center">10.6</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN5"><sup>&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="middle" align="left" rowspan="10">Speech</td>
<td valign="top" align="left">Speech</td>
<td valign="top" align="center">55.1</td>
<td valign="top" align="center">70.0</td>
<td valign="top" align="center">75.6</td>
<td valign="top" align="center">7.5</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Speech with rising F0</td>
<td valign="top" align="center">30.2</td>
<td valign="top" align="center">36.4</td>
<td valign="top" align="center"><bold>51.3</bold></td>
<td valign="top" align="center">2.6</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN4"><sup>&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Speech with falling F0</td>
<td valign="top" align="center">24.9</td>
<td valign="top" align="center">31.8</td>
<td valign="top" align="center">33.3</td>
<td valign="top" align="center">3.3</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Backchannel</td>
<td valign="top" align="center">19.4</td>
<td valign="top" align="center">26.1</td>
<td valign="top" align="center">17.0</td>
<td valign="top" align="center">5.1</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center">ns</td>
</tr>
<tr>
<td valign="top" align="left">Not backchannel</td>
<td valign="top" align="center">42.4</td>
<td valign="top" align="center">52.2</td>
<td valign="top" align="center"><bold>69.2</bold></td>
<td valign="top" align="center">3.1</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN4"><sup>&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Speech with interrogative</td>
<td valign="top" align="center">2.2</td>
<td valign="top" align="center">0.3</td>
<td valign="top" align="center"><bold>13.1</bold></td>
<td valign="top" align="center">0.2</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">&#x0201C;Can&#x00027;t hear&#x0201D;</td>
<td valign="top" align="center">2.9</td>
<td valign="top" align="center">0.1</td>
<td valign="top" align="center"><bold>18.3</bold></td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">&#x0201C;Can&#x00027;t see&#x0201D;</td>
<td valign="top" align="center">0.3</td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center"><bold>1.9</bold></td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">&#x0201C;No&#x0201D;</td>
<td valign="top" align="center">1.3</td>
<td valign="top" align="center">0.5</td>
<td valign="top" align="center"><bold>6.4</bold></td>
<td valign="top" align="center">0.0</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">&#x0201C;Yes&#x0201D;</td>
<td valign="top" align="center">25.4</td>
<td valign="top" align="center"><bold>41.6</bold></td>
<td valign="top" align="center">3.5</td>
<td valign="top" align="center">0.8</td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
<td valign="top" align="center"><xref ref-type="table-fn" rid="TN6"><sup>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</sup></xref></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Signals are grouped by modalities. The right-most two columns present significance analyses of the distributions of the signals across clip labels, see section 4.1.1. If a signal is significantly different between positive and negative labels, the over-represented class is marked in bold</italic>.</p>
<fn id="TN1">
<label>&#x02020;</label>
<p><italic>&#x003C7;<sup>2</sup> significance (Bonferroni-corrected) of the distribution on all three labels</italic>.</p></fn>
<fn id="TN2">
<label>&#x02021;</label>
<p><italic>&#x003C7;<sup>2</sup> significance (Bonferroni-corrected) of the distribution of only positive and negative clips</italic>.</p></fn>
<p><italic>ns, Not significant</italic>.</p>
<fn id="TN3">
<label>&#x0002A;</label>
<p><italic>p &#x0003C; 0.05/32</italic>.</p></fn>
<fn id="TN4">
<label>&#x0002A;&#x0002A;</label>
<p><italic>p &#x0003C; 0.01/32</italic>.</p></fn>
<fn id="TN5">
<label>&#x0002A;&#x0002A;&#x0002A;</label>
<p><italic>p &#x0003C; 0.001/32</italic>.</p></fn>
<fn id="TN6">
<label>&#x0002A;&#x0002A;&#x0002A;&#x0002A;</label>
<p><italic>p &#x0003C; 0.0001/32</italic>.</p></fn>
</table-wrap-foot>
</table-wrap>
<sec>
<title>4.1.1. Individual Features Correlated With Labels</title>
<p>The rightmost two columns of <xref ref-type="table" rid="T2">Table 2</xref> show analyses of the distribution of feedback signals across clips labelled as positive, negative, and neutral. We use a &#x003C7;<sup>2</sup> test, Bonferroni corrected, to find if any signal is significantly more or less common in any of the three labels. If this is true, illustrated by the &#x0201C;Sign. 3&#x0201D; column in <xref ref-type="table" rid="T2">Table 2</xref>, we perform a follow-up test only on the positive and negative classes, and report &#x003C7;<sup>2</sup> test significance on that test as well. The results of the follow-up test are presented in the column labelled &#x0201C;Sign. &#x000B1;.&#x0201D;</p>
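<p>These tests can be sketched as follows, using SciPy (an assumed toolchain with hypothetical counts); the division by 32 matches the Bonferroni correction shown in the footnotes of <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<preformat>
# Sketch (assumed, not the authors' code) of the Bonferroni-corrected
# chi-squared test behind the "Sign. 3" column.
from scipy.stats import chi2_contingency

N_TESTS = 32   # number of signals tested, hence the /32 correction in Table 2

def signal_significance(with_signal, without_signal, alpha=0.05):
    """Counts of clips with/without the signal, per label [pos, neg, neu]."""
    chi2, p, dof, expected = chi2_contingency([with_signal, without_signal])
    corrected_p = min(p * N_TESTS, 1.0)          # Bonferroni correction
    return corrected_p, corrected_p &lt;= alpha

# Hypothetical counts for one signal:
signal_significance([120, 40, 10], [500, 120, 250])
</preformat>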
<p>Many signals, notably all speech signals and most head movement and facial gesture signals, show strong significance when comparing the distribution of the signal across the three labels. Looking at the percentages of how often the signal shows up across different labels, discussed above, we find that strong significance in &#x0201C;Sign. 3&#x0201D; typically means that the signal is a strong indication that the clip is not neutral. This is clearly the case for speech and its sub-signals, where speech appears in at least 70% of positive and negative clips, but in only 7.5% of neutral clips.</p>
<p>One exception to the pattern that significance in the &#x0201C;Sign. 3&#x0201D; column indicates a non-neutral clip appears to be the &#x0201C;Jerk forwards&#x0201D; head gesture feature, which appears more often in neutral clips than in positive clips. Because of this, the significance in the &#x0201C;Sign. 3&#x0201D; column can be seen as evidence that this signal shows only that a clip is not positive, but that it could still be neutral.</p>
<p>Some signals also have strong significance when comparing the distribution between only positive and negative clips, presented in the rightmost column labelled &#x0201C;Sign &#x000B1;.&#x0201D; Since this is only a comparison of two classes, a quick <italic>post-hoc</italic> test can be performed by simply comparing the percentages of how often the signals appear in positive and negative videos. This is indicated in <xref ref-type="table" rid="T2">Table 2</xref> by marking the most common label in bold, where &#x0201C;Sign &#x000B1;&#x0201D; is significant.</p>
<p>Frowning is significantly connected to negativity, since it appears much more often in negative clips than in positive clips. Speech in general only indicates that a clip is not neutral, but the sub-classifications of speech are strongly correlated with positivity or negativity. The &#x0201C;no&#x0201D; and &#x0201C;yes&#x0201D; features are strongly correlated with negativity and positivity, respectively. Rising F0 is connected to negativity, which can be explained by its connection to a questioning tone, as mentioned in section 2.3.1. Nods and head shakes are obviously strong signals of positivity and negativity, respectively, but head jerks&#x02014;the movement of the head forwards&#x02014;are also correlated with negativity, or, at least, a lack of positivity, as mentioned above. Our interpretation is that this gesture generally conveys confusion and a negatively connoted surprise in our data.</p>
<p><xref ref-type="table" rid="T2">Table 2</xref> shows that speech that is a backchannel is not significantly differently distributed between positive and negative clips, while speech that <italic>is</italic> a backchannel <italic>is</italic> significantly different between positive and negative labels. The proportion of neutral clips is higher for backchannels, and the remaining difference between positive and negative clips is small enough that the difference is not significant for backchannels. We interpret this as non-backchannel speech carrying more verbal information, usually having a more distinct meaning.</p>
<p>We conclude that neutral clips are associated with the <italic>absence</italic> of speech and head gestures, while positivity and negativity are indicated more strongly by differences between sub-labels of head movements and speech.</p>
</sec>
<sec>
<title>4.1.2. Individual Differences</title>
<p>The distribution of positive, negative, and neutral clips in the entire dataset is 59, 15, and 25%, respectively, as seen in <xref ref-type="table" rid="T1">Table 1</xref>. We perform a &#x003C7;<sup>2</sup> test on the participants to see if any participants deviate from the expected proportion of labels. This &#x003C7;<sup>2</sup> test has two degrees of freedom&#x02014;three labels and one participant at a time&#x02014;and we regard the given p-value as significant if it is lower than or equal to a Bonferroni-corrected 0.05/28 &#x02248; 0.0018. 15 out of the 28 participants significantly differ from the mean: the distributions and which participants are significant are presented in <xref ref-type="fig" rid="F2">Figure 2</xref>, where the significantly different individuals are marked in bold. F9, F27, and F33 stand out by their unusually high proportion of neutral clips.</p>
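<p>A sketch of this per-participant test is given below, using SciPy (an assumed toolchain); the expected proportions are the rounded overall distribution quoted above, renormalised so that the expected counts sum to the participant&#x00027;s clip count.</p>
<preformat>
# Sketch of the per-participant goodness-of-fit test: does a participant's
# label distribution deviate from the overall positive/negative/neutral split?
from scipy.stats import chisquare

EXPECTED = [0.59, 0.15, 0.25]     # rounded overall proportions (pos, neg, neu)
ALPHA = 0.05 / 28                 # Bonferroni correction over 28 participants

def participant_deviates(counts):
    """counts: [positive, negative, neutral] clip counts for one participant."""
    total = sum(counts)
    norm = sum(EXPECTED)          # renormalise the rounded proportions
    expected_counts = [p / norm * total for p in EXPECTED]
    statistic, p_value = chisquare(counts, f_exp=expected_counts)
    return p_value &lt;= ALPHA       # df = 2 (three labels)
</preformat>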
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Distributions of neutral, negative and positive clips for each participant. Each distribution is compared to the mean (expected) distribution through a &#x003C7;<sup>2</sup> test, and individuals who deviate significantly from the mean are indicated by bolding their identifier, as in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-741148-g0002.tif"/>
</fig>
<p>We also want to see how the usage of the multimodal signals differs between individuals. A high-level view is obtained by grouping signals by modality; our modality groups can be seen in the leftmost column of <xref ref-type="table" rid="T2">Table 2</xref>. <xref ref-type="fig" rid="F3">Figure 3</xref> shows how the modalities used differ between the 28 individuals in our dataset. <xref ref-type="fig" rid="F3">Figure 3A</xref> displays positive clips, and <xref ref-type="fig" rid="F3">Figure 3B</xref> displays negative clips. As can be expected from <xref ref-type="table" rid="T2">Table 2</xref>, speech is slightly more common in negative clips than in positive clips, but there are outliers. Participant F27 never uses speech, in either positive or negative clips.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The usage of different modalities of feedback from different individuals in our dataset. <bold>(A)</bold> Only when considering clips labelled as <italic>positive</italic> by the crowdworkers. <bold>(B)</bold> Only considering clips labelled as <italic>negative</italic>. In both graphs, distributions that are significantly different from the mean according to a &#x003C7;<sup>2</sup> test are marked by bolding the identifier of the participant, see section 4.1.2.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-741148-g0003.tif"/>
</fig>
<p>We perform a &#x003C7;<sup>2</sup> analysis to see if the usage of modalities differed significantly between individuals. This analysis is separated by positive, negative, and neutral clips, to find differences in how modalities are used to communicate those three labels. The &#x003C7;<sup>2</sup> test is performed on each of the 28 participants individually: it has <italic>df</italic> &#x0003D; 4, since there are five modalities, and the &#x003C7;<sup>2</sup> test has to give a p-value lower than the Bonferroni-corrected 0.05/28 &#x02248; 0.0018 to be considered significant. The individuals that differ significantly from the mean are denoted with bold labels in <xref ref-type="fig" rid="F3">Figure 3</xref>. 6/28 participants differ significantly from the overall distribution for negative clips, 13/28 differ significantly for neutral clips, and 21/28 differ for positive clips. This tells us that there are significant differences in how individuals choose to use different modalities to give feedback, and that those differences are larger for positive feedback than for negative feedback. Thus, any feedback detection method relying on a single modality is unlikely to work well for all subjects.</p>
</sec>
<sec>
<title>4.1.3. Feature Analysis Over Time</title>
<p>The analysis above only considers whether the signal is present or not in the clip: it does not take the timing or length of the signal into account. While <xref ref-type="table" rid="T2">Table 2</xref> shows that some signals are strongly associated with positivity or negativity by their very presence in a clip, it is also possible that the meaning of a signal could also depend on its timing within a clip, both in relation to the robot&#x00027;s speech and in relation to other signals produced by the human participant. This would not be visible in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<p><xref ref-type="fig" rid="F4">Figure 4A</xref> shows a tendency for positive and negative clips to have user speech after the robot finishes speaking, but negative clips have a peak later than positive clips. <xref ref-type="fig" rid="F4">Figure 4B</xref> shows the timing of gaze on the robot, and <xref ref-type="fig" rid="F4">Figure 4C</xref> shows the timing of gaze on the poster. These graphs show a tendency for participants in negative clips to gaze at the robot after it stops speaking. In positive clips, there is instead tendency to gaze at the poster after the robot stops speaking. Our initial hypothesis was that these patterns indicated a larger trend in the data for positively polarised signals to appear earlier in the user&#x00027;s turn and be shorter, and we believed that these patterns would be of the type that a timing-aware classification model would outperform one that was not timing-aware. We will come back to this hypothesis in the next section.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p><bold>(A&#x02013;C)</bold> Where in the clips the features <italic>ongoing speech, gaze on robot</italic>, and <italic>gaze on poster</italic>, from top to bottom, as described in section 3.3, appear. The horizontal axis has been normalised and split into two parts; on the left, the part of the clip where the robot speaks is represented with four points, indicating whether the signal appears in the first 25% of the robot&#x00027;s speech, between 25 and 50%, between 50 and 75%, and finally in the last 25% of the part of the clip where the robot speaks. The same split is done for the part of the clip where the robot does not speak, which is similarly represented on the right half of the horizontal axis. The vertical axis shows how many of the clips marked as positive, negative, and neutral contain the signal in that span. Note that all three Y axes are clipped to make the comparison between neutral, negative and positive clips clearer.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-741148-g0004.tif"/>
</fig>
</sec>
</sec>
<sec>
<title>4.2. Statistical Modelling</title>
<p>In this section we analyse to what extent it is possible to use statistical models for predicting the three feedback labels (positive, negative, and neutral) based on the annotated features described in section 3.</p>
<sec>
<title>4.2.1. Comparison of Models</title>
<p>Our data set is split into ten folds for use in ten-fold cross-validation. The mean categorical accuracy and F-score for each model over the ten folds are presented in <xref ref-type="table" rid="T3">Table 3</xref>. For models that output probabilities for each class, the prediction is judged as accurate if the highest-rated class predicted by the model is also the highest-rated class given by the Mechanical Turk annotators as described in section 3.2, breaking ties in favour of neutrality over negativity over positivity.</p>
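<p>A minimal sketch of this evaluation loop is shown below, using scikit-learn with dummy data (an assumed toolchain; plain shuffled folds and macro-averaged F-score are assumptions made for illustration).</p>
<preformat>
# Sketch of the ten-fold evaluation: mean accuracy and macro F-score over
# folds for one model and one feature format (dummy data throughout).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((1000, 20))                                   # dummy clip features
y = rng.choice(["positive", "negative", "neutral"], 1000)    # dummy labels

accuracies, f_scores = [], []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    accuracies.append(accuracy_score(y[test_idx], predictions))
    f_scores.append(f1_score(y[test_idx], predictions, average="macro"))

print(np.mean(accuracies), np.mean(f_scores))
</preformat>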
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Accuracy and F-score for each combination of feature format and statistical model, as presented in section 3.4.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Fractional</bold></th>
<th valign="top" align="left"><bold>Split</bold></th>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Average accuracy (%)</bold></th>
<th valign="top" align="center"><bold>Average F-score</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Multinomial regression</td>
<td valign="top" align="center">85.857</td>
<td valign="top" align="center"><bold>0.814</bold></td>
</tr>
<tr>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="center"><bold>85.998</bold></td>
<td valign="top" align="center">0.811</td>
</tr>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="center">85.372</td>
<td valign="top" align="center">0.805</td>
</tr>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="center">84.971</td>
<td valign="top" align="center">0.804</td>
</tr>
<tr>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="center">84.755</td>
<td valign="top" align="center">0.801</td>
</tr>
<tr>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Multinomial regression</td>
<td valign="top" align="center">84.052</td>
<td valign="top" align="center">0.797</td>
</tr>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Multinomial regression</td>
<td valign="top" align="center">84.414</td>
<td valign="top" align="center">0.796</td>
</tr>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">RPART tree</td>
<td valign="top" align="center">85.006</td>
<td valign="top" align="center">0.795</td>
</tr>
<tr>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">RPART tree</td>
<td valign="top" align="center">84.365</td>
<td valign="top" align="center">0.785</td>
</tr>
<tr>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Multinomial regression</td>
<td valign="top" align="center">82.764</td>
<td valign="top" align="center">0.783</td>
</tr>
<tr>
<td valign="top" align="left">-</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">LSTM</td>
<td valign="top" align="center">83.861</td>
<td valign="top" align="center">0.781</td>
</tr>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">RPART tree</td>
<td valign="top" align="center">83.235</td>
<td valign="top" align="center">0.777</td>
</tr>
<tr>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">RPART tree</td>
<td valign="top" align="center">83.222</td>
<td valign="top" align="center">0.775</td>
</tr>
<tr>
<td valign="top" align="left">-</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Baseline</td>
<td valign="top" align="center">59.2</td>
<td valign="top" align="center">0.248</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The highest values in each column have been marked in bold</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>A baseline model is introduced for comparison. This baseline model always predicts the most common clip class in the training data, which is the positive class for each fold, as can be expected from <xref ref-type="table" rid="T1">Table 1</xref>. The baseline model&#x00027;s accuracy of 59.2% is therefore exactly the proportion of positive clips in the data set as a whole. Its F-score of 0.248 is, as expected, much lower.</p>
<p>The two best-performing models are the multinomial regression model on split and binary data, and the random forest model on split and fractional data. By its nature, the multinomial regression model cannot capture interactions between features, for example if head nods mean something different in combination with some other feature than on their own. The random forest model can in principle consider such interactions between signals, but does not achieve notably higher accuracy or F-score than the multinomial regression model.</p>
</sec>
<sec>
<title>4.2.2. Analysis of Feature Importance</title>
<p><xref ref-type="table" rid="T3">Table 3</xref> shows that multinomial regression and random forests perform similarly well on our data set. To evaluate which <italic>combinations</italic> of features had the strongest classifying power, we perform a meta-analysis with the random forest configuration that achieved the highest accuracy&#x02014;with fractional and split data. For each feature in our data set, a model is trained on only that feature, and the feature that results in the model capable of the highest F-score on the training set is selected. Each combination of two features, where the first was the feature selected in the first step, is then tested in the same way, and then three features, until the F-score does not increase (on both test and training sets) upon adding a new feature. The resulting progression of features is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. Both F-score on the training and test set are reported, but the models and features were only selected by the F-score on the training set.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>The results of the feature importance analysis described in section 4.2.2, for the <bold>random forest</bold> model. Please note that the Y axis has been clamped.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-741148-g0005.tif"/>
</fig>
<p>The features selected by the models show a pattern where orthogonal features are selected first (<italic>ongoing head gesture</italic> during the participant&#x00027;s turn, followed by <italic>ongoing speech</italic> during the participant&#x00027;s turn). Following these, features that refine the information given by the basic features are selected. The selection of <italic>head gesture was single tilt</italic> as the fourth feature seems strange, but the model may select it because its presence means that the head gesture is less likely to be a nod or a head shake; those gestures are split across four separate features, which may make each of them less important on its own.</p>
</sec>
<sec>
<title>4.2.3. Visualisation of RPART Trees</title>
<p>An advantage of the RPART models in <xref ref-type="table" rid="T3">Table 3</xref> is that they are easily visualised to see which features had the highest classifying power. <xref ref-type="fig" rid="F6">Figure 6</xref> shows a tree trained on fractional and split data. The tree first splits on the presence of head gestures, and refines based on the presence of speech if there are no head gestures. If there are head gestures, the tree first attempts to refine based on which head gesture was present, and falls back to classifying based on the presence of speech if this is not possible. The initial splitting by orthogonal high-level features is similar to the order found for random forests in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>The RPART tree for fractional, split data, with fold 6 used as test data.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-741148-g0006.tif"/>
</fig>
</sec>
<sec>
<title>4.2.4. Muting Modalities</title>
<p>The individual modalities shown in <xref ref-type="table" rid="T2">Table 2</xref>&#x02014;pose, face, gaze, head gestures, and speech&#x02014;refer to separate, possibly co-occurring ways to send feedback signals. To explore which of the modalities were less important, and which modalities could be expressed through combinations of other modalities, we train a random forest model on every combination of including and not including each modality. The results are presented in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
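<p>A sketch of this ablation procedure is given below (hypothetical data layout: it assumes a mapping from modality names to feature column indices, and an externally supplied evaluation routine such as the cross-validation loop sketched in section 4.2.1).</p>
<preformat>
# Sketch of the modality-ablation experiment: evaluate one model per subset
# of modalities and record its score.
from itertools import combinations

MODALITIES = ["speech", "head", "face", "gaze", "pose"]

def modality_ablation(X, y, columns_by_modality, meta_columns, evaluate):
    """columns_by_modality: dict of modality name to feature column indices.
    meta_columns (e.g. the 'is elicitation' flag) are always kept."""
    results = {}
    for k in range(len(MODALITIES), -1, -1):
        for kept in combinations(MODALITIES, k):
            cols = list(meta_columns)
            for modality in kept:
                cols.extend(columns_by_modality[modality])
            results[kept] = evaluate(X[:, sorted(cols)], y)
    return results
</preformat>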
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>An ordered presentation of F-scores and accuracies when random forest models are not given certain modalities.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Speech</bold></th>
<th valign="top" align="center"><bold>Head</bold></th>
<th valign="top" align="center"><bold>Face</bold></th>
<th valign="top" align="center"><bold>Gaze</bold></th>
<th valign="top" align="center"><bold>Pose</bold></th>
<th valign="top" align="center"><bold>Average accuracy (%)</bold></th>
<th valign="top" align="center"><bold>Average F-score (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">85.992</td>
<td valign="top" align="center">81.233</td>
</tr>
<tr style="background-color:#b4b3b2">
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">85.934</td>
<td valign="top" align="center">81.078</td>
</tr>
<tr>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">85.648</td>
<td valign="top" align="center">80.858</td>
</tr>
<tr style="background-color:#b4b3b2">
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td/>
<td valign="top" align="center">85.67</td>
<td valign="top" align="center">80.801</td>
</tr>
<tr>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">85.755</td>
<td valign="top" align="center">80.726</td>
</tr>
<tr style="background-color:#b4b3b2">
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">85.637</td>
<td valign="top" align="center">80.696</td>
</tr>
<tr>
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">85.552</td>
<td valign="top" align="center">80.54</td>
</tr>
<tr style="background-color:#b4b3b2">
<td valign="top" align="left">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td/>
<td/>
<td valign="top" align="center">85.492</td>
<td valign="top" align="center">80.425</td>
</tr>
<tr>
<td valign="top" align="left">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">76.332</td>
<td valign="top" align="center">73.329</td>
</tr>
<tr style="background-color:#b4b3b2">
<td valign="top" align="left">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td/>
<td valign="top" align="center">75.06</td>
<td valign="top" align="center">72.065</td>
</tr>
<tr>
<td valign="top" align="left">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">74.59</td>
<td valign="top" align="center">71.668</td>
</tr>
<tr style="background-color:#b4b3b2">
<td valign="top" align="left">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">74.951</td>
<td valign="top" align="center">71.101</td>
</tr>
<tr>
<td valign="top" align="left">&#x02713;</td>
<td/>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">72.088</td>
<td valign="top" align="center">69.749</td>
</tr>
<tr style="background-color:#b4b3b2">
<td valign="top" align="left">&#x02713;</td>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="center">71.704</td>
<td valign="top" align="center">69.719</td>
</tr>
<tr>
<td valign="top" align="left">&#x02713;</td>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">73.002</td>
<td valign="top" align="center">69.499</td>
</tr>
<tr style="background-color:#b4b3b2">
<td valign="top" align="left">&#x02713;</td>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">72.317</td>
<td valign="top" align="center">69.442</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">75.663</td>
<td valign="top" align="center">66.53</td>
</tr>
<tr style="background-color:#b4b3b2">
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">75.325</td>
<td valign="top" align="center">66.138</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">74.574</td>
<td valign="top" align="center">65.635</td>
</tr>
<tr style="background-color:#b4b3b2">
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">74.434</td>
<td valign="top" align="center">65.375</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">74.617</td>
<td valign="top" align="center">64.197</td>
</tr>
<tr style="background-color:#b4b3b2">
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td/>
<td valign="top" align="center">74.335</td>
<td valign="top" align="center">63.347</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">72.392</td>
<td valign="top" align="center">57.376</td>
</tr>
<tr style="background-color:#b4b3b2">
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td/>
<td/>
<td valign="top" align="center">72.067</td>
<td valign="top" align="center">56.641</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">62.698</td>
<td valign="top" align="center">44.189</td>
</tr>
<tr style="background-color:#b4b3b2">
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">61.653</td>
<td valign="top" align="center">42.395</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td/>
<td valign="top" align="center">59.695</td>
<td valign="top" align="center">35.673</td>
</tr>
<tr style="background-color:#b4b3b2">
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">60.569</td>
<td valign="top" align="center">35.137</td>
</tr>
<tr>
<td/>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">59.8</td>
<td valign="top" align="center">34.473</td>
</tr>
<tr style="background-color:#b4b3b2">
<td/>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td/>
<td valign="top" align="center">59.152</td>
<td valign="top" align="center">33.541</td>
</tr>
<tr>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="center">&#x02713;</td>
<td valign="top" align="center">59.107</td>
<td valign="top" align="center">24.945</td>
</tr>
<tr style="background-color:#b4b3b2">
<td/>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="center">59.199</td>
<td valign="top" align="center">24.771</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>This table represents specifically the <bold>random forest, fractional, split data</bold> model which achieves the highest accuracy in <xref ref-type="table" rid="T3">Table 3</xref></italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>If including a certain modality would lead to overfitting, we could expect the model to perform better when excluding that modality. As can be seen in <xref ref-type="table" rid="T4">Table 4</xref>, including every modality does not lead to overfitting&#x02014;at least for a random forest model&#x02014;and there is a consistent ordering in which removing pose has the smallest effect, followed by gaze, face, head gestures, and speech. The model that is trained on data without any multimodal features performs worse than the baseline. We interpret this as a result of overfitting on the <italic>is elicitation</italic> meta-feature that remained.</p>
</sec>
<sec>
<title>4.2.5. Cross-Validation Leaving One Participant Out at a Time</title>
<p>The 10-fold validation we use for our main statistical model evaluation has the advantage of ensuring that test data and training data have comparable distributions of positive, negative, and neutral clips. This is not the case if the choices for training and test data are made on a participant-by-participant basis. However, leaving a single participant at a time out of the training data does illustrate whether their behaviours are similar to or different from the behaviours expressed by the other participants left in the training data. It also better demonstrates whether the trained models generalise to behaviours from individuals they have never seen before. In light of this argument, we perform 28-fold cross-validation on our dataset, one fold per participant, specifically for the random forest model that achieves the highest accuracy in <xref ref-type="table" rid="T3">Table 3</xref>. Each participant is used as test data exactly once, with the others being used as training data. The results are presented in <xref ref-type="fig" rid="F7">Figure 7</xref>. Each participant is represented by the F-score and accuracy of the model where they were the test set.</p>
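<p>A minimal sketch of this leave-one-participant-out procedure is shown below, using scikit-learn&#x00027;s LeaveOneGroupOut splitter. The file name and column names are hypothetical placeholders for the clip-level feature table described above, not the exact identifiers in the published dataset.</p>
<preformat>
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import LeaveOneGroupOut

# One row per clip; file and column names are hypothetical placeholders.
data = pd.read_csv("clips.csv")
features = data.drop(columns=["polarity", "participant"])
labels = data["polarity"]
groups = data["participant"]             # e.g. "F3", "F9", "F27", ...

# Each participant becomes the test set exactly once.
for train_idx, test_idx in LeaveOneGroupOut().split(features, labels, groups):
    model = RandomForestClassifier(n_estimators=500, random_state=0)
    model.fit(features.iloc[train_idx], labels.iloc[train_idx])
    predictions = model.predict(features.iloc[test_idx])
    held_out = groups.iloc[test_idx].iloc[0]
    print(held_out,
          round(f1_score(labels.iloc[test_idx], predictions, average="macro"), 3),
          round(accuracy_score(labels.iloc[test_idx], predictions), 3))
</preformat>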
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>F-score (blue, the left bar of each pair) and accuracy (orange, the right bar of each pair) when training a random forest model on split, fractional data. For each of the 28 models, one individual is left out as test data&#x02014;this is the individual indicated on the X axis label. The rest of the dataset is used as training data. The figure thus shows which individuals use behaviours that are harder to interpret when training only on the behaviours of the other individuals.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-741148-g0007.tif"/>
</fig>
<p>To analyse the results in <xref ref-type="fig" rid="F7">Figure 7</xref>, we want to know whether participants with notably lower accuracy and/or F-score correspond to the bolded participants in <xref ref-type="fig" rid="F3">Figure 3</xref>. Looking at the three participants with the lowest F-score&#x02014;F3, F27, and F33&#x02014;we see that F3 has significantly deviating usage of modalities in both positive and negative clips, F27 deviates from the mean only for positive clips, and F33 only for negative clips. Both F27 and F33 have unusually high proportions of neutral clips (see section 4.1.2), so the model&#x00027;s difficulty in estimating their polarity based on the training data is presumably due to their high rate of neutral clips, corresponding to a low rate of feedback signals overall. Notably, however, F9, who has the highest rate of neutral clips overall with 88/97 (see <xref ref-type="fig" rid="F2">Figure 2</xref>), is one of the participants whose model achieves the highest accuracy in <xref ref-type="fig" rid="F7">Figure 7</xref>. Clearly, for F9, the model is able to generalise that an absence of signals is a sign of neutrality, but this strategy does not work for F33 and F27. F27 and F33 have a lack of speech signals in common&#x02014;F27 does not speak at all&#x02014;while F9 is very active in the speech modality but uses no head gestures.</p>
</sec>
<sec>
<title>4.2.6. Model Evaluations Over Time</title>
<p>The LSTM model presented in section 3.4.4 operates on data given to it in time-frames representing a tenth of a second. While the other statistical models in <xref ref-type="table" rid="T3">Table 3</xref> are not inherently time-aware in this way, they can still be trained and used to give an evaluation over time by passing them data reflecting what has happened up to a point in the clip.</p>
<p><xref ref-type="fig" rid="F8">Figure 8</xref> reports the F-score of models when given data corresponding to the first second of each participant&#x00027;s turn, the first 2 s, and so on. The X axis has been clipped at 5 s, since the values converge at this point. Five seconds corresponds to the end of the participant&#x00027;s turn in most of our clips. Notably, the LSTM model is slightly better than the other statistical models right at the start of the user&#x00027;s turn; we assume this is because the LSTM model has been able to use the timing and presence of signals during the robot&#x00027;s turn into account to create a slightly better assumption of what the user&#x00027;s reaction is going to be. As more time passes, however, all statistical models eventually surpass the LSTM in both F-score and accuracy (not shown), around 2 s in. The models other than the LSTM are only trained on data corresponding to the user&#x00027;s full turn, so the fact that they outperform the LSTM at almost all points in time is a strong indication that the polarity of a clip is mostly defined by the presence of signals, regardless of their timing.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>The F-score of the various models from <xref ref-type="table" rid="T3">Table 3</xref> when they are only given data corresponding to the first second of the participant&#x00027;s turn, the first 2 s of the participant&#x00027;s turn, and so on up to 5 s, at which point the F-scores have converged, even though a small number of clips have a participant turn as long as 11 s.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-741148-g0008.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s5">
<title>5. Discussion</title>
<p>We now return to the questions posed in section 1 and try to answer them in light of the findings from this study.</p>
<p><bold>1. What modalities are most commonly used to convey negative and positive feedback?</bold></p>
<p><xref ref-type="table" rid="T2">Table 2</xref> shows that head-nods (multiples or single nods) are the strongest indicator of positive feedback, whereas head shakes and tilts indicate negative feedback. When it comes to facial expressions, the only clear signal is frowning, which indicates negative feedback. Non-backchannel speech is most often used to express negative feedback, whereas backchannels can be both negative and positive. Rising F0 is also associated with negative feedback. <xref ref-type="table" rid="T2">Table 2</xref> does not tell us which signals are a strong indication of a neutral clip. However, the model analyses we have presented in section 4.2.1 suggest that the strongest indication of a neutral clip is an absence of any strong signals for either positive or negative feedback.</p>
<p>By comparison to the above, there are signals in our dataset that we would have expected to be connected to certain polarities, but which show no such significance. Shrugging is too rare to be a sign of anything, but if it were more common, <xref ref-type="table" rid="T2">Table 2</xref> suggests that it would be a sign of negativity, or at least non-neutrality. Eyebrow raises are not, as we would expect, a sign of negativity, but appear relatively commonly in positively labelled clips as well, indicating that surprise is as often positive as it is negative.</p>
<p>Our scenario and experiment set-up may have affected which signals users tended to use. The turn-taking heuristic we used defaulted to a turn-time of 5 s&#x02014;if the user had not reacted with feedback that could be classified by the Wizard of Oz within this time, the system would produce an elicitation. The Wizard had the capacity to shorten or lengthen the user&#x00027;s turn in response to feedback where this felt natural, but we can see from <xref ref-type="fig" rid="F8">Figure 8</xref> that the models reach their maximum performance after 5 s. Even though our system had the capacity to allow for user turns shorter or longer than 5 s, it appears that users generally synchronised with its preferred cadence of 5-s turns. This cadence of feedback presumably restricted users from reacting with longer speech and sequences of feedback, even when they would have liked to do so. On the other hand, this restriction is not entirely inappropriate for our museum guide scenario&#x02014;a museum guide does not necessarily want their audience to constantly interrupt their presentation, depending on how scripted and prepared the presentation is.</p>
<p>Therefore, we believe that <xref ref-type="table" rid="T2">Tables 2</xref>, <xref ref-type="table" rid="T4">4</xref> accurately depict which modalities and groups of modalities are most appropriate to attend to in the scenario of a presentation agent, but further studies are needed to find out whether this would also hold for other scenarios&#x02014;where the robot is a more conversational agent that drives the interaction less. The relative unimportance of hand gestures from our listeners also matches earlier results from Battersby (<xref ref-type="bibr" rid="B9">2011</xref>).</p>
<p>The results of Kuno et al. (<xref ref-type="bibr" rid="B48">2007</xref>) suggested that nods and gaze were important signals of a user being involved with a presenting robot&#x00027;s presentation. While their results match ours when it comes to head-nods, gaze at first appears to have been more important for their participants than for ours. However, looking at <xref ref-type="fig" rid="F4">Figure 4</xref>, our participants did in fact generally gaze at the poster along with the robot, regardless of whether the clip was positive, negative, or neutral. This feature may not be unimportant for determining whether a participant is engaged in a presentation, but since both positively and negatively classified clips assume that the participant is engaged, the difference in importance is not necessarily a disagreement in results.</p>
<p>Oppenheim et al. (<xref ref-type="bibr" rid="B65">2021</xref>) showed that the feedback responses used by test participants were significantly different depending on whether the speaker gazed at the listener to <italic>supplement, highlight</italic>, or <italic>converse</italic>, with speech being less common than nods, as common as nods, and more common than nods, respectively. Our presenting robot agent predictably gazed at the listener at the end of each line. Judging by the frequencies in <xref ref-type="table" rid="T2">Table 2</xref>, nods, single or multiple, appear in more clips than speech. Our robot&#x00027;s motivation for gaze was always closer to the <italic>supplement</italic> label of Oppenheim et al. (<xref ref-type="bibr" rid="B65">2021</xref>) than to the other two, since our robot had finished speaking by the time it gazed at the user, and never intended to hand over the turn for more than a brief comment. Thus, our results roughly correspond to the proportions seen by Oppenheim et al. (<xref ref-type="bibr" rid="B65">2021</xref>).</p>
<p>Rising F0 is an indication of negativity in our dataset, as seen in <xref ref-type="table" rid="T2">Table 2</xref>. Because of the relatively short user turns, and because user turns were restricted to being feedback or <italic>track 2</italic> comments on the content presented by the robot, we can presume that prosody was not used to invite backchannels or highlight given or non-given information, as mentioned in section 2.3.1. This leaves the use of prosody to mark a proposition as a question, or to ask for more information about some aspect of the information previously presented, as mentioned by Hirschberg and Ward (<xref ref-type="bibr" rid="B32">1995</xref>) and Bartels (<xref ref-type="bibr" rid="B8">1997</xref>). It is possible to use this type of prosody to mark a question that we would have labelled as positive (&#x0201C;And when was that?&#x0201D;, &#x0201C;Why did that happen?&#x0201D;), but one interpretation of our data is that participants to a large extent used such questions to ask the robot to repeat itself, or to explain something they had not understood.</p>
<p>Like Malisz et al. (<xref ref-type="bibr" rid="B54">2016</xref>), we also found that nods more commonly occurred in groups than one by one (see <xref ref-type="table" rid="T2">Table 2</xref>). Malisz et al. (<xref ref-type="bibr" rid="B54">2016</xref>) also found the same pattern for head shakes, which we do not see to a significant degree in our corpus. This could be because Malisz et al. (<xref ref-type="bibr" rid="B54">2016</xref>) see proportionally far fewer head-shakes than we do, registering only 35 head-shakes in a corpus also containing 1,032 nods&#x02014;a completely different ratio from ours, and hard to compare because we specifically elicited negative feedback from our participants by making the robot presenter misspeak. Like Malisz et al. (<xref ref-type="bibr" rid="B54">2016</xref>), however, we also see that single tilts are more common than multiple tilts.</p>
<p><bold>2. Are any modalities redundant or complementary when it comes to expressing positive and negative feedback?</bold></p>
<p><xref ref-type="table" rid="T4">Table 4</xref> tells us that Speech and Head are the most important modality groups and when only using these two modalities, the F-score is quite close to using all modalities (80.4 vs. 81.2%). Thus, even though <xref ref-type="table" rid="T2">Table 2</xref> showed that frowning was associated with negative feedback, Face, Gaze, and Pose do not have much overall impact on the classification of feedback type and can be considered fairly redundant. <xref ref-type="table" rid="T4">Table 4</xref> also shows that when only using Speech or Head on their own, the performance drops significantly (69.7 and 56.6%). Thus, they seem to be highly complimentary to each other.</p>
<p><bold>3. Does the interpretation of feedback as positive or negative change based on its relative timing to other feedback and the statement being reacted to?</bold></p>
<p>We expected the timing and ordering of feedback to affect its interpretation in terms of positivity or negativity, but this does not seem to hold based on the results we have presented. Models that are simply given the presence of a signal, ignoring internal order and timing, perform better at classifying our dataset as positive, negative, or neutral than the timing-aware LSTM model. The three highest-performing models in <xref ref-type="table" rid="T3">Table 3</xref> are split models&#x02014;meaning that they received data that differentiated between signals used during the robot&#x00027;s turn and signals used during the participant&#x00027;s response. This indicates that multinomial regression and random forest models benefit from the distinction between these timings, and that some information is contained in it. However, the timing within the user&#x00027;s turn does not appear to matter, or at least matters much less than the identity of the signal.</p>
<p><xref ref-type="fig" rid="F5">Figures 5</xref>, <xref ref-type="fig" rid="F6">6</xref> show that the features that describe whether a user used a signal during their turn, after the robot stopped speaking, carry more information than the signals from when the robot was speaking. In fact, in both <xref ref-type="fig" rid="F5">Figures 5</xref>, <xref ref-type="fig" rid="F6">6</xref>, the <italic>only</italic> features that appear are signals denoting the user&#x00027;s turn. This tells us that the relative performance advantage of <italic>split</italic> models in <xref ref-type="table" rid="T3">Table 3</xref> is because they were able to ignore what the user did during the robot&#x00027;s turn.</p>
<p>An important question is <bold>why</bold> there do not seem to be timing and ordering effects in our dataset. One explanation is that the scenario&#x02014;passive audiences to a museum guide presenting facts about a painting&#x02014;lends itself to the audience delivering one strong piece of positive feedback when prompted. It is also possible that our agent design prompted this type of behaviour in its audience because of the turn-taking cadence and elicitation patterns. It has previously been established that users mostly use the modalities and signals that they expect a system like ours to recognise (Sidner et al., <xref ref-type="bibr" rid="B75">2006</xref>; Kontogiorgos et al., <xref ref-type="bibr" rid="B45">2021</xref>; Laban et al., <xref ref-type="bibr" rid="B50">2021</xref>).</p>
<p>Another potential explanation of the relative unimportance of timing and ordering is that those effects <italic>are</italic> present in our data, but are not necessary for predicting our positive/negative/neutral labels&#x02014;they could, however, be useful for a more in-depth grounding annotation, using labels similar to those by Clark (<xref ref-type="bibr" rid="B20">1996</xref>).</p>
<p><bold>4. Are there individual differences in the use of modalities to communicate different polarities of feedback?</bold></p>
<p>As reported in section 4.1.2, many participants had distributions of modalities used for expressing negative and positive feedback that differed significantly from the mean. <xref ref-type="fig" rid="F3">Figure 3</xref> illustrates these differences. Speech appears to have been the dominant way to express negative feedback. Positive feedback is expressed with signals that are split between head gestures and speech, especially the &#x0201C;yes&#x0201D; signal, as seen in <xref ref-type="table" rid="T2">Table 2</xref>. Since positive clips are more common than negative or neutral clips in our dataset, it is also not surprising that participants use a larger variety of signals in those clips. We have been unable to find previous literature that describes whether humans generally use more varied signals to express positive feedback than negative feedback.</p>
<p>Speech and head movements are not strictly positive or negative modalities&#x02014;but sub-signals within the modality can be significantly positive or negative, as shown in <xref ref-type="table" rid="T2">Table 2</xref>. Head nods and head shakes are unsurprisingly positive and negative, respectively, in our dataset: &#x0201C;yes&#x0201D; and &#x0201C;no&#x0201D; can be seen as the spoken counterparts of these signals, and are similarly significantly positive and negative. These signals can be seen as encoding attitudinal reactions to the content spoken by the robot&#x02014;they only have a meaning if the user understood what the robot was saying.</p>
<p><xref ref-type="fig" rid="F7">Figure 7</xref> and the arguments presented in section 4.1.2 indicate that the hardest individuals to classify based on training on the other individuals in our dataset are the ones that are disproportionately labelled as neutral because they do not use many feedback signals. Participants like F9, who use feedback in an ambiguous way, are easier to classify as neutral. The problem for our models is not classifying feedback as positive and negative, but rather what to do when that feedback is not present. The neutral label is more common than the negative label in the dataset, so by the numbers, correctly classifying participants as neutral is more important than being able to classify them as negative.</p>
<p>Navarretta et al. (<xref ref-type="bibr" rid="B60">2012</xref>) showed that Finnish participants used single nods more than multiple nods. In our dataset, multiple nods are significantly more common than single nods. This could be explained by many of the participants being Swedish, as Navarretta et al. (<xref ref-type="bibr" rid="B60">2012</xref>) showed that Swedish and Danish subjects preferred multiple nods to single nods&#x02014;and even for those participants who were not native Swedish speakers, it could be argued that they had adopted feedback patterns similar to those of the Swedish environment in which they live. The corpus study by Malisz et al. (<xref ref-type="bibr" rid="B54">2016</xref>) showed that multiple nods were also more common in a German-speaking context, and since most of our participants were Western European, the fact that multiple nods were more common than single nods could be a sign of a regional pattern in which Western Europe prefers multiple nods to single nods. Nonetheless, both single and multiple nods were positive signals, so individual differences in which of the two signals an individual chooses to use would not complicate feedback classification.</p>
</sec>
<sec id="s6">
<title>6. Conclusions and Future Work</title>
<p>In this paper, we have presented an analysis of how humans express negative and positive feedback across different modalities when listening to a robot presenting a piece of art. The results show that the most important information can be found in their speech and head movements (e.g., nods), whereas facial expressions, gaze direction and body pose did not contribute as much. There seems to be more variation between individuals when it comes to how positive feedback is expressed, compared to negative feedback. Often, the very presence of a nod, a head shake, or certain speech is enough to classify an entire reaction as positive or negative, regardless of the context. The precise timing of the feedback does not seem to be important.</p>
<p>For future research, we note that our analyses of the gaze and pose modalities were not as deep as the analyses of the speech and head modalities. An interesting direction for future work in feedback analysis for presentation agents could be to enhance gaze and pose analysis with more detailed sub-signals, like hand gesture sensing and more detailed approximations of gaze targets. We have shown that not much positive or negative information is contained in whether the participant looks at the presented object or the presenting agent, but it is still possible that gaze sub-targets within the presented objects carry information that we were not able to annotate or extract.</p>
<p>We were unable to annotate our dataset with a rich grounding scheme like that of Clark (<xref ref-type="bibr" rid="B20">1996</xref>), and fell back on the labels positive/negative/neutral. It is possible that annotating the data with professional annotators would have allowed the more in-depth annotation to succeed, as in the work by Malisz et al. (<xref ref-type="bibr" rid="B54">2016</xref>). While we did not see the ordering and timing effects that we were expecting to see&#x02014;see Question 3 in section 5&#x02014;it is possible that such effects come into play when the models are asked to perform a more fine-grained classification with four grounding levels, rather than the simpler positive/negative/neutral labels. One advantage of the rich multimodal annotation of our dataset is that many of the signals listed in <xref ref-type="table" rid="T2">Table 2</xref> carry strong implications about what grounding level the classified feedback must be on&#x02014;if our statistical models report that a clip is positive, and the &#x0201C;yes&#x0201D; feature is present in the clip, for example, we can conclude that the feedback must at least mean <italic>understanding</italic>, if not <italic>acceptance</italic>. This allows us to partially reconstruct grounding data akin to the standards of Clark (<xref ref-type="bibr" rid="B20">1996</xref>) and Allwood et al. (<xref ref-type="bibr" rid="B4">1992</xref>) from our simpler classification.</p>
<p>The results are important for the development of future adaptive presentation agents (which could be museum guides or teachers), as they indicate that such systems should focus on the analysis of speech and head movements, and put less focus on the analysis of the audience&#x00027;s facial expressions, gaze or pose. The results indicate that such an agent should be able to determine fairly reliably whether user feedback is positive, negative, or neutral. If positive, the presentation can proceed, and if negative, the agent can try to repair or reformulate the presentation. If only neutral (i.e., absence of) feedback is received for too long, the agent should elicit (positive or negative) feedback from the user (depending on the <italic>grounding criterion</italic>, as discussed in section 2.2). An example of such a framework, where this kind of classification would be of direct use, is the model we have presented in Axelsson and Skantze (<xref ref-type="bibr" rid="B6">2020</xref>).</p>
</sec>
<sec sec-type="data-availability" id="s7">
<title>Data Availability Statement</title>
<p>The raw ELAN files as well as data files used to train the statistical models in this paper are published with DOI 10.5281/zenodo.5078235, accessible at <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.5078235">https://doi.org/10.5281/zenodo.5078235</ext-link>.</p>
</sec>
<sec id="s8">
<title>Ethics Statement</title>
<p>Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.</p>
</sec>
<sec id="s9">
<title>Author Contributions</title>
<p>AA collected and analysed the data from the experiments, programmed the statistical models, and analysed their behaviours. GS supervised AA and co-wrote the paper. Both authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>This work was supported by the SSF (Swedish Foundation for Strategic Research) project Co-adaptive Human-Robot Interactive Systems (COIN).</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>AA is a Ph.D. student at KTH&#x00027;s division of Speech, Music and Hearing. GS is a Professor in Speech Technology at KTH and co-founder of Furhat Robotics.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack><p>We would like to thank the annotators of our dataset and the reviewers for their insightful comments.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Agarwal</surname> <given-names>A.</given-names></name> <name><surname>Yadav</surname> <given-names>A.</given-names></name> <name><surname>Vishwakarma</surname> <given-names>D. K.</given-names></name></person-group> (<year>2019</year>). <article-title>Multimodal sentiment analysis via RNN variants</article-title>, in <source>Proceedings - 2019 IEEE/ACIS 4th International Conference on Big Data, Cloud Computing, and Data Science, BCD 2019</source> (<publisher-loc>Honolulu, HI</publisher-loc>), <fpage>19</fpage>&#x02013;<lpage>23</lpage>. <pub-id pub-id-type="doi">10.1109/BCD.2019.8885108</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Al Moubayed</surname> <given-names>S.</given-names></name> <name><surname>Beskow</surname> <given-names>J.</given-names></name> <name><surname>Skantze</surname> <given-names>G.</given-names></name> <name><surname>Granstr&#x000F6;m</surname> <given-names>B.</given-names></name></person-group> (<year>2012</year>). <article-title>FurHat: a back-projected human-like robot head for multiparty human-machine interaction</article-title>, in <source>Lecture Notes in Computer Science</source> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>114</fpage>&#x02013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-34584-5_9</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Allwood</surname> <given-names>J.</given-names></name> <name><surname>Cerrato</surname> <given-names>L.</given-names></name> <name><surname>Dybkjaer</surname> <given-names>L.</given-names></name></person-group> (<year>2007</year>). <article-title>The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena</article-title>. <source>Lang. Resour. Eval.</source> <volume>41</volume>, <fpage>273</fpage>&#x02013;<lpage>287</lpage>. <pub-id pub-id-type="doi">10.1007/s10579-007-9061-5</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Allwood</surname> <given-names>J.</given-names></name> <name><surname>Nivre</surname> <given-names>J.</given-names></name> <name><surname>Ahls&#x000E9;n</surname> <given-names>E.</given-names></name></person-group> (<year>1992</year>). <article-title>On the semantics and pragmatics of linguistic feedback</article-title>. <source>J. Semant</source>. <volume>9</volume>, <fpage>1</fpage>&#x02013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1093/jos/9.1.1</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Axelsson</surname> <given-names>N.</given-names></name> <name><surname>Skantze</surname> <given-names>G.</given-names></name></person-group> (<year>2019</year>). <article-title>Modelling adaptive presentations in human-robot interaction using behaviour trees</article-title>, in <source>SIGDIAL 2019 - 20th Annual Meeting of the Special Interest Group Discourse Dialogue</source> (<publisher-loc>Stockholm</publisher-loc>), <fpage>345</fpage>&#x02013;<lpage>352</lpage>. <pub-id pub-id-type="doi">10.18653/v1/W19-5940</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Axelsson</surname> <given-names>N.</given-names></name> <name><surname>Skantze</surname> <given-names>G.</given-names></name></person-group> (<year>2020</year>). <article-title>Using knowledge graphs and behaviour trees for feedback-aware presentation agents</article-title>, in <source>Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, IVA 2020</source> (<publisher-loc>Glasgow</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1145/3383652.3423884</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baker</surname> <given-names>M.</given-names></name> <name><surname>Hansen</surname> <given-names>T.</given-names></name> <name><surname>Joiner</surname> <given-names>R.</given-names></name> <name><surname>Traum</surname> <given-names>D.</given-names></name></person-group> (<year>1999</year>). <article-title>The role of grounding in collaborative learning tasks</article-title>. <source>Collab. Learn. Cogn. Comput. Approch</source>. <volume>31</volume>, <fpage>31</fpage>&#x02013;<lpage>63</lpage>. <pub-id pub-id-type="pmid">26571290</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bartels</surname> <given-names>C</given-names></name></person-group>. (<year>1997</year>). <article-title>The pragmatics of Wh-question intonation in English</article-title>, in <source>University of Pennsylvania Working Papers in Linguistics</source> (<publisher-loc>Philadelphia, PA</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>17</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Battersby</surname> <given-names>S. A</given-names></name></person-group>. (<year>2011</year>). <source>Moving together: the organisation of non-verbal cues during multiparty conversation</source> (Ph.D. thesis). <publisher-name>Queen Mary University of London</publisher-name>, <publisher-loc>London, United Kingdom</publisher-loc>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bavelas</surname> <given-names>J. B.</given-names></name> <name><surname>Coates</surname> <given-names>L.</given-names></name> <name><surname>Johnson</surname> <given-names>T.</given-names></name></person-group> (<year>2000</year>). <article-title>Listeners as co-narrators</article-title>. <source>J. Pers. Soc. Psychol</source>. <volume>79</volume>, <fpage>941</fpage>&#x02013;<lpage>952</lpage>. <pub-id pub-id-type="doi">10.1037/0022-3514.79.6.941</pub-id><pub-id pub-id-type="pmid">11138763</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bavelas</surname> <given-names>J. B.</given-names></name> <name><surname>Coates</surname> <given-names>L.</given-names></name> <name><surname>Johnson</surname> <given-names>T.</given-names></name></person-group> (<year>2002</year>). <article-title>Listener responses as a collaborative process: the role of gaze</article-title>. <source>J. Commun</source>. <volume>52</volume>, <fpage>566</fpage>&#x02013;<lpage>580</lpage>. <pub-id pub-id-type="doi">10.1111/j.1460-2466.2002.tb02562.x</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Bertrand</surname> <given-names>R.</given-names></name> <name><surname>Ferr&#x000E9;</surname> <given-names>G.</given-names></name> <name><surname>Blache</surname> <given-names>P.</given-names></name> <name><surname>Espesser</surname> <given-names>R.</given-names></name> <name><surname>Rauzy</surname> <given-names>S.</given-names></name></person-group> (<year>2007</year>). <article-title>Backchannels revisited from a multimodal perspective</article-title>, in <source>International Conference on Auditory-Visual Speech Processing 2007 (AVSP2007)</source> (<publisher-loc>Hilvarenbeek</publisher-loc>: <publisher-name>ISCA</publisher-name>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://hal.archives-ouvertes.fr/hal-00244490/document">https://hal.archives-ouvertes.fr/hal-00244490/document</ext-link></citation>
</ref>
<ref id="B13">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Boersma</surname> <given-names>P.</given-names></name> <name><surname>van Heuven</surname> <given-names>V.</given-names></name></person-group> (<year>2001</year>). <article-title>Speak and unspeak with Praat</article-title>. <source>Glot Int</source>. <volume>5</volume>, <fpage>341</fpage>&#x02013;<lpage>347</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.fon.hum.uva.nl/paul/praat.html">https://www.fon.hum.uva.nl/paul/praat.html</ext-link></citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bohus</surname> <given-names>D.</given-names></name> <name><surname>Rudnicky</surname> <given-names>A.</given-names></name></person-group> (<year>2006</year>). <article-title>A &#x0201C;K hypotheses &#x0002B; other&#x0201D; belief updating model</article-title>, in <source>AAAI Workshop- Technical Report</source> (<publisher-loc>Menlo Park, CA</publisher-loc>), <fpage>13</fpage>&#x02013;<lpage>18</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buck</surname> <given-names>R</given-names></name></person-group>. (<year>1980</year>). <article-title>Nonverbal behavior and the theory of emotion: the facial feedback hypothesis</article-title>. <source>J. Pers. Soc. Psychol</source>. <volume>38</volume>, <fpage>811</fpage>&#x02013;<lpage>824</lpage>. <pub-id pub-id-type="doi">10.1037/0022-3514.38.5.811</pub-id><pub-id pub-id-type="pmid">7381683</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Buschmeier</surname> <given-names>H.</given-names></name> <name><surname>Kopp</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>Unveiling the information state with a Bayesian model of the listener</article-title>, in <source>SemDial 2011: Proceedings of the 15th Workshop on the Semantics and Pragmatics of Dialogue</source> (<publisher-loc>Los Angeles, CA</publisher-loc>), <fpage>178</fpage>&#x02013;<lpage>179</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buschmeier</surname> <given-names>H.</given-names></name> <name><surname>Kopp</surname> <given-names>S.</given-names></name></person-group> (<year>2013</year>). <article-title>Co-constructing grounded symbols-feedback and incremental adaptation in human-agent dialogue</article-title>. <source>K&#x000FC;nstliche Intelligenz</source> <volume>27</volume>, <fpage>137</fpage>&#x02013;<lpage>143</lpage>. <pub-id pub-id-type="doi">10.1007/s13218-013-0241-8</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Buschmeier</surname> <given-names>H.</given-names></name> <name><surname>Kopp</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Communicative listener feedback in human-agent interaction: artificial speakers need to be attentive and adaptive: socially interactive agents track</article-title>, in <source>Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS</source> (<publisher-loc>Stockholm</publisher-loc>), <fpage>1213</fpage>&#x02013;<lpage>1221</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Clark</surname> <given-names>H. H</given-names></name></person-group>. (<year>1994</year>). <article-title>Managing problems in speaking</article-title>. <source>Speech Commun</source>. <volume>15</volume>, <fpage>243</fpage>&#x02013;<lpage>250</lpage>. <pub-id pub-id-type="doi">10.1016/0167-6393(94)90075-2</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Clark</surname> <given-names>H. H</given-names></name></person-group>. (<year>1996</year>). <source>Using Language</source>. <publisher-loc>Cambridge, UK</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>. <pub-id pub-id-type="doi">10.1017/CBO9780511620539</pub-id><pub-id pub-id-type="pmid">30886898</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Clark</surname> <given-names>H. H.</given-names></name> <name><surname>Brennan</surname> <given-names>S. E.</given-names></name></person-group> (<year>1991</year>). <article-title>Grounding in communication</article-title>, in <source>Perspectives on Socially Shared Cognition</source>, eds <person-group person-group-type="editor"><name><surname>Resnick</surname> <given-names>L. B.</given-names></name> <name><surname>Levine</surname> <given-names>J. M.</given-names></name> <name><surname>Teasley</surname> <given-names>S. D.</given-names></name></person-group> (<publisher-loc>Pittsburgh, PT</publisher-loc>: <publisher-name>American Psychological Association</publisher-name>), <fpage>127</fpage>&#x02013;<lpage>149</lpage>. <pub-id pub-id-type="doi">10.1037/10096-006</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Clark</surname> <given-names>H. H.</given-names></name> <name><surname>Krych</surname> <given-names>M. A.</given-names></name></person-group> (<year>2004</year>). <article-title>Speaking while monitoring addressees for understanding</article-title>. <source>J. Mem. Lang</source>. <volume>50</volume>, <fpage>62</fpage>&#x02013;<lpage>81</lpage>. <pub-id pub-id-type="doi">10.1016/j.jml.2003.08.004</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colletta</surname> <given-names>J. M.</given-names></name> <name><surname>Pellenq</surname> <given-names>C.</given-names></name> <name><surname>Guidetti</surname> <given-names>M.</given-names></name></person-group> (<year>2010</year>). <article-title>Age-related changes in co-speech gesture and narrative: evidence from French children and adults</article-title>. <source>Speech Commun</source>. <volume>52</volume>, <fpage>565</fpage>&#x02013;<lpage>576</lpage>. <pub-id pub-id-type="doi">10.1016/j.specom.2010.02.009</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Duncan</surname> <given-names>S.</given-names></name> <name><surname>Fiske</surname> <given-names>D. W.</given-names></name></person-group> (<year>1977</year>). <source>Face-to-Face Interaction: Research, Methods, and Theory</source>. <publisher-loc>Oxfordshire</publisher-loc>: <publisher-name>Routledge</publisher-name>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Edinger</surname> <given-names>J. A.</given-names></name> <name><surname>Patterson</surname> <given-names>M. L.</given-names></name></person-group> (<year>1983</year>). <article-title>Nonverbal involvement and social control</article-title>. <source>Psychol. Bull</source>. <volume>93</volume>, <fpage>30</fpage>&#x02013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1037/0033-2909.93.1.30</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Edlund</surname> <given-names>J.</given-names></name> <name><surname>House</surname> <given-names>D.</given-names></name> <name><surname>Skantze</surname> <given-names>G.</given-names></name></person-group> (<year>2005</year>). <article-title>The effects of prosodic features on the interpretation of clarification ellipses</article-title>, in <source>9th European Conference on Speech Communication and Technology</source> (<publisher-loc>Lisbon</publisher-loc>), <fpage>2389</fpage>&#x02013;<lpage>2392</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2005-43</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ekman</surname> <given-names>P</given-names></name></person-group>. (<year>2004</year>). <article-title>Emotional and conversational nonverbal signals</article-title>, in <source>Language, Knowledge, and Representation</source>, eds <person-group person-group-type="editor"><name><surname>Larrazabal</surname> <given-names>J. M.</given-names></name> <name><surname>Miranda</surname> <given-names>L. A. P.</given-names></name></person-group> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>39</fpage>&#x02013;<lpage>50</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-4020-2783-3_3</pub-id><pub-id pub-id-type="pmid">22438875</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goldberg</surname> <given-names>P.</given-names></name> <name><surname>S&#x000FC;mer</surname> <given-names>&#x000D6;.</given-names></name> <name><surname>St&#x000FC;rmer</surname> <given-names>K.</given-names></name> <name><surname>Wagner</surname> <given-names>W.</given-names></name> <name><surname>G&#x000F6;llner</surname> <given-names>R.</given-names></name> <name><surname>Gerjets</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Attentive or not? Toward a machine learning approach to assessing students&#x00027; visible engagement in classroom instruction</article-title>. <source>Educ. Psychol. Rev</source>. <volume>33</volume>, <fpage>27</fpage>&#x02013;<lpage>49</lpage>. <pub-id pub-id-type="doi">10.1007/s10648-019-09514-z</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goswami</surname> <given-names>M.</given-names></name> <name><surname>Manuja</surname> <given-names>M.</given-names></name> <name><surname>Leekha</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>Towards social &#x00026; engaging peer learning: predicting backchanneling and disengagement in children</article-title>. <source>arXiv [preprint]. arXiv:2007.11346</source>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gravano</surname> <given-names>A.</given-names></name> <name><surname>Hirschberg</surname> <given-names>J.</given-names></name></person-group> (<year>2011</year>). <article-title>Turn-taking cues in task-oriented dialogue</article-title>. <source>Comput. Speech Lang</source>. <volume>25</volume>, <fpage>601</fpage>&#x02013;<lpage>634</lpage>. <pub-id pub-id-type="doi">10.1016/j.csl.2010.10.003</pub-id><pub-id pub-id-type="pmid">28443035</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Heylen</surname> <given-names>D</given-names></name></person-group>. (<year>2005</year>). <article-title>Challenges ahead head movements and other social acts in conversations</article-title>, in <source>AISB&#x00027;05 Convention: Proceedings of the Joint Symposium on Virtual Social Agents: Social Presence Cues for Virtual Humanoids Empathic Interaction with Synthetic Characters Mind Minding Agents</source> (<publisher-loc>Hertfordshire</publisher-loc>), <fpage>45</fpage>&#x02013;<lpage>52</lpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hirschberg</surname> <given-names>J.</given-names></name> <name><surname>Ward</surname> <given-names>G.</given-names></name></person-group> (<year>1995</year>). <article-title>The interpretation of the high-rise question contour in English</article-title>. <source>J. Pragmat</source>. <volume>24</volume>, <fpage>407</fpage>&#x02013;<lpage>412</lpage>. <pub-id pub-id-type="doi">10.1016/0378-2166(94)00056-K</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hirst</surname> <given-names>D.</given-names></name> <name><surname>Di Cristo</surname> <given-names>A.</given-names></name></person-group> (<year>1998</year>). <article-title>A survey of intonation systems</article-title>, in <source>Intonation Systems: A Survey of Twenty Languages</source>, eds <person-group person-group-type="editor"><name><surname>Hirst</surname> <given-names>D.</given-names></name> <name><surname>Di Cristo</surname> <given-names>A.</given-names></name></person-group> (<publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>44</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hochreiter</surname> <given-names>S.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>1997</year>). <article-title>Long short-term memory</article-title>. <source>Neural Comput</source>. <volume>9</volume>, <fpage>1735</fpage>&#x02013;<lpage>1780</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id><pub-id pub-id-type="pmid">9377276</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hothorn</surname> <given-names>T.</given-names></name> <name><surname>Hornik</surname> <given-names>K.</given-names></name> <name><surname>Zeileis</surname> <given-names>A.</given-names></name></person-group> (<year>2006</year>). <article-title>Unbiased recursive partitioning: a conditional inference framework</article-title>. <source>J. Comput. Graph. Stat</source>. <volume>15</volume>, <fpage>651</fpage>&#x02013;<lpage>674</lpage>. <pub-id pub-id-type="doi">10.1198/106186006X133933</pub-id><pub-id pub-id-type="pmid">33619988</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hough</surname> <given-names>J.</given-names></name> <name><surname>Schlangen</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>It&#x00027;s not what you do, it&#x00027;s how you do it: grounding uncertainty for a simple robot</article-title>, in <source>ACM/IEEE International Conference on Human-Robot Interaction</source> (<publisher-loc>Vienna</publisher-loc>), <fpage>274</fpage>&#x02013;<lpage>282</lpage>. <pub-id pub-id-type="doi">10.1145/2909824.3020214</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hsieh</surname> <given-names>W. F.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Kasano</surname> <given-names>E.</given-names></name> <name><surname>Simokawara</surname> <given-names>E. S.</given-names></name> <name><surname>Yamaguchi</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>Confidence identification based on the combination of verbal and non-verbal factors in human robot interaction</article-title>, in <source>Proceedings of the International Joint Conference on Neural Networks</source> (<publisher-loc>Budapest</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1109/IJCNN.2019.8851845</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Iio</surname> <given-names>T.</given-names></name> <name><surname>Satake</surname> <given-names>S.</given-names></name> <name><surname>Kanda</surname> <given-names>T.</given-names></name> <name><surname>Hayashi</surname> <given-names>K.</given-names></name> <name><surname>Ferreri</surname> <given-names>F.</given-names></name> <name><surname>Hagita</surname> <given-names>N.</given-names></name></person-group> (<year>2020</year>). <article-title>Human-like guide robot that proactively explains exhibits</article-title>. <source>Int. J. Soc. Robot</source>. <volume>12</volume>, <fpage>549</fpage>&#x02013;<lpage>566</lpage>. <pub-id pub-id-type="doi">10.1007/s12369-019-00587-y</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jain</surname> <given-names>V.</given-names></name> <name><surname>Leekha</surname> <given-names>M.</given-names></name> <name><surname>Shah</surname> <given-names>R. R.</given-names></name> <name><surname>Shukla</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Exploring semi-supervised learning for predicting listener backchannels</article-title>. <source>arXiv preprint arXiv:2101.01899</source>. <pub-id pub-id-type="doi">10.1145/3411764.3445449</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jimenez-Molina</surname> <given-names>A.</given-names></name> <name><surname>Retamal</surname> <given-names>C.</given-names></name> <name><surname>Lira</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>Using psychophysiological sensors to assess mental workload during web browsing</article-title>. <source>Sensors</source> <volume>18</volume>:<fpage>458</fpage>. <pub-id pub-id-type="doi">10.3390/s18020458</pub-id><pub-id pub-id-type="pmid">29401688</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jokinen</surname> <given-names>K</given-names></name></person-group>. (<year>2009</year>). <article-title>Nonverbal feedback in interactions</article-title>, in <source>Affective Information Processing</source>, eds <person-group person-group-type="editor"><name><surname>Tao</surname> <given-names>J.</given-names></name> <name><surname>Tan</surname> <given-names>T.</given-names></name></person-group> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>227</fpage>&#x02013;<lpage>240</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-84800-306-4_13</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jokinen</surname> <given-names>K.</given-names></name> <name><surname>Majaranta</surname> <given-names>P.</given-names></name></person-group> (<year>2013</year>). <article-title>Eye-gaze and facial expressions as feedback signals in educational interactions</article-title>, in <source>Technologies for Inclusive Education: Beyond Traditional Integration Approaches</source>, eds <person-group person-group-type="editor"><name><surname>Barres</surname> <given-names>D. G.</given-names></name> <name><surname>Carrion</surname> <given-names>Z. C.</given-names></name> <name><surname>Lopez-Cozar</surname> <given-names>R. D.</given-names></name></person-group> (<publisher-loc>Pennsylvania, PN</publisher-loc>: <publisher-name>IGI Global</publisher-name>), <fpage>38</fpage>&#x02013;<lpage>58</lpage>. <pub-id pub-id-type="doi">10.4018/978-1-4666-2530-3.ch003</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kleinke</surname> <given-names>C. L</given-names></name></person-group>. (<year>1986</year>). <article-title>Gaze and eye contact. A research review</article-title>. <source>Psychol. Bull</source>. <volume>100</volume>, <fpage>78</fpage>&#x02013;<lpage>100</lpage>. <pub-id pub-id-type="doi">10.1037/0033-2909.100.1.78</pub-id><pub-id pub-id-type="pmid">3526377</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kontogiorgos</surname> <given-names>D.</given-names></name> <name><surname>Pereira</surname> <given-names>A.</given-names></name> <name><surname>Gustafson</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Estimating uncertainty in task-oriented dialogue</article-title>, in <source>ICMI 2019 - Proceedings of the 2019 International Conference on Multimodal Interaction</source> (<publisher-loc>Suzhou</publisher-loc>), <fpage>414</fpage>&#x02013;<lpage>418</lpage>. <pub-id pub-id-type="doi">10.1145/3340555.3353722</pub-id></citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kontogiorgos</surname> <given-names>D.</given-names></name> <name><surname>Pereira</surname> <given-names>A.</given-names></name> <name><surname>Gustafson</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Grounding behaviours with conversational interfaces: effects of embodiment and failures</article-title>. <source>J. Multimodal User Interfaces</source> <volume>15</volume>, <fpage>239</fpage>&#x02013;<lpage>254</lpage>. <pub-id pub-id-type="doi">10.1007/s12193-021-00366-y</pub-id></citation>
</ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krahmer</surname> <given-names>E.</given-names></name> <name><surname>Swerts</surname> <given-names>M.</given-names></name> <name><surname>Theune</surname> <given-names>M.</given-names></name> <name><surname>Weegels</surname> <given-names>M.</given-names></name></person-group> (<year>2002</year>). <article-title>The dual of denial: two uses of disconfirmations in dialogue and their prosodic correlates</article-title>. <source>Speech Commun</source>. <volume>36</volume>, <fpage>133</fpage>&#x02013;<lpage>145</lpage>. <pub-id pub-id-type="doi">10.1016/S0167-6393(01)00030-9</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krauss</surname> <given-names>R. M.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Chawla</surname> <given-names>P.</given-names></name></person-group> (<year>1996</year>). <article-title>Nonverbal behavior and nonverbal communication: what do conversational hand gestures tell us?</article-title> <source>Adv. Exp. Soc. Psychol</source>. <volume>28</volume>, <fpage>389</fpage>&#x02013;<lpage>450</lpage>. <pub-id pub-id-type="doi">10.1016/S0065-2601(08)60241-5</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kuno</surname> <given-names>Y.</given-names></name> <name><surname>Sadazuka</surname> <given-names>K.</given-names></name> <name><surname>Kawashima</surname> <given-names>M.</given-names></name> <name><surname>Yamazaki</surname> <given-names>K.</given-names></name> <name><surname>Yamazaki</surname> <given-names>A.</given-names></name> <name><surname>Kuzuoka</surname> <given-names>H.</given-names></name></person-group> (<year>2007</year>). <article-title>Museum guide robot based on sociological interaction analysis</article-title>, in <source>Conference on Human Factors in Computing Systems - Proceedings</source> (<publisher-loc>San Jose, CA</publisher-loc>), <fpage>1191</fpage>&#x02013;<lpage>1194</lpage>. <pub-id pub-id-type="doi">10.1145/1240624.1240804</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kuzuoka</surname> <given-names>H.</given-names></name> <name><surname>Suzuki</surname> <given-names>Y.</given-names></name> <name><surname>Yamashita</surname> <given-names>J.</given-names></name> <name><surname>Yamazaki</surname> <given-names>K.</given-names></name></person-group> (<year>2010</year>). <article-title>Reconfiguring spatial formation arrangement by robot body orientation</article-title>, in <source>5th ACM/IEEE International Conference on Human-Robot Interaction, HRI 2010</source> (<publisher-loc>Osaka</publisher-loc>), <fpage>285</fpage>&#x02013;<lpage>292</lpage>. <pub-id pub-id-type="doi">10.1109/HRI.2010.5453182</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Laban</surname> <given-names>G.</given-names></name> <name><surname>George</surname> <given-names>J. N.</given-names></name> <name><surname>Morrison</surname> <given-names>V.</given-names></name> <name><surname>Cross</surname> <given-names>E. S.</given-names></name></person-group> (<year>2021</year>). <article-title>Tell me more! Assessing interactions with social robots from speech</article-title>. <source>Paladyn</source> <volume>12</volume>, <fpage>136</fpage>&#x02013;<lpage>159</lpage>. <pub-id pub-id-type="doi">10.1515/pjbr-2021-0011</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Landis</surname> <given-names>J. R.</given-names></name> <name><surname>Koch</surname> <given-names>G. G.</given-names></name></person-group> (<year>1977</year>). <article-title>The measurement of observer agreement for categorical data</article-title>. <source>Biometrics</source> <volume>33</volume>:<fpage>159</fpage>. <pub-id pub-id-type="doi">10.2307/2529310</pub-id><pub-id pub-id-type="pmid">843571</pub-id></citation></ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>J.</given-names></name> <name><surname>Zheng</surname> <given-names>W. L.</given-names></name> <name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Lu</surname> <given-names>B. L.</given-names></name></person-group> (<year>2019</year>). <article-title>Emotion recognition using multimodal residual LSTM network</article-title>, in <source>MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia</source> (<publisher-loc>Nice</publisher-loc>), <fpage>176</fpage>&#x02013;<lpage>183</lpage>. <pub-id pub-id-type="doi">10.1145/3343031.3350871</pub-id></citation>
</ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Malisz</surname> <given-names>Z.</given-names></name> <name><surname>W&#x00142;odarczak</surname> <given-names>M.</given-names></name> <name><surname>Buschmeier</surname> <given-names>H.</given-names></name> <name><surname>Kopp</surname> <given-names>S.</given-names></name> <name><surname>Wagner</surname> <given-names>P.</given-names></name></person-group> (<year>2012</year>). <article-title>Prosodic characteristics of feedback expressions in distracted and non-distracted listeners</article-title>, in <source>Proceedings of The Listening Talker. An Interdisciplinary Workshop on Natural and Synthetic Modification of Speech in Response to Listening Conditions</source> (<publisher-loc>Edinburgh</publisher-loc>), <fpage>36</fpage>&#x02013;<lpage>39</lpage>.</citation>
</ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Malisz</surname> <given-names>Z.</given-names></name> <name><surname>W&#x00142;odarczak</surname> <given-names>M.</given-names></name> <name><surname>Buschmeier</surname> <given-names>H.</given-names></name> <name><surname>Skubisz</surname> <given-names>J.</given-names></name> <name><surname>Kopp</surname> <given-names>S.</given-names></name> <name><surname>Wagner</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <article-title>The ALICO corpus: analysing the active listener</article-title>. <source>Lang. Resour. Eval</source>. <volume>50</volume>, <fpage>411</fpage>&#x02013;<lpage>442</lpage>. <pub-id pub-id-type="doi">10.1007/s10579-016-9355-6</pub-id></citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McClave</surname> <given-names>E. Z</given-names></name></person-group>. (<year>2000</year>). <article-title>Linguistic functions of head movements in the context of speech</article-title>. <source>J. Pragmat</source>. <volume>32</volume>, <fpage>855</fpage>&#x02013;<lpage>878</lpage>. <pub-id pub-id-type="doi">10.1016/S0378-2166(99)00079-X</pub-id></citation>
</ref>
<ref id="B56">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mehlmann</surname> <given-names>G.</given-names></name> <name><surname>Janowski</surname> <given-names>K.</given-names></name> <name><surname>Andr&#x000E9;</surname> <given-names>E.</given-names></name></person-group> (<year>2016</year>). <article-title>Modeling grounding for interactive social companions</article-title>. <source>K&#x000FC;nstliche Intelligenz</source> <volume>30</volume>, <fpage>45</fpage>&#x02013;<lpage>52</lpage>. <pub-id pub-id-type="doi">10.1007/s13218-015-0397-5</pub-id></citation>
</ref>
<ref id="B57">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mehlmann</surname> <given-names>G.</given-names></name> <name><surname>Janowski</surname> <given-names>K.</given-names></name> <name><surname>H&#x000E4;ring</surname> <given-names>M.</given-names></name> <name><surname>Baur</surname> <given-names>T.</given-names></name> <name><surname>Gebhard</surname> <given-names>P.</given-names></name> <name><surname>Andr&#x000E9;</surname> <given-names>E.</given-names></name></person-group> (<year>2014</year>). <article-title>Exploring a model of gaze for grounding in multimodal HRI</article-title>, in <source>ICMI 2014 - Proceedings of the 2014 International Conference on Multimodal Interaction</source> (<publisher-loc>Istanbul</publisher-loc>), <fpage>247</fpage>&#x02013;<lpage>254</lpage>. <pub-id pub-id-type="doi">10.1145/2663204.2663275</pub-id></citation>
</ref>
<ref id="B58">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nakano</surname> <given-names>Y. I.</given-names></name> <name><surname>Ishii</surname> <given-names>R.</given-names></name></person-group> (<year>2010</year>). <article-title>Estimating user&#x00027;s engagement from eye-gaze behaviors in human-agent conversations</article-title>, in <source>International Conference on Intelligent User Interfaces, Proceedings IUI</source> (<publisher-loc>Hong Kong</publisher-loc>), <fpage>139</fpage>&#x02013;<lpage>148</lpage>. <pub-id pub-id-type="doi">10.1145/1719970.1719990</pub-id></citation>
</ref>
<ref id="B59">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nakatsukasa</surname> <given-names>K.</given-names></name> <name><surname>Loewen</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <article-title>Non-verbal feedback</article-title>, in <source>Corrective Feedback in Second Language Teaching and Learning</source>, eds <person-group person-group-type="editor"><name><surname>Nassaji</surname> <given-names>H.</given-names></name> <name><surname>Kartchava</surname> <given-names>E.</given-names></name></person-group> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Routledge</publisher-name>), <fpage>158</fpage>&#x02013;<lpage>173</lpage>. <pub-id pub-id-type="doi">10.4324/9781315621432-12</pub-id></citation>
</ref>
<ref id="B60">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Navarretta</surname> <given-names>C.</given-names></name> <name><surname>Ahls&#x000E9;n</surname> <given-names>E.</given-names></name> <name><surname>Allwood</surname> <given-names>J.</given-names></name> <name><surname>Jokinen</surname> <given-names>K.</given-names></name> <name><surname>Paggio</surname> <given-names>P.</given-names></name></person-group> (<year>2012</year>). <article-title>Feedback in nordic first-encounters: a comparative study</article-title>, in <source>LREC</source>, eds <person-group person-group-type="editor"><name><surname>Calzolari&#x0201E;</surname> <given-names>N. K.</given-names></name> <name><surname>Choukri</surname> <given-names>T.</given-names></name> <name><surname>Declerck</surname> <given-names>M. U.</given-names></name> <name><surname>Dogan</surname> <given-names>B.</given-names></name> <name><surname>Maegaard</surname>  <given-names>J.</given-names></name> <name><surname>Mariani</surname></name> <etal/></person-group> (<publisher-loc>Istanbul</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>), <fpage>2494</fpage>&#x02013;<lpage>2499</lpage>.</citation>
</ref>
<ref id="B61">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nourbakhsh</surname> <given-names>I. R.</given-names></name> <name><surname>Kunz</surname> <given-names>C.</given-names></name> <name><surname>Willeke</surname> <given-names>T.</given-names></name></person-group> (<year>2003</year>). <article-title>The mobot museum robot installations: a five year experiment</article-title>, in <source>IEEE International Conference on Intelligent Robots and Systems</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>3636</fpage>&#x02013;<lpage>3641</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2003.1249720</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B62">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Novick</surname> <given-names>D</given-names></name></person-group>. (<year>2012</year>). <article-title>Paralinguistic behaviors in dialog as a continuous process</article-title>, in <source>Interdisciplinary Workshop on Feedback Behaviors in Dialog</source> (<publisher-loc>Stevenson, WA</publisher-loc>).</citation>
</ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Novick</surname> <given-names>D.</given-names></name> <name><surname>Gris</surname> <given-names>I.</given-names></name></person-group> (<year>2013</year>). <article-title>Grounding and turn-taking in multimodal multiparty conversation</article-title>, in <source>Lecture Notes in Computer Science</source>, ed <person-group person-group-type="editor"><name><surname>Masaaki</surname> <given-names>K.</given-names></name></person-group> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>97</fpage>&#x02013;<lpage>106</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-39330-3_11</pub-id></citation>
</ref>
<ref id="B64">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Oertel</surname> <given-names>C.</given-names></name> <name><surname>Lopes</surname> <given-names>J.</given-names></name> <name><surname>Yu</surname> <given-names>Y.</given-names></name> <name><surname>Mora</surname> <given-names>K. A. F.</given-names></name> <name><surname>Gustafson</surname> <given-names>J.</given-names></name> <name><surname>Black</surname> <given-names>A. W.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Towards building an attentive artificial listener: on the perception of attentiveness in audio-visual feedback tokens</article-title>, in <source>Proceedings of the 18th ACM International Conference on Multimodal Interaction</source> (<publisher-loc>Tokyo</publisher-loc>), <fpage>21</fpage>&#x02013;<lpage>28</lpage>. <pub-id pub-id-type="doi">10.1145/2993148.2993188</pub-id></citation>
</ref>
<ref id="B65">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Oppenheim</surname> <given-names>J.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <name><surname>Won</surname> <given-names>I.</given-names></name> <name><surname>Huang</surname> <given-names>C. M.</given-names></name></person-group> (<year>2021</year>). <article-title>Mental synchronization in human task demonstration: implications for robot teaching and learning</article-title>, in <source>ACM/IEEE International Conference on Human-Robot Interaction</source> (Virtual Conference), <fpage>470</fpage>&#x02013;<lpage>474</lpage>. <pub-id pub-id-type="doi">10.1145/3434074.3447216</pub-id></citation>
</ref>
<ref id="B66">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Paek</surname> <given-names>T.</given-names></name> <name><surname>Horvitz</surname> <given-names>E.</given-names></name></person-group> (<year>2000</year>). <article-title>Grounding criterion: toward a formal theory of grounding</article-title>.</citation>
</ref>
<ref id="B67">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Park</surname> <given-names>H. W.</given-names></name> <name><surname>Grover</surname> <given-names>I.</given-names></name> <name><surname>Spaulding</surname> <given-names>S.</given-names></name> <name><surname>Gomez</surname> <given-names>L.</given-names></name> <name><surname>Breazeal</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>A model-free affective reinforcement learning approach to personalization of an autonomous social robot companion for early literacy education</article-title>, in <source>33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019</source> (<publisher-loc>Honolulu, HI</publisher-loc>), <fpage>687</fpage>&#x02013;<lpage>694</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v33i01.3301687</pub-id></citation>
</ref>
<ref id="B68">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Peters</surname> <given-names>C.</given-names></name> <name><surname>Pelachaud</surname> <given-names>C.</given-names></name> <name><surname>Bevacqua</surname> <given-names>E.</given-names></name> <name><surname>Mancini</surname> <given-names>M.</given-names></name> <name><surname>Poggi</surname> <given-names>I.</given-names></name></person-group> (<year>2005</year>). <article-title>Engagement capabilities for ECAs</article-title>, in <source>AAMAS&#x00027;05 Workshop: Creating Bonds with ECAs</source>, <publisher-loc>Utrecht</publisher-loc>.</citation>
</ref>
<ref id="B69">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pichl</surname> <given-names>J.</given-names></name> <name><surname>Marek</surname> <given-names>P.</given-names></name> <name><surname>Konr&#x000E1;d</surname> <given-names>J.</given-names></name> <name><surname>Matul&#x000ED;k</surname> <given-names>M.</given-names></name> <name><surname>&#x00160;ediv&#x000FD;</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Alquist 2.0: Alexa prize socialbot based on sub-dialogue models</article-title>. <source>arXiv [preprint]. arXiv:2011.03259</source>.</citation>
</ref>
<ref id="B70">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Poppe</surname> <given-names>R.</given-names></name> <name><surname>Truong</surname> <given-names>K. P.</given-names></name> <name><surname>Reidsma</surname> <given-names>D.</given-names></name> <name><surname>Heylen</surname> <given-names>D.</given-names></name></person-group> (<year>2010</year>). <article-title>Backchannel strategies for artificial listeners</article-title>, in <source>Lecture Notes in Computer Science</source>, eds <person-group person-group-type="editor"><name><surname>Allbeck</surname> <given-names>J.</given-names></name> <name><surname>Badler</surname> <given-names>N.</given-names></name> <name><surname>Bickmore</surname> <given-names>T.</given-names></name> <name><surname>Pelachaud</surname> <given-names>C.</given-names></name> <name><surname>Safonova</surname> <given-names>A.</given-names></name></person-group> (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>146</fpage>&#x02013;<lpage>158</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-15892-6_16</pub-id></citation>
</ref>
<ref id="B71">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Purver</surname> <given-names>M</given-names></name></person-group>. (<year>2004</year>). <source>The Theory and Use of Clarification Requests in Dialogue</source> (Ph.D. thesis). <publisher-name>Department of Computer Sciencce, University of London</publisher-name>, <publisher-loc>England</publisher-loc>. Available Online at: <ext-link ext-link-type="uri" xlink:href="http://www.eecs.qmul.ac.uk/mpurver/papers/purver04thesis.pdf">http://www.eecs.qmul.ac.uk/mpurver/papers/purver04thesis.pdf</ext-link></citation>
</ref>
<ref id="B72">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rajan</surname> <given-names>S.</given-names></name> <name><surname>Craig</surname> <given-names>S. D.</given-names></name> <name><surname>Gholson</surname> <given-names>B.</given-names></name> <name><surname>Person</surname> <given-names>N. K.</given-names></name> <name><surname>Graesser</surname> <given-names>A. C.</given-names></name></person-group> (<year>2001</year>). <article-title>AutoTutor: incorporating back-channel feedback and other human-like conversational behaviors into an intelligent tutoring system</article-title>. <source>Int. J. Speech Technol</source>. <volume>4</volume>, <fpage>117</fpage>&#x02013;<lpage>126</lpage>. <pub-id pub-id-type="doi">10.1023/A:1017319110294</pub-id></citation>
</ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rodr&#x000ED;guez</surname> <given-names>K. J.</given-names></name> <name><surname>Schlangen</surname> <given-names>D.</given-names></name></person-group> (<year>2004</year>). <article-title>Form, intonation and function of clarification requests in german task-oriented spoken dialogues</article-title>, in <source>Proceedings of SemDial 2004</source> (<publisher-loc>Barcelona</publisher-loc>), <fpage>101</fpage>&#x02013;<lpage>108</lpage>.</citation>
</ref>
<ref id="B74">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Romero-Trillo</surname> <given-names>J</given-names></name></person-group>. (<year>2019</year>). <article-title>Prosodic pragmatics and feedback in intercultural communication</article-title>. <source>J. Pragmat</source>. <volume>151</volume>, <fpage>91</fpage>&#x02013;<lpage>102</lpage>. <pub-id pub-id-type="doi">10.1016/j.pragma.2019.02.018</pub-id></citation>
</ref>
<ref id="B75">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sidner</surname> <given-names>C. L.</given-names></name> <name><surname>Lee</surname> <given-names>C.</given-names></name> <name><surname>Morency</surname> <given-names>L. P.</given-names></name> <name><surname>Forlines</surname> <given-names>C.</given-names></name></person-group> (<year>2006</year>). <article-title>The effect of head-nod recognition in human-robot conversation</article-title>, in <source>HRI 2006: Proceedings of the 2006 ACM Conference on Human-Robot Interaction</source> (<publisher-loc>Salt Lake City, UH</publisher-loc>), <fpage>290</fpage>&#x02013;<lpage>296</lpage>. <pub-id pub-id-type="doi">10.1145/1121241.1121291</pub-id></citation>
</ref>
<ref id="B76">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Singh</surname> <given-names>N.</given-names></name> <name><surname>Lee</surname> <given-names>J. J.</given-names></name> <name><surname>Grover</surname> <given-names>I.</given-names></name> <name><surname>Breazeal</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>P2PSTORY: dataset of children as storytellers and listeners in peer-to-peer interactions</article-title>, in <source>Conference on Human Factors in Computing Systems - Proceedings</source> (<publisher-loc>Montr&#x000E9;al, QC</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1145/3173574.3174008</pub-id></citation>
</ref>
<ref id="B77">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Skantze</surname> <given-names>G</given-names></name></person-group>. (<year>2017</year>). <article-title>Towards a general, continuous model of turn-taking in spoken dialogue using LSTM recurrent neural networks</article-title>, in <source>Proceedings of SIGdial</source> (<publisher-loc>Saarbr&#x000FC;cken</publisher-loc>), <fpage>220</fpage>&#x02013;<lpage>230</lpage>. <pub-id pub-id-type="doi">10.18653/v1/W17-5527</pub-id></citation>
</ref>
<ref id="B78">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Skantze</surname> <given-names>G</given-names></name></person-group>. (<year>2021</year>). <article-title>Turn-taking in conversational systems and human-robot interaction: a review</article-title>. <source>Comput. Speech Lang</source>. <volume>67</volume>:<fpage>101178</fpage>. <pub-id pub-id-type="doi">10.1016/j.csl.2020.101178</pub-id></citation>
</ref>
<ref id="B79">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Skantze</surname> <given-names>G.</given-names></name> <name><surname>Hjalmarsson</surname> <given-names>A.</given-names></name> <name><surname>Oertel</surname> <given-names>C.</given-names></name></person-group> (<year>2014</year>). <article-title>Turn-taking, feedback and joint attention in situated human-robot interaction</article-title>. <source>Speech Commun</source>. <volume>65</volume>, <fpage>50</fpage>&#x02013;<lpage>66</lpage>. <pub-id pub-id-type="doi">10.1016/j.specom.2014.05.005</pub-id></citation>
</ref>
<ref id="B80">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Soldner</surname> <given-names>F.</given-names></name> <name><surname>P&#x000E9;rez-Rosas</surname> <given-names>V.</given-names></name> <name><surname>Mihalcea</surname> <given-names>R.</given-names></name></person-group> (<year>2019</year>). <article-title>Box of lies: multimodal deception detection in dialogues</article-title>, in <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source> (<publisher-loc>Minneapolis, MN</publisher-loc>), <fpage>1768</fpage>&#x02013;<lpage>1777</lpage>. <pub-id pub-id-type="doi">10.18653/v1/N19-1175</pub-id></citation>
</ref>
<ref id="B81">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stivers</surname> <given-names>T</given-names></name></person-group>. (<year>2008</year>). <article-title>Stance, alignment, and affiliation during storytelling: when nodding is a token of affiliation</article-title>. <source>Res. Lang. Soc. Interact</source>. <volume>41</volume>, <fpage>31</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1080/08351810701691123</pub-id></citation>
</ref>
<ref id="B82">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Stocksmeier</surname> <given-names>T.</given-names></name> <name><surname>Kopp</surname> <given-names>S.</given-names></name> <name><surname>Gibbon</surname> <given-names>D.</given-names></name></person-group> (<year>2007</year>). <article-title>Synthesis of prosodic attitudinal variants in German backchannel JA</article-title>, in <source>Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH</source> (<publisher-loc>Antwerp</publisher-loc>), <fpage>409</fpage>&#x02013;<lpage>412</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2007-232</pub-id></citation>
</ref>
<ref id="B83">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>M.</given-names></name> <name><surname>Mou</surname> <given-names>Y.</given-names></name> <name><surname>Xie</surname> <given-names>H.</given-names></name> <name><surname>Xia</surname> <given-names>M.</given-names></name> <name><surname>Wong</surname> <given-names>M.</given-names></name> <name><surname>Ma</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <article-title>Estimating emotional intensity from body poses for human-robot interaction</article-title>. <source>arXiv [preprint]. arXiv:1904.09435</source>.</citation>
</ref>
<ref id="B84">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Svetnik</surname> <given-names>V.</given-names></name> <name><surname>Liaw</surname> <given-names>A.</given-names></name> <name><surname>Tong</surname> <given-names>C.</given-names></name> <name><surname>Christopher Culberson</surname> <given-names>J.</given-names></name> <name><surname>Sheridan</surname> <given-names>R. P.</given-names></name> <name><surname>Feuston</surname> <given-names>B. P.</given-names></name></person-group> (<year>2003</year>). <article-title>Random forest: a classification and regression tool for compound classification and QSAR modeling</article-title>. <source>J. Chem. Inform. Comput. Sci</source>. <volume>43</volume>, <fpage>1947</fpage>&#x02013;<lpage>1958</lpage>. <pub-id pub-id-type="doi">10.1021/ci034160g</pub-id><pub-id pub-id-type="pmid">14632445</pub-id></citation></ref>
<ref id="B85">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thepsoonthorn</surname> <given-names>C.</given-names></name> <name><surname>Yokozuka</surname> <given-names>T.</given-names></name> <name><surname>Miura</surname> <given-names>S.</given-names></name> <name><surname>Ogawa</surname> <given-names>K.</given-names></name> <name><surname>Miyake</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Prior knowledge facilitates mutual gaze convergence and head nodding synchrony in face-to-face communication</article-title>. <source>Sci. Rep</source>. <volume>6</volume>, <fpage>1</fpage>&#x02013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1038/srep38261</pub-id><pub-id pub-id-type="pmid">27910902</pub-id></citation></ref>
<ref id="B86">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tolins</surname> <given-names>J.</given-names></name> <name><surname>Fox Tree</surname> <given-names>J. E.</given-names></name></person-group> (<year>2014</year>). <article-title>Addressee backchannels steer narrative development</article-title>. <source>J. Pragmat</source>. <volume>70</volume>, <fpage>152</fpage>&#x02013;<lpage>164</lpage>. <pub-id pub-id-type="doi">10.1016/j.pragma.2014.06.006</pub-id></citation>
</ref>
<ref id="B87">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tozadore</surname> <given-names>D. C.</given-names></name> <name><surname>Romero</surname> <given-names>R. A.</given-names></name></person-group> (<year>2020</year>). <article-title>Multimodal fuzzy assessment for robot behavioral adaptation in educational children-robot interaction</article-title>, in <source>Companion Publication of the 2020 International Conference on Multimodal Interaction</source> (<publisher-loc>Utrecht</publisher-loc>), <fpage>392</fpage>&#x02013;<lpage>399</lpage>. <pub-id pub-id-type="doi">10.1145/3395035.3425201</pub-id></citation>
</ref>
<ref id="B88">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Velentza</surname> <given-names>A. M.</given-names></name> <name><surname>Heinke</surname> <given-names>D.</given-names></name> <name><surname>Wyatt</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Museum robot guides or conventional audio guides? An experimental study</article-title>. <source>Adv. Robot</source>. <volume>34</volume>, <fpage>1571</fpage>&#x02013;<lpage>1580</lpage>. <pub-id pub-id-type="doi">10.1080/01691864.2020.1854113</pub-id><pub-id pub-id-type="pmid">22103235</pub-id></citation></ref>
<ref id="B89">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Verner</surname> <given-names>I. M.</given-names></name> <name><surname>Polishuk</surname> <given-names>A.</given-names></name> <name><surname>Krayner</surname> <given-names>N.</given-names></name></person-group> (<year>2016</year>). <article-title>Science class with RoboThespian: using a robot teacher to make science fun and engage students</article-title>. <source>IEEE Robot. Automat. Mag</source>. <volume>23</volume>, <fpage>74</fpage>&#x02013;<lpage>80</lpage>. <pub-id pub-id-type="doi">10.1109/MRA.2016.2515018</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B90">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Visser</surname> <given-names>T.</given-names></name> <name><surname>Traum</surname> <given-names>D.</given-names></name> <name><surname>DeVault</surname> <given-names>D.</given-names></name> <name><surname>op den Akker</surname> <given-names>R.</given-names></name></person-group> (<year>2014</year>). <article-title>A model for incremental grounding in spoken dialogue systems</article-title>. <source>J. Multimodal User Interfaces</source> <volume>8</volume>, <fpage>61</fpage>&#x02013;<lpage>73</lpage>. <pub-id pub-id-type="doi">10.1007/s12193-013-0147-7</pub-id></citation>
</ref>
<ref id="B91">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ward</surname> <given-names>N. G.</given-names></name> <name><surname>Tsukahara</surname> <given-names>W.</given-names></name></person-group> (<year>2000</year>). <article-title>Prosodic features which cue back-channel responses in English and Japanese</article-title>. <source>J. Pragmat</source>. <volume>38</volume>, <fpage>1177</fpage>&#x02013;<lpage>1207</lpage>. <pub-id pub-id-type="doi">10.1016/S0378-2166(99)00109-5</pub-id></citation>
</ref>
<ref id="B92">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Werfel</surname> <given-names>J</given-names></name></person-group>. (<year>2014</year>). <article-title>Embodied teachable agents: learning by teaching robots</article-title>, in <source>New Research Frontiers for Intelligent Autonomous Systems</source>, (<publisher-loc>Padova, Italy</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation>
</ref>
<ref id="B93">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wilson</surname> <given-names>T.</given-names></name> <name><surname>Wiebe</surname> <given-names>J.</given-names></name> <name><surname>Hoffmann</surname> <given-names>P.</given-names></name></person-group> (<year>2009</year>). <article-title>Recognizing contextual polarity: an exploration of features for phrase-level sentiment analysis</article-title>. <source>Comput. Linguist</source>. <volume>35</volume>, <fpage>399</fpage>&#x02013;<lpage>433</lpage>. <pub-id pub-id-type="doi">10.1162/coli.08-012-R1-06-90</pub-id></citation>
</ref>
<ref id="B94">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamaoka</surname> <given-names>F.</given-names></name> <name><surname>Kanda</surname> <given-names>T.</given-names></name> <name><surname>Ishiguro</surname> <given-names>H.</given-names></name> <name><surname>Hagita</surname> <given-names>N.</given-names></name></person-group> (<year>2010</year>). <article-title>A model of proximity control for information-presenting robots</article-title>. <source>IEEE Trans. Robot</source>. <volume>26</volume>, <fpage>187</fpage>&#x02013;<lpage>195</lpage>. <pub-id pub-id-type="doi">10.1109/TRO.2009.2035747</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
<ref id="B95">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yngve</surname> <given-names>V. H</given-names></name></person-group>. (<year>1970</year>). <article-title>On getting a word in edgewise</article-title>, in <source>CLS-70</source> (<publisher-loc>Chicago, IL</publisher-loc>: <publisher-name>Chicago Linguistics Society</publisher-name>), <fpage>567</fpage>&#x02013;<lpage>578</lpage>.</citation>
</ref>
<ref id="B96">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yousuf</surname> <given-names>M. A.</given-names></name> <name><surname>Kobayashi</surname> <given-names>Y.</given-names></name> <name><surname>Kuno</surname> <given-names>Y.</given-names></name> <name><surname>Yamazaki</surname> <given-names>A.</given-names></name> <name><surname>Yamazaki</surname> <given-names>K.</given-names></name></person-group> (<year>2012</year>). <article-title>Developmen of a mobile museum guide robot that can configure spatial formation with visitors</article-title>, in <source>International Conference on Intelligent Computing</source> (<publisher-loc>Huangshan</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>423</fpage>&#x02013;<lpage>432</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-31588-6_55</pub-id></citation>
</ref>
<ref id="B97">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>C.</given-names></name> <name><surname>Tapus</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Interactive robot learning for multimodal emotion recognition</article-title>, in <source>Lecture Notes in Computer Science</source>, eds <person-group person-group-type="editor"><name><surname>Salichs</surname> <given-names>A. M.</given-names></name> <name><surname>Ge</surname> <given-names>S. S.</given-names></name> <name><surname>Barakova</surname> <given-names>I. E.</given-names></name> <name><surname>Cabibihan</surname> <given-names>J. J.</given-names></name> <name><surname>Wagner</surname> <given-names>A. R.</given-names></name> <name><surname>Castro-Gonzalez</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<publisher-loc>Madrid</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>633</fpage>&#x02013;<lpage>642</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-35888-4_59</pub-id></citation>
</ref>
<ref id="B98">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zaletelj</surname> <given-names>J.</given-names></name> <name><surname>Ko&#x00161;ir</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Predicting students&#x00027; attention in the classroom from Kinect facial and body features</article-title>. <source>Eurasip J. Image Video Process</source>. <volume>2017</volume>, <fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1186/s13640-017-0228-8</pub-id></citation>
</ref>
<ref id="B99">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Beskow</surname> <given-names>J.</given-names></name> <name><surname>Kjellstr&#x000F6;m</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>Look but don&#x00027;t stare: mutual gaze interaction in social robots</article-title>, in <source>Lecture Notes in Computer Science</source>, <fpage>556</fpage>&#x02013;<lpage>566</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-70022-9_55</pub-id></citation>
</ref>
<ref id="B100">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zilka</surname> <given-names>L.</given-names></name> <name><surname>Jurcicek</surname> <given-names>F.</given-names></name></person-group> (<year>2016</year>). <article-title>Incremental LSTM-based dialog state tracker</article-title>, in <source>2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015 - Proceedings</source> (<publisher-loc>Scottsdale, AZ</publisher-loc>), <fpage>757</fpage>&#x02013;<lpage>762</lpage>. <pub-id pub-id-type="doi">10.1109/ASRU.2015.7404864</pub-id><pub-id pub-id-type="pmid">27295638</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>The first painting presented to each participant was Pieter Brueghel&#x00027;s <italic>Tower of Babel</italic>, and the second was Gentile Bellini&#x00027;s <italic>Miracle of the Cross fallen into the channel of Saint Lawrence</italic>.</p></fn>
</fn-group>
</back>
</article>
