<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Hum. Neurosci.</journal-id>
<journal-title>Frontiers in Human Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Hum. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-5161</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnhum.2013.00369</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Hypothesis and Theory Article</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>On the role of crossmodal prediction in audiovisual emotion perception</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Jessen</surname> <given-names>Sarah</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x0002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Kotz</surname> <given-names>Sonja A.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Research Group &#x0201C;Early Social Development,&#x0201D; Max Planck Institute for Human Cognitive and Brain Sciences</institution> <country>Leipzig, Germany</country></aff>
<aff id="aff2"><sup>2</sup><institution>Research Group &#x0201C;Subcortical Contributions to Comprehension&#x0201D;, Department of Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences</institution> <country>Leipzig, Germany</country></aff>
<aff id="aff3"><sup>3</sup><institution>School of Psychological Sciences, University of Manchester</institution> <country>Manchester, UK</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Martin Klasen, RWTH Aachen University, Germany</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Erich Schr&#x000F6;ger, University of Leipzig, Germany; Llu&#x000ED;s Fuentemilla, University of Barcelona, Spain</p></fn>
<fn fn-type="corresp" id="fn001"><p>&#x0002A;Correspondence: Sarah Jessen, Research Group &#x0201C;Early Social Development,&#x0201D; Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstr. 1A, 04103 Leipzig, Germany e-mail: <email>jessen&#x00040;cbs.mpg.de</email></p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>07</month>
<year>2013</year>
</pub-date>
<pub-date pub-type="collection">
<year>2013</year>
</pub-date>
<volume>7</volume>
<elocation-id>369</elocation-id>
<history>
<date date-type="received">
<day>04</day>
<month>04</month>
<year>2013</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>06</month>
<year>2013</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2013 Jessen and Kotz.</copyright-statement>
<copyright-year>2013</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/3.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc.</p>
</license>
</permissions>
<abstract><p>Humans rely on multiple sensory modalities to determine the emotional state of others. In fact, such multisensory perception may be one of the mechanisms explaining the ease and efficiency with which others&#x00027; emotions are recognized. But how and when exactly do the different modalities interact? One aspect of multisensory perception that has received increasing interest in recent years is the concept of <italic>cross-modal prediction</italic>. In emotion perception, as in most other settings, visual information precedes auditory information. The leading visual information can thereby facilitate subsequent auditory processing. While this mechanism has often been described in audiovisual speech perception, so far it has not been addressed in audiovisual emotion perception. Based on the current state of the art in (a) cross-modal prediction and (b) multisensory emotion perception research, we propose that it is essential to consider the former in order to fully understand the latter. Focusing on electroencephalographic (EEG) and magnetoencephalographic (MEG) studies, we provide a brief overview of the current research in both fields. In discussing these findings, we suggest that emotional visual information may allow more reliable prediction of auditory information than non-emotional visual information. In support of this hypothesis, we present a re-analysis of a previous data set that shows an inverse correlation between the N1 EEG response and the duration of visual emotional, but not non-emotional, information. If the assumption that emotional content allows more reliable prediction can be corroborated in future studies, cross-modal prediction is a crucial factor in our understanding of multisensory emotion perception.</p></abstract>
<kwd-group>
<kwd>cross-modal prediction</kwd>
<kwd>emotion</kwd>
<kwd>multisensory</kwd>
<kwd>EEG</kwd>
<kwd>audiovisual</kwd>
</kwd-group>
<counts>
<fig-count count="1"/>
<table-count count="0"/>
<equation-count count="0"/>
<ref-count count="44"/>
<page-count count="7"/>
<word-count count="6328"/>
</counts>
</article-meta>
</front>
<body>
<p>Perceiving others&#x00027; emotions is an important component of everyday social interaction. We can gather such information via somebody&#x00027;s vocal, facial, or body expressions, and by the content of his or her speech. If the information obtained by these different modalities is congruent, a correct interpretation appears to be faster and more efficient. This becomes evident at the behavioral level, for instance, in shorter reaction times (Giard and Peronnet, <xref ref-type="bibr" rid="B18">1999</xref>; Sperdin et al., <xref ref-type="bibr" rid="B36">2009</xref>) and higher accuracy (Giard and Peronnet, <xref ref-type="bibr" rid="B18">1999</xref>; Kreifelts et al., <xref ref-type="bibr" rid="B24">2007</xref>), but also at the neural level where clear differences between unisensory and multisensory processing can be observed. An interaction between complex auditory and visual information can be seen within 100 ms (e.g., van Wassenhove et al., <xref ref-type="bibr" rid="B43">2005</xref>; Stekelenburg and Vroomen, <xref ref-type="bibr" rid="B38">2007</xref>) and involves a large network of brain regions ranging from early uni- and multisensory areas, such as the primary auditory and the primary visual cortex (see, e.g., Calvert et al., <xref ref-type="bibr" rid="B9">1998</xref>, <xref ref-type="bibr" rid="B8">1999</xref>; Ghazanfar and Schroeder, <xref ref-type="bibr" rid="B17">2006</xref>) and the superior temporal gyrus (Calvert et al., <xref ref-type="bibr" rid="B10">2000</xref>; Callan et al., <xref ref-type="bibr" rid="B7">2003</xref>), to higher cognitive brain regions, such as the prefrontal cortex and the cingulate cortex (e.g., Laurienti et al., <xref ref-type="bibr" rid="B26">2003</xref>). These data are interpreted to support the assumption of multisensory facilitation.</p>
<p>It is generally accepted that multisensory perception leads to facilitation; however, the mechanisms underlying such facilitation, especially for complex dynamic stimuli, are not yet fully understood. One mechanism that seems particularly important in the audiovisual perception of complex, ecologically valid information is cross-modal prediction. In a natural context, visual information typically precedes auditory information (Chandrasekaran et al., <xref ref-type="bibr" rid="B12">2009</xref>; Stekelenburg and Vroomen, <xref ref-type="bibr" rid="B39">2012</xref>). Visual information can therefore be used to generate predictions about several aspects of a subsequent sound, such as the time of its onset and its content (e.g., Arnal et al., <xref ref-type="bibr" rid="B1">2009</xref>; Stekelenburg and Vroomen, <xref ref-type="bibr" rid="B39">2012</xref>). Owing to this preparatory information flow, processing of the subsequent auditory information is facilitated. This mechanism can be seen as an instance of predictive coding, as has been discussed for sensory perception in general (see Summerfield and Egner, <xref ref-type="bibr" rid="B40">2009</xref>).</p>
<p>The success and efficiency of cross-modal prediction is influenced by several factors, including attention, motivation, and the emotional state of the observer. Schroeder et al. (<xref ref-type="bibr" rid="B34">2008</xref>) for instance suggest an influence of attention on cross-modal prediction in speech perception. In the present paper, however, we will focus on a different aspect of cross-modal prediction that has largely been neglected: How does the emotional content of the perceived signal influence cross-modal prediction, or, vice versa, what role does cross-modal prediction play in the multisensory perception of emotions? Do emotions lead to a stronger prediction than comparable neutral stimuli or are emotions just another instance of complex salient information?</p>
<p>In the following, we will provide a short overview of recent findings on cross-modal prediction, focusing on electroencephalographic (EEG) and magnetoencephalographic (MEG) results. We will then discuss the role of affective information in cross-modal prediction before outlining the further steps needed to investigate this phenomenon more closely.</p>
<sec>
<title>Cross-modal prediction</title>
<p>Cross-modal prediction of complex stimuli is most commonly studied in audiovisual speech perception (Bernstein et al., <xref ref-type="bibr" rid="B3">2008</xref>; Arnal et al., <xref ref-type="bibr" rid="B1">2009</xref>, <xref ref-type="bibr" rid="B2">2011</xref>). Typically, videos are presented in which a person utters a single syllable. Because visual information starts before the sound&#x00027;s onset, its influence on auditory processing can be investigated.</p>
<p>EEG and MEG studies have shown that the predictability of an auditory signal by visual information affects the brain&#x00027;s response to the auditory information within 100 ms after a sound&#x00027;s onset. The N1 in particular has been studied in this context (e.g., Klucharev et al., <xref ref-type="bibr" rid="B23">2003</xref>; Besle et al., <xref ref-type="bibr" rid="B5">2004</xref>; van Wassenhove et al., <xref ref-type="bibr" rid="B43">2005</xref>), and a reduction of the N1 amplitude has been linked to facilitated processing of audiovisual speech (Besle et al., <xref ref-type="bibr" rid="B4">2009</xref>). Furthermore, the more predictable visual information is, the stronger such facilitation seems to be, as suggested in MEG studies that reported a reduction in M100 latency (Arnal et al., <xref ref-type="bibr" rid="B1">2009</xref>) and amplitude (Davis et al., <xref ref-type="bibr" rid="B14">2008</xref>). Similar results have been obtained in EEG studies; when syllables of different predictability are presented, the syllables with the highest predictability based on visual features lead to the strongest reduction in N1/P2 latency (van Wassenhove et al., <xref ref-type="bibr" rid="B43">2005</xref>).</p>
<p>Cross-modal prediction in complex settings has been investigated not only in speech perception but also in the perception of other audiovisual events, such as everyday actions (e.g., Stekelenburg and Vroomen, <xref ref-type="bibr" rid="B38">2007</xref>, <xref ref-type="bibr" rid="B39">2012</xref>). A reduction in the auditory N1 can be observed only if sufficiently predictive dynamic visual information is present (Stekelenburg and Vroomen, <xref ref-type="bibr" rid="B38">2007</xref>).</p>
<p>Regarding the mechanisms underlying such cross-modal prediction, two distinct pathways have been suggested (Arnal et al., <xref ref-type="bibr" rid="B1">2009</xref>). In a first, indirect pathway, information from early visual areas influences activations in auditory areas via a third, relay area such as the superior temporal sulcus (STS). In a second, direct pathway, a cortico-cortical connection between early visual and early auditory areas is posited without the involvement of any additional area. Interestingly, these two pathways seem to cover different aspects of prediction; while the direct pathway is involved in generating predictions regarding the onset of an auditory stimulus, the indirect pathway rather predicts auditory information at the content-level, for instance, which syllable or sound will be uttered (Arnal et al., <xref ref-type="bibr" rid="B1">2009</xref>). Evidence for a distinction between two pathways also arises from EEG data (Klucharev et al., <xref ref-type="bibr" rid="B23">2003</xref>; Stekelenburg and de Gelder, <xref ref-type="bibr" rid="B37">2004</xref>): while the N1 is assumed to be modulated by predictability of physical stimulus parameters, the P2 seems to be sensitive to the content or the semantic features of the signal (Stekelenburg and Vroomen, <xref ref-type="bibr" rid="B39">2012</xref>).</p>
<p>In recent years, neural oscillations have come into focus as a crucial mechanism underlying cross-modal prediction (e.g., Doesburg et al., <xref ref-type="bibr" rid="B16">2008</xref>; Schroeder et al., <xref ref-type="bibr" rid="B34">2008</xref>; Senkowski et al., <xref ref-type="bibr" rid="B35">2008</xref>; Arnal et al., <xref ref-type="bibr" rid="B2">2011</xref>; Thorne et al., <xref ref-type="bibr" rid="B42">2011</xref>). While the analysis of event-related potentials offers a straightforward and reliable way to investigate brain responses closely time-locked to a specific event, the analysis of oscillatory activity provides a way to analyze changes in the EEG data with more flexible timing. Furthermore, oscillatory brain activity has been suggested as a potential mechanism to mediate the influence of one brain area on another (Buzsaki and Draguhn, <xref ref-type="bibr" rid="B6">2004</xref>). Such a mechanism may, for instance, underlie cross-modal prediction, where information from one sensory area affects the activity in a different sensory area (Kayser et al., <xref ref-type="bibr" rid="B21">2008</xref>; Schroeder et al., <xref ref-type="bibr" rid="B34">2008</xref>; Lakatos et al., <xref ref-type="bibr" rid="B25">2009</xref>). In the case of audiovisual prediction, visual information, processed in primary visual areas, thereby has the capacity to prepare auditory areas for incoming auditory information. However, such an operation takes time (Schroeder et al., <xref ref-type="bibr" rid="B34">2008</xref>), and it is therefore essential that visual information precedes auditory information. Further, the visual information has to provide some information about the upcoming auditory stimulus, such as an expected onset and, preferably, a more detailed specification of the sound.</p>
<p>In summary, cross-modal prediction has been extensively studied in audiovisual speech perception and also in the perception of lower-level audiovisual stimuli. Along with an increasing interest in neural oscillations and their function(s) in recent years, new approaches and possibilities to investigate its underlying mechanisms have been developed. However, the role of cross-modal prediction in emotion perception has received hardly any attention. In the following, we will outline what is known regarding the role of emotions in cross-modal predictions.</p>
</sec>
<sec>
<title>Emotions and cross-modal prediction</title>
<p>Emotion perception is a case that involves cross-modal prediction. Cross-modal prediction likely contributes to the ease and efficiency with which others&#x00027; emotions are recognized. One question that arises is whether emotion perception is just one case of cross-modal prediction among others, or whether it differs substantially from cases of non-emotional cross-modal prediction.</p>
<p>Numerous recent studies have investigated the combined perception of emotions from different modalities (e.g., de Gelder et al., <xref ref-type="bibr" rid="B15">1999</xref>; Pourtois et al., <xref ref-type="bibr" rid="B31">2000</xref>, <xref ref-type="bibr" rid="B30">2002</xref>; for a recent review, see Klasen et al., <xref ref-type="bibr" rid="B22">2012</xref>). Emotional faces, bodies, and voices influence each other at various processing stages.</p>
<p>First brain responses to a mismatch between facial and vocal expressions (de Gelder et al., <xref ref-type="bibr" rid="B15">1999</xref>; Pourtois et al., <xref ref-type="bibr" rid="B31">2000</xref>) or also between body and facial expressions (Meeren et al., <xref ref-type="bibr" rid="B27">2005</xref>) can be observed around 100 ms after stimulus onset. Interactions of matching emotional faces and voices are typically observed slightly later, between 200 and 300 ms (Paulmann et al., <xref ref-type="bibr" rid="B28">2009</xref>), though some studies also report interaction effects in the range of the N1 (Jessen and Kotz, <xref ref-type="bibr" rid="B19">2011</xref>). Besides these early effects, interactions between different modalities can be observed at later processing stages, presumably in limbic areas and higher association cortices (Pourtois et al., <xref ref-type="bibr" rid="B30">2002</xref>; Chen et al., <xref ref-type="bibr" rid="B13">2010</xref>).</p>
<p>However, while the processing of multisensory emotional information has been amply investigated, only recently has the dynamic temporal development of the perceived stimuli come into focus. Classically, most studies used static facial expressions paired with (by their very nature) dynamic vocal expressions (e.g., de Gelder et al., <xref ref-type="bibr" rid="B15">1999</xref>; Pourtois et al., <xref ref-type="bibr" rid="B31">2000</xref>).</p>
<p>While this allows for investigating several aspects of emotion perception under controlled conditions, it is a strong simplification compared to a dynamic multisensory environment. In a natural setting, emotional information usually obeys the same patterns as outlined above: visual information precedes the auditory one. We see an angry face, see a mouth opening, see a breath-intake before we actually hear an outcry or an angry exclamation.</p>
<p>One aspect of such natural emotion perception that cannot be investigated using static stimulus material is the role of prediction in emotion perception. If auditory and visual onsets occur at the same time, we cannot investigate the influence of preceding visual information on the subsequent auditory one. However, two aspects of these studies using static facial expression render them particularly interesting and relevant in the present case.</p>
<p>First, several studies introduced a delay between the onset of a picture and a voice onset in order to differentiate between brain responses to the visual onset and brain responses to the auditory onset (de Gelder et al., <xref ref-type="bibr" rid="B15">1999</xref>; Pourtois et al., <xref ref-type="bibr" rid="B31">2000</xref>, <xref ref-type="bibr" rid="B30">2002</xref>). At the same time, however, such a delay introduces visual, albeit static, information, which allows for the generation of predictions. At which level these predictions can be made depends on the precise experimental setup. While some studies chose a variable delay (de Gelder et al., <xref ref-type="bibr" rid="B15">1999</xref>; Pourtois et al., <xref ref-type="bibr" rid="B31">2000</xref>), allowing for predictions only at the content, but not at the temporal level, others presented auditory information at a fixed delay, which allows for predictions both at the temporal and at a content level (Pourtois et al., <xref ref-type="bibr" rid="B30">2002</xref>). In either case, one can conceive of the results as investigating the influence of static emotional information on subsequent matching or mismatching auditory information.</p>
<p>Second, most studies used a mismatch paradigm, that is, a face and a voice were either of different emotions or one modality was emotional while the other was neutral (de Gelder et al., <xref ref-type="bibr" rid="B15">1999</xref>; Pourtois et al., <xref ref-type="bibr" rid="B31">2000</xref>, <xref ref-type="bibr" rid="B30">2002</xref>). These mismatch settings were then contrasted with matching stimuli, in which a face and a voice conveyed the same emotion (or, in the neutral case, neither conveyed any emotional information). While probably not intended by the researchers, such a design may reduce predictive validity to a rather large degree; after the first few trials, the participant learns that a given facial expression may be followed either by the same or by a different emotion with equal probability. Conscious predictions can be made neither at the content (emotional) level nor at a more physical level based on facial features. Hence, visual information provides only limited information about subsequent auditory information. Therefore, data obtained from these studies inform us about multisensory emotion processing under conditions in which predictive capacities are reduced. Note, however, that it is unclear to what extent one experimental session can reduce the predictions generated by facial expressions, or rather, how much of these predictions is automatic (either innate or due to high familiarity) so that they cannot be overridden by a few trials in which they are violated. In fact, the violation responses observed in these studies show that predictions about an upcoming sound are retained to a certain degree. However, some modulation of prediction does seem to take place, as for instance a mismatch negativity can be observed for a matching face&#x02014;voice pairing preceded by a number of mismatching pairings (de Gelder et al., <xref ref-type="bibr" rid="B15">1999</xref>).</p>
<p>The results of these studies are inconsistent with respect to the influence visual information has on auditory information processing. While some report larger N1 responses for matching compared to non-matching face&#x02014;voice pairings (Pourtois et al., <xref ref-type="bibr" rid="B31">2000</xref>), others do not find differences in the N1 (Pourtois et al., <xref ref-type="bibr" rid="B30">2002</xref>). Instead, they report later differences between matching and non-matching face&#x02014;voice pairings, for instance in the P2b (Pourtois et al., <xref ref-type="bibr" rid="B30">2002</xref>).</p>
<p>A different approach to investigate the face&#x02014;voice interaction has been to present emotional facial expressions either alone or combined with matching vocal information (Paulmann et al., <xref ref-type="bibr" rid="B28">2009</xref>). In this study, the onset of visual and auditory information was synchronized, thereby excluding any visual prediction before the sound onset. In such a setting, first effects of emotional information were observed in the P2, showing larger amplitudes for angry compared to neutral stimuli. While the use of matching stimuli presented in either a uni- or a multisensory way provides a promising design to investigate cross-modal prediction, the lack of any audiovisual delay prevents us from drawing any specific conclusions regarding predictive mechanisms.</p>
<p>Overall, visual emotional information does seem to influence auditory processing at a very early stage. However, studies investigating this influence in a natural setting are largely missing.</p>
<p>In two recent EEG studies, we investigated the interaction between emotional body and voice information by means of video material in order to overcome some of the limitations of previous studies (Jessen and Kotz, <xref ref-type="bibr" rid="B19">2011</xref>; Jessen et al., <xref ref-type="bibr" rid="B20">2012</xref>). We presented videos in which actors expressed different emotional states with or without matching vocal expressions. The emotional states &#x0201C;anger&#x0201D; and &#x0201C;fear&#x0201D; were depicted via body expressions as well as short vocalizations (e.g., &#x0201C;ah&#x0201D;). Furthermore, we included a non-emotional control condition (&#x0201C;neutral&#x0201D;), in which the actor performed a movement that did not express any specific emotion and uttered the same vocalization with a neutral tone of voice. The delay between visual and auditory onsets was different for each stimulus, as the timing of the original recording of the videos was not manipulated. Hence, the vocalization occurred with a variable delay after the actor had started to move. In both studies, we observed smaller N1 amplitudes for emotional compared to neutral stimuli, as well as for audiovisual compared to unisensory auditory stimuli, irrespective of the emotional content. The amplitude reduction for audiovisual stimuli resembles that observed by Stekelenburg and Vroomen (<xref ref-type="bibr" rid="B38">2007</xref>) for non-emotional stimuli, supporting the notion that the observed effect can be attributed to predictive visual information. However, we did not find an interaction with emotional content.</p>
<p>While we did not manipulate predictive validity of the visual information in these studies, we were still interested in whether the amount of available visual information influences auditory processing. We therefore correlated the length of the audiovisual delay for each stimulus with the N1 amplitude in response to that stimulus obtained in the audiovisual condition of the experiment reported in Jessen et al. (<xref ref-type="bibr" rid="B20">2012</xref>) (Figure <xref ref-type="fig" rid="F1">1</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p><bold>Correlation between audiovisual delay and N1 amplitude.</bold> In one of our studies (Jessen et al., <xref ref-type="bibr" rid="B20">2012</xref>), we presented 24 participants with videos, in which different emotions were expressed by body and vocal expressions simultaneously. The delay between the visual and the auditory onset was different for each stimulus. In order to investigate the influence that a different amount of visual information has on the subsequent auditory processing, we correlated the length of the audiovisual delay with the N1 amplitude separately for each emotion. Trials in which the N1 amplitude differed more than 3 standard deviations from the mean were excluded from further analysis. Dots represent individual trials. A linear mixed model including the random factor subject and the fixed factors emotion and delay reveals a significant interaction between the fixed factors [<italic>F</italic><sub>(1, 2408)</sub> &#x0003D; 33.43, <italic>p</italic> &#x0003C; 0.0001]. It can be seen that for both emotions, an inverse relation between N1 amplitude and delay exists: the longer the delay, the smaller the N1 amplitude [anger: <italic>F</italic><sub>(1, 805)</sub> &#x0003D; 10.98, <italic>p</italic> &#x0003C; 0.001; fear: <italic>F</italic><sub>(1, 773)</sub> &#x0003D; 32.50, <italic>p</italic> &#x0003C; 0.0001]. The reverse pattern occurs in the neutral condition; here, longer delays correspond to larger N1 amplitudes [<italic>F</italic><sub>(1, 784)</sub> &#x0003D; 17.19, <italic>p</italic> &#x0003C; 0.0001].</p></caption>
<graphic xlink:href="fnhum-07-00369-g0001.tif"/>
</fig>
<p>We found an inverse relation between delay and N1 amplitude for both emotion conditions, that is, the longer the delay between visual and auditory onset, the <italic>smaller</italic> the amplitude of the subsequent N1. The opposite pattern was observed in the neutral condition; the longer the delay, the <italic>larger</italic> the N1 amplitude.</p>
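<p>The logic of the per-condition analysis summarized in Figure <xref ref-type="fig" rid="F1">1</xref> can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the analysis that was actually run: it uses a plain Pearson correlation rather than the linear mixed model with subject as a random factor, it treats &#x0201C;amplitude&#x0201D; as the absolute size of the negative N1 deflection, and all function names and trial values are hypothetical and synthetic.</p>
<preformat><![CDATA[
```python
from statistics import mean, stdev


def exclude_outliers(trials, n_sd=3.0):
    """Keep (delay_ms, n1_uv) trials whose N1 lies within n_sd SDs of the mean."""
    amplitudes = [amp for _, amp in trials]
    centre, spread = mean(amplitudes), stdev(amplitudes)
    return [(d, a) for d, a in trials if abs(a - centre) <= n_sd * spread]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def delay_amplitude_correlation(trials_by_condition):
    """Correlate audiovisual delay with N1 magnitude, separately per condition."""
    result = {}
    for condition, trials in trials_by_condition.items():
        kept = exclude_outliers(trials)
        delays = [d for d, _ in kept]
        # Assumption: "amplitude" means the size of the negative N1 deflection.
        magnitudes = [abs(a) for _, a in kept]
        result[condition] = pearson_r(delays, magnitudes)
    return result


# Synthetic toy trials (delay in ms, N1 in microvolts), NOT data from the study.
toy_trials = {
    "anger": [(600, -6.0), (900, -5.2), (1200, -4.1), (1500, -3.0)],
    "neutral": [(1000, -3.1), (1400, -3.9), (1800, -5.0), (2200, -6.2)],
}
correlations = delay_amplitude_correlation(toy_trials)
for condition, r in correlations.items():
    print(condition, round(r, 2))
```
]]></preformat>
<p>In this sketch, a negative coefficient corresponds to the pattern described for anger and fear (longer delay, smaller N1), and a positive coefficient to the pattern described for the neutral condition. Because trials are nested within participants, a mixed model remains the appropriate tool for real data; the sketch only illustrates the direction of the relation.</p>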
<p>As outlined above, reduced N1 amplitudes in cross-modal predictive settings have commonly been interpreted as reflecting increased (temporal) prediction. If we assume that a longer stretch of visual information allows for a stronger prediction, this increase in prediction can explain the reduction in N1 amplitude observed with increasing visual information for emotional stimuli. However, this pattern does not seem to hold for non-emotional stimuli: when the duration of visual information increases, the amplitude of the N1 also increases. Hence, only for emotional stimuli does an increase in visual information seem to correspond to an increase in visual predictability.</p>
<p>Interestingly, this is the case although neutral stimuli, on average, have a longer audiovisual delay (mean delay for stimuli presented in the audiovisual condition: anger: 1032 ms, fear: 863 ms, neutral: 1629 ms), and thus more visual information is available. Therefore, emotional content rather than pure amount of information seems to drive the observed correlation.</p>
<p>Support for the idea that emotional information may have an influence on cross-modal prediction also comes from priming research. The affective content of a prime strongly influences target effects (Carroll and Young, <xref ref-type="bibr" rid="B11">2005</xref>), leading to differences in activation as evidenced by several EEG studies (e.g., Schirmer et al., <xref ref-type="bibr" rid="B33">2002</xref>; Werheid et al., <xref ref-type="bibr" rid="B44">2005</xref>). Schirmer et al. (<xref ref-type="bibr" rid="B33">2002</xref>), for instance, observed smaller N400 amplitudes in response to words that matched a preceding prime in contrast to words that violated the prediction. Also, for facial expressions, a decreased ERP response in frontal areas within 200 ms has been observed in response to primed as compared to non-primed emotion expressions (Werheid et al., <xref ref-type="bibr" rid="B44">2005</xref>).</p>
<p>However, priming studies differ strongly from real multisensory interactions. Visual and auditory information are presented sequentially rather than simultaneously, and typically, visual and auditory stimuli do not originate from the same event. Priming research therefore only allows for investigating prediction at the content level, at which, for instance, the perception of an angry face primes the perception of an angry voice. It does not allow for investigating temporal prediction, as no natural temporal relation between visual and auditory information is present.</p>
<p>Neither our study referenced above (Jessen et al., <xref ref-type="bibr" rid="B20">2012</xref>) nor the mentioned priming studies were thus designed to explicitly investigate the influence of affective information on cross-modal prediction in naturalistic settings. Hence, the reported data just offer a glimpse into this field. Nevertheless, they highlight the potential role cross-modal prediction may play in the multisensory perception of emotions. We believe that this role may be essential for our understanding of emotion perception, and in the following suggest several approaches suited to illuminate this role.</p>
</sec>
<sec>
<title>Future directions</title>
<p>Different aspects of multisensory emotion perception need to be further investigated in order to understand the role of cross-modal prediction in this context. First, it is essential to establish the influence that emotional content has on cross-modal prediction, especially in contrast to other complex and salient information. Second, it will be necessary to investigate which aspects of cross-modal prediction are influenced by emotional content. And finally, it is essential to consider how much or how little emotional information is sufficient to influence such predictions. We will take a closer look at all three points in the following.</p>
<sec>
<title>Affective influence on cross-modal prediction</title>
<p>First, it is necessary to investigate the degree to which affective content influences prediction. The correlation analysis reported above suggests that visually conveyed emotional information has some influence on subsequent auditory processing, but further studies are clearly needed.</p>
<p>In order to investigate this aspect, it is crucial to use appropriate stimulus material. Most importantly, such stimulus material has to be dynamic: only dynamic material allows for the investigation of temporal as well as content-level predictions while retaining the natural temporal relation between visual and auditory onsets. While the use of videos has become increasingly popular in recent years in fMRI studies (e.g., Kreifelts et al., <xref ref-type="bibr" rid="B24">2007</xref>; Pichon et al., <xref ref-type="bibr" rid="B29">2009</xref>; Robins et al., <xref ref-type="bibr" rid="B32">2009</xref>), most EEG (and MEG) studies still rely on static material. One reason for this is probably the very advantage of EEG over fMRI, namely its high temporal resolution. While this resolution allows for close tracking of the time course of information processing, it also makes EEG measures vulnerable to confounds arising from the processing of the preceding visual information. However, this problem can be countered by choosing well-suited control conditions (such as comparably complex and moving non-emotional stimuli). Furthermore, it will be helpful not to rely exclusively on ERP data, but to broaden the analysis to include neural oscillations that can be analyzed in ways less dependent on fixed event onsets (e.g., induced activity, see for instance Tallon-Baudry and Bertrand, <xref ref-type="bibr" rid="B41">1999</xref>). Of particular interest in this context would be the influence emotional visual information has on the phase of oscillatory activity in auditory areas, as well as the relation between low- and high-frequency oscillations. Is, for instance, auditory processing influenced by the phase of the oscillatory activity during visual presentation?</p>
<p>Furthermore, it is necessary to tease apart cross-modal prediction from other forms of multisensory interaction that most likely occur in multisensory emotion perception. Here, it will be essential to manipulate the predictability of the preceding visual information, either at the content level (for instance, by using different intensities of emotion expression) or at the temporal level (by providing more or less visual information, see below).</p>
<p>Finally, another important factor may be the role that different types of visual stimuli play, such as facial in comparison to body expressions. Both are visual sources, naturally co-occurring with auditory information, and therefore both can potentially predict auditory information. However, they differ in that facial expressions are more closely linked to vocal utterances. Body expressions, in contrast, may provide coarser information about emotional states, which is essential at larger distances. Hence, while facial expressions seem the most obvious candidate, body expressions are not to be forgotten (in fact, the correlation reported above shows brain data in response to body&#x02014;voice pairings, Jessen et al., <xref ref-type="bibr" rid="B20">2012</xref>).</p>
<p>Insight from these different approaches will allow us to get a general appreciation of how cross-modal prediction influences multisensory emotion perception.</p>
</sec>
<sec>
<title>Different pathways</title>
<p>At a more specific level, one essential question is which aspect of cross-modal prediction can be influenced by emotional content.</p>
<p>One aspect that is highly relevant in this context is the notion of different pathways as outlined by Arnal et al. (<xref ref-type="bibr" rid="B1">2009</xref>). For cross-modal emotional prediction, at least three different levels of prediction become relevant. Predictions may occur at a simple, physical level, comparable to any other stimulus: from the movement of face and body, we can predict when an auditory event onset will occur. This prediction would correspond to the direct pathway posited by Arnal et al. (<xref ref-type="bibr" rid="B1">2009</xref>). This direct pathway seems to be involved in cross-modal prediction irrespective of emotional content. Emotions may render temporal predictions even more reliable, as emotional facial expressions are very common, well-rehearsed stimuli and hence may allow for a more precise prediction of the onset in comparison to less frequent stimuli. However, the emotional content itself most likely plays only a minor role in the generation of temporal predictions.</p>
<p>Secondly, predictions may occur at the sound level. Based on the shape of the mouth (and to a certain degree other facial features), predictions can be made regarding the following utterance, be it a word, an interjection, or just a vocalization such as laughter. This type of prediction is specific to complex stimuli, for which the production of a sound can be observed visually, for instance in speech production and actions. When this is not the case, for example, if the button on a radio is pushed, we can predict the sound onset, but not the type of sound we will hear.</p>
<p>For this second type of prediction, emotions are expected to play a more important role, as the content of the vocalization is closely tied to the emotion expressed. Still, such predictions concern not only emotional aspects, but also properties of the upcoming sound that are not primarily related to its emotional quality. Hence, predictions specifically related to the affective content are rather a byproduct of predicting general sound features. Nevertheless, quickly determining emotional aspects is essential for fast and efficient emotion processing, and based on this necessity, affective content of the visual signal may lead to a prioritized content processing for sound information.</p>
<p>A third type of prediction is closely related to the prediction of a sound; with respect to cross-modal emotional prediction, we can not only predict whether an &#x0201C;ah&#x0201D; or an &#x0201C;oh&#x0201D; will occur (as in speech perception), but also whether this &#x0201C;ah&#x0201D; will be uttered in an angry or fearful tone of voice. We can thus predict the emotional content. Both of these latter types of prediction invoke an indirect pathway (Arnal et al., <xref ref-type="bibr" rid="B1">2009</xref>). However, while content prediction can occur in several settings, emotion prediction is specific to human face-to-face interaction.</p>
<p>This last type of prediction, emotion prediction proper, is devoted exclusively to predicting the emotional content of an upcoming signal. Hence, the strongest influence of emotional content is expected to occur at this level.</p>
<p>Nevertheless, in order to better understand cross-modal emotion prediction, it will be necessary to further disentangle the relation between these two types of indirect predictions (i.e., the prediction of speech content such as &#x0201C;ah&#x0201D; and the prediction of emotional content from the tone of voice).</p>
</sec>
<sec>
<title>Duration of visual information</title>
<p>Another important aspect is the amount of visual information necessary to generate reliable predictions. It has been shown that the delay between the onset of mouth movement and the onset of speech sound typically varies between 100 and 300 ms (Chandrasekaran et al., <xref ref-type="bibr" rid="B12">2009</xref>). Accordingly, most studies using speech stimuli use an audiovisual delay within that time range (Besle et al., <xref ref-type="bibr" rid="B5">2004</xref>; Stekelenburg and Vroomen, <xref ref-type="bibr" rid="B38">2007</xref>; Arnal et al., <xref ref-type="bibr" rid="B1">2009</xref>). The same holds true for the perception of actions (Stekelenburg and Vroomen, <xref ref-type="bibr" rid="B38">2007</xref>). However, the question arises as to how much delay is actually <italic>necessary</italic> to allow for cross-modal prediction to occur. Stekelenburg and Vroomen (<xref ref-type="bibr" rid="B38">2007</xref>), who used speech stimuli with an auditory delay of 160&#x02013;200 ms as well as action stimuli with an auditory delay of 280&#x02013;320 ms, observed stronger N1 suppression effects for action compared to speech stimuli. They suggested that this difference may be due to the longer stretch of visual information preceding a sound onset. Somewhat shorter optimal delays have been observed using simpler stimulus material and/or more invasive recording. In human EEG, an audiovisual lag of 30 to 75 ms has been found to reliably elicit a phase reset in auditory cortex (Thorne et al., <xref ref-type="bibr" rid="B42">2011</xref>). A similar time window has been found in a study of local field potentials in the auditory cortex of macaque monkeys; the strongest modulation by preceding visual information was observed for a delay between 20 and 80 ms (Kayser et al., <xref ref-type="bibr" rid="B21">2008</xref>).</p>
<p>Hence, providing more visual information may (at least up to some point) allow for better prediction formation. At the same time, if affective information enhances cross-modal prediction, emotional content may reduce the duration of visual information required. Determining the necessary temporal constraints can therefore provide crucial insight into the effect of emotional information on multisensory information processing.</p>
<p>In summary, we suggest that in order to fully understand multisensory emotion perception, it is essential to take into account the role of cross-modal prediction. It will therefore be necessary to bring together approaches and findings from two flourishing fields that have so far been largely kept separate: cross-modal prediction and emotion perception. Only if we understand the role of prediction will we be able to fully understand multisensory emotion perception.</p>
</sec>
<sec>
<title>Conflict of interest statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arnal</surname> <given-names>L. H.</given-names></name> <name><surname>Morillon</surname> <given-names>B.</given-names></name> <name><surname>Kell</surname> <given-names>C. A.</given-names></name> <name><surname>Giraud</surname> <given-names>A.-L.</given-names></name></person-group> (<year>2009</year>). <article-title>Dual neural routing of visual facilitation in speech processing</article-title>. <source>J. Neurosci</source>. <volume>29</volume>, <fpage>13445</fpage>&#x02013;<lpage>13453</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.3194-09.2009</pub-id><pub-id pub-id-type="pmid">19864557</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arnal</surname> <given-names>L. H.</given-names></name> <name><surname>Wyart</surname> <given-names>V.</given-names></name> <name><surname>Giraud</surname> <given-names>A. L.</given-names></name></person-group> (<year>2011</year>). <article-title>Transitions in neural oscillations reflect prediction errors generated in audiovisual speech</article-title>. <source>Nat. Neurosci</source>. <volume>14</volume>, <fpage>797</fpage>&#x02013;<lpage>801</lpage>. <pub-id pub-id-type="doi">10.1038/nn.2810</pub-id><pub-id pub-id-type="pmid">21552273</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bernstein</surname> <given-names>L. E.</given-names></name> <name><surname>Auer</surname> <given-names>E. T.</given-names> <suffix>Jr.</suffix></name> <name><surname>Wagner</surname> <given-names>M.</given-names></name> <name><surname>Ponton</surname> <given-names>C. W.</given-names></name></person-group> (<year>2008</year>). <article-title>Spatiotemporal dynamics of audiovisual speech processing</article-title>. <source>Neuroimage</source> <volume>39</volume>, <fpage>423</fpage>&#x02013;<lpage>435</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2007.08.035</pub-id><pub-id pub-id-type="pmid">17920933</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Besle</surname> <given-names>J.</given-names></name> <name><surname>Bertrand</surname> <given-names>O.</given-names></name> <name><surname>Giard</surname> <given-names>M. H.</given-names></name></person-group> (<year>2009</year>). <article-title>Electrophysiological (EEG, sEEG, MEG) evidence for multiple audiovisual interactions in the human auditory cortex</article-title>. <source>Hear. Res</source>. <volume>258</volume>, <fpage>143</fpage>&#x02013;<lpage>151</lpage>. <pub-id pub-id-type="doi">10.1016/j.heares.2009.06.016</pub-id><pub-id pub-id-type="pmid">19573583</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Besle</surname> <given-names>J.</given-names></name> <name><surname>Fort</surname> <given-names>A.</given-names></name> <name><surname>Delpuech</surname> <given-names>C.</given-names></name> <name><surname>Giard</surname> <given-names>M.</given-names></name></person-group> (<year>2004</year>). <article-title>Bimodal speech: early suppressive visual effects in human auditory cortex</article-title>. <source>Eur. J. Neurosci</source>. <volume>20</volume>, <fpage>2225</fpage>&#x02013;<lpage>2234</lpage>. <pub-id pub-id-type="doi">10.1111/j.1460-9568.2004.03670.x</pub-id><pub-id pub-id-type="pmid">15450102</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Buzsaki</surname> <given-names>G.</given-names></name> <name><surname>Draguhn</surname> <given-names>A.</given-names></name></person-group> (<year>2004</year>). <article-title>Neuronal oscillations in cortical networks</article-title>. <source>Science</source> <volume>304</volume>, <fpage>1926</fpage>&#x02013;<lpage>1929</lpage>. <pub-id pub-id-type="doi">10.1126/science.1099745</pub-id><pub-id pub-id-type="pmid">15218136</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Callan</surname> <given-names>D. E.</given-names></name> <name><surname>Jones</surname> <given-names>J. A.</given-names></name> <name><surname>Munhall</surname> <given-names>K.</given-names></name> <name><surname>Callan</surname> <given-names>A. M.</given-names></name> <name><surname>Kroos</surname> <given-names>C.</given-names></name> <name><surname>Vatikiotis-Bateson</surname> <given-names>E.</given-names></name></person-group> (<year>2003</year>). <article-title>Neural processes underlying perceptual enhancement by visual speech gestures</article-title>. <source>Neuroreport</source> <volume>14</volume>, <fpage>2213</fpage>&#x02013;<lpage>2218</lpage>. <pub-id pub-id-type="doi">10.1097/00001756-200312020-00016</pub-id><pub-id pub-id-type="pmid">14625450</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Calvert</surname> <given-names>G. A.</given-names></name> <name><surname>Brammer</surname> <given-names>M. J.</given-names></name> <name><surname>Bullmore</surname> <given-names>E. T.</given-names></name> <name><surname>Campbell</surname> <given-names>R.</given-names></name> <name><surname>Iversen</surname> <given-names>S. D.</given-names></name> <name><surname>David</surname> <given-names>A. S.</given-names></name></person-group> (<year>1999</year>). <article-title>Response amplification in sensory-specific cortices during crossmodal binding</article-title>. <source>Neuroreport</source> <volume>10</volume>, <fpage>2619</fpage>&#x02013;<lpage>2623</lpage>. <pub-id pub-id-type="doi">10.1097/00001756-199908200-00033</pub-id><pub-id pub-id-type="pmid">10574380</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Calvert</surname> <given-names>G. A.</given-names></name> <name><surname>Brammer</surname> <given-names>M. J.</given-names></name> <name><surname>Iversen</surname> <given-names>S. D.</given-names></name></person-group> (<year>1998</year>). <article-title>Crossmodal identification</article-title>. <source>Trends Cogn. Sci</source>. <volume>2</volume>, <fpage>247</fpage>&#x02013;<lpage>253</lpage>. <pub-id pub-id-type="doi">10.1016/S1364-6613(98)01189-9</pub-id><pub-id pub-id-type="pmid">21244923</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Calvert</surname> <given-names>G. A.</given-names></name> <name><surname>Campbell</surname> <given-names>R.</given-names></name> <name><surname>Brammer</surname> <given-names>M. J.</given-names></name></person-group> (<year>2000</year>). <article-title>Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex</article-title>. <source>Curr. Biol</source>. <volume>10</volume>, <fpage>649</fpage>&#x02013;<lpage>657</lpage>. <pub-id pub-id-type="doi">10.1016/S0960-9822(00)00513-3</pub-id><pub-id pub-id-type="pmid">10837246</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carroll</surname> <given-names>N. C.</given-names></name> <name><surname>Young</surname> <given-names>A. W.</given-names></name></person-group> (<year>2005</year>). <article-title>Priming of emotion recognition</article-title>. <source>Q. J. Exp. Psychol. A</source> <volume>58</volume>, <fpage>1173</fpage>&#x02013;<lpage>1197</lpage>. <pub-id pub-id-type="doi">10.1080/02724980443000539</pub-id><pub-id pub-id-type="pmid">16194954</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chandrasekaran</surname> <given-names>C.</given-names></name> <name><surname>Trubanova</surname> <given-names>A.</given-names></name> <name><surname>Stillittano</surname> <given-names>S.</given-names></name> <name><surname>Caplier</surname> <given-names>A.</given-names></name> <name><surname>Ghazanfar</surname> <given-names>A. A.</given-names></name></person-group> (<year>2009</year>). <article-title>The natural statistics of audiovisual speech</article-title>. <source>PLoS Comput. Biol</source>. <volume>5</volume>:<fpage>e1000436</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1000436</pub-id><pub-id pub-id-type="pmid">19609344</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y. H.</given-names></name> <name><surname>Edgar</surname> <given-names>J. C.</given-names></name> <name><surname>Holroyd</surname> <given-names>T.</given-names></name> <name><surname>Dammers</surname> <given-names>J.</given-names></name> <name><surname>Thonnessen</surname> <given-names>H.</given-names></name> <name><surname>Roberts</surname> <given-names>T. P.</given-names></name> <etal/></person-group>. (<year>2010</year>). <article-title>Neuromagnetic oscillations to emotional faces and prosody</article-title>. <source>Eur. J. Neurosci</source>. <volume>31</volume>, <fpage>1818</fpage>&#x02013;<lpage>1827</lpage>. <pub-id pub-id-type="doi">10.1111/j.1460-9568.2010.07203.x</pub-id><pub-id pub-id-type="pmid">20584186</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davis</surname> <given-names>C.</given-names></name> <name><surname>Kislyuk</surname> <given-names>D.</given-names></name> <name><surname>Kim</surname> <given-names>J.</given-names></name> <name><surname>Sams</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>The effect of viewing speech on auditory speech processing is different in the left and right hemispheres</article-title>. <source>Brain Res</source>. <volume>1242</volume>, <fpage>151</fpage>&#x02013;<lpage>161</lpage>. <pub-id pub-id-type="doi">10.1016/j.brainres.2008.04.077</pub-id><pub-id pub-id-type="pmid">18538750</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>de Gelder</surname> <given-names>B.</given-names></name> <name><surname>B&#x000F6;cker</surname> <given-names>K. B. E.</given-names></name> <name><surname>Tuomainen</surname> <given-names>J.</given-names></name> <name><surname>Hensen</surname> <given-names>M.</given-names></name> <name><surname>Vroomen</surname> <given-names>J.</given-names></name></person-group> (<year>1999</year>). <article-title>The combined perception of emotion from voice and face: early interaction revealed by human electric brain responses</article-title>. <source>Neurosci. Lett</source>. <volume>260</volume>, <fpage>133</fpage>&#x02013;<lpage>136</lpage>. <pub-id pub-id-type="doi">10.1016/S0304-3940(98)00963-X</pub-id><pub-id pub-id-type="pmid">10025717</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Doesburg</surname> <given-names>S. M.</given-names></name> <name><surname>Emberson</surname> <given-names>L. L.</given-names></name> <name><surname>Rahi</surname> <given-names>A.</given-names></name> <name><surname>Cameron</surname> <given-names>D.</given-names></name> <name><surname>Ward</surname> <given-names>L. M.</given-names></name></person-group> (<year>2008</year>). <article-title>Asynchrony from synchrony: long-range gamma-band neural synchrony accompanies perception of audiovisual speech asynchrony</article-title>. <source>Exp. Brain Res</source>. <volume>185</volume>, <fpage>11</fpage>&#x02013;<lpage>20</lpage>. <pub-id pub-id-type="doi">10.1007/s00221-007-1127-5</pub-id><pub-id pub-id-type="pmid">17922119</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ghazanfar</surname> <given-names>A. A.</given-names></name> <name><surname>Schroeder</surname> <given-names>C. E.</given-names></name></person-group> (<year>2006</year>). <article-title>Is neocortex essentially multisensory?</article-title> <source>Trends Cogn. Sci</source>. <volume>10</volume>, <fpage>278</fpage>&#x02013;<lpage>285</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2006.04.008</pub-id><pub-id pub-id-type="pmid">16713325</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Giard</surname> <given-names>M. H.</given-names></name> <name><surname>Peronnet</surname> <given-names>F.</given-names></name></person-group> (<year>1999</year>). <article-title>Auditory-visual integration during multimodal object recognition in humans: a behavioral and electrophysiological study</article-title>. <source>J. Cogn. Neurosci</source>. <volume>11</volume>, <fpage>473</fpage>&#x02013;<lpage>490</lpage>. <pub-id pub-id-type="doi">10.1162/089892999563544</pub-id><pub-id pub-id-type="pmid">10511637</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jessen</surname> <given-names>S.</given-names></name> <name><surname>Kotz</surname> <given-names>S. A.</given-names></name></person-group> (<year>2011</year>). <article-title>The temporal dynamics of processing emotions from vocal, facial, and bodily expressions</article-title>. <source>Neuroimage</source> <volume>58</volume>, <fpage>665</fpage>&#x02013;<lpage>674</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2011.06.035</pub-id><pub-id pub-id-type="pmid">21718792</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jessen</surname> <given-names>S.</given-names></name> <name><surname>Obleser</surname> <given-names>J.</given-names></name> <name><surname>Kotz</surname> <given-names>S. A.</given-names></name></person-group> (<year>2012</year>). <article-title>How bodies and voices interact in early emotion perception</article-title>. <source>PLoS ONE</source> <volume>7</volume>:<fpage>e36070</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0036070</pub-id><pub-id pub-id-type="pmid">22558332</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kayser</surname> <given-names>C.</given-names></name> <name><surname>Petkov</surname> <given-names>C. I.</given-names></name> <name><surname>Logothetis</surname> <given-names>N. K.</given-names></name></person-group> (<year>2008</year>). <article-title>Visual modulation of neurons in auditory cortex</article-title>. <source>Cereb. Cortex</source> <volume>18</volume>, <fpage>1560</fpage>&#x02013;<lpage>1574</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bhm187</pub-id><pub-id pub-id-type="pmid">18180245</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klasen</surname> <given-names>M.</given-names></name> <name><surname>Chen</surname> <given-names>Y. H.</given-names></name> <name><surname>Mathiak</surname> <given-names>K.</given-names></name></person-group> (<year>2012</year>). <article-title>Multisensory emotions: perception, combination and underlying neural processes</article-title>. <source>Rev. Neurosci</source>. <volume>23</volume>, <fpage>381</fpage>&#x02013;<lpage>392</lpage>. <pub-id pub-id-type="doi">10.1515/revneuro-2012-0040</pub-id><pub-id pub-id-type="pmid">23089604</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klucharev</surname> <given-names>V.</given-names></name> <name><surname>Mottonen</surname> <given-names>R.</given-names></name> <name><surname>Sams</surname> <given-names>M.</given-names></name></person-group> (<year>2003</year>). <article-title>Electrophysiological indicators of phonetic and non-phonetic multisensory interactions during audiovisual speech perception</article-title>. <source>Brain Res. Cogn. Brain Res</source>. <volume>18</volume>, <fpage>65</fpage>&#x02013;<lpage>75</lpage>. <pub-id pub-id-type="doi">10.1016/j.cogbrainres.2003.09.004</pub-id><pub-id pub-id-type="pmid">14659498</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kreifelts</surname> <given-names>B.</given-names></name> <name><surname>Ethofer</surname> <given-names>T.</given-names></name> <name><surname>Grodd</surname> <given-names>W.</given-names></name> <name><surname>Erb</surname> <given-names>M.</given-names></name> <name><surname>Wildgruber</surname> <given-names>D.</given-names></name></person-group> (<year>2007</year>). <article-title>Audiovisual integration of emotional signals in voice and face: an event-related fMRI study</article-title>. <source>Neuroimage</source> <volume>37</volume>, <fpage>1445</fpage>&#x02013;<lpage>1456</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2007.06.020</pub-id><pub-id pub-id-type="pmid">17659885</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lakatos</surname> <given-names>P.</given-names></name> <name><surname>O&#x00027;Connell</surname> <given-names>M. N.</given-names></name> <name><surname>Barczak</surname> <given-names>A.</given-names></name> <name><surname>Mills</surname> <given-names>A.</given-names></name> <name><surname>Javitt</surname> <given-names>D. C.</given-names></name> <name><surname>Schroeder</surname> <given-names>C. E.</given-names></name></person-group> (<year>2009</year>). <article-title>The leading sense: supramodal control of neurophysiological context by attention</article-title>. <source>Neuron</source> <volume>64</volume>, <fpage>419</fpage>&#x02013;<lpage>430</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2009.10.014</pub-id><pub-id pub-id-type="pmid">19914189</pub-id></citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Laurienti</surname> <given-names>P. J.</given-names></name> <name><surname>Wallace</surname> <given-names>M. T.</given-names></name> <name><surname>Maldjian</surname> <given-names>J. A.</given-names></name> <name><surname>Susi</surname> <given-names>C. M.</given-names></name> <name><surname>Stein</surname> <given-names>B.</given-names></name> <name><surname>Burdette</surname> <given-names>J. H.</given-names></name></person-group> (<year>2003</year>). <article-title>Cross-modal sensory processing in the anterior cingulate and medial prefrontal cortices</article-title>. <source>Hum. Brain Mapp</source>. <volume>19</volume>, <fpage>213</fpage>&#x02013;<lpage>223</lpage>. <pub-id pub-id-type="doi">10.1002/hbm.10112</pub-id><pub-id pub-id-type="pmid">12874776</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meeren</surname> <given-names>H. K. M.</given-names></name> <name><surname>van Heijnsbergen</surname> <given-names>C. C. R. J.</given-names></name> <name><surname>de Gelder</surname> <given-names>B.</given-names></name></person-group> (<year>2005</year>). <article-title>Rapid perceptual integration of facial expression and emotional body language</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>102</volume>, <fpage>16518</fpage>&#x02013;<lpage>16523</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0507650102</pub-id><pub-id pub-id-type="pmid">16260734</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paulmann</surname> <given-names>S.</given-names></name> <name><surname>Jessen</surname> <given-names>S.</given-names></name> <name><surname>Kotz</surname> <given-names>S. A.</given-names></name></person-group> (<year>2009</year>). <article-title>Investigating the multimodal nature of human communication. Insights from ERPs</article-title>. <source>J. Psychophysiol</source>. <volume>23</volume>, <fpage>63</fpage>&#x02013;<lpage>76</lpage>. <pub-id pub-id-type="doi">10.1027/0269-8803.23.2.63</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pichon</surname> <given-names>S.</given-names></name> <name><surname>de Gelder</surname> <given-names>B.</given-names></name> <name><surname>Gr&#x000E8;zes</surname> <given-names>J.</given-names></name></person-group> (<year>2009</year>). <article-title>Two different faces of threat. Comparing the neural systems for recognizing fear and anger in dynamic body expressions</article-title>. <source>Neuroimage</source> <volume>47</volume>, <fpage>1873</fpage>&#x02013;<lpage>1883</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2009.03.084</pub-id><pub-id pub-id-type="pmid">19371787</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pourtois</surname> <given-names>G.</given-names></name> <name><surname>Debatisse</surname> <given-names>D.</given-names></name> <name><surname>Despland</surname> <given-names>P.-A.</given-names></name> <name><surname>de Gelder</surname> <given-names>B.</given-names></name></person-group> (<year>2002</year>). <article-title>Facial expressions modulate the time course of long latency auditory brain potentials</article-title>. <source>Brain Res. Cogn. Brain Res</source>. <volume>14</volume>, <fpage>99</fpage>&#x02013;<lpage>105</lpage>. <pub-id pub-id-type="doi">10.1016/S0926-6410(02)00064-2</pub-id><pub-id pub-id-type="pmid">12063133</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pourtois</surname> <given-names>G.</given-names></name> <name><surname>de Gelder</surname> <given-names>B.</given-names></name> <name><surname>Vroomen</surname> <given-names>J.</given-names></name> <name><surname>Rossion</surname> <given-names>B.</given-names></name> <name><surname>Crommelinck</surname> <given-names>M.</given-names></name></person-group> (<year>2000</year>). <article-title>The time-course of intermodal binding between seeing and hearing affective information</article-title>. <source>Neuroreport</source> <volume>11</volume>, <fpage>1329</fpage>&#x02013;<lpage>1333</lpage>. <pub-id pub-id-type="doi">10.1097/00001756-200004270-00036</pub-id><pub-id pub-id-type="pmid">10817616</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Robins</surname> <given-names>D. L.</given-names></name> <name><surname>Hunyadi</surname> <given-names>E.</given-names></name> <name><surname>Schultz</surname> <given-names>R. T.</given-names></name></person-group> (<year>2009</year>). <article-title>Superior temporal activation in response to dynamic audio-visual emotional cues</article-title>. <source>Brain Cogn</source>. <volume>69</volume>, <fpage>269</fpage>&#x02013;<lpage>278</lpage>. <pub-id pub-id-type="doi">10.1016/j.bandc.2008.08.007</pub-id><pub-id pub-id-type="pmid">18809234</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schirmer</surname> <given-names>A.</given-names></name> <name><surname>Kotz</surname> <given-names>S. A.</given-names></name> <name><surname>Friederici</surname> <given-names>A. D.</given-names></name></person-group> (<year>2002</year>). <article-title>Sex differentiates the role of emotional prosody during word processing</article-title>. <source>Brain Res. Cogn. Brain Res</source>. <volume>14</volume>, <fpage>228</fpage>&#x02013;<lpage>233</lpage>. <pub-id pub-id-type="doi">10.1016/S0926-6410(02)00108-8</pub-id><pub-id pub-id-type="pmid">12067695</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schroeder</surname> <given-names>C. E.</given-names></name> <name><surname>Lakatos</surname> <given-names>P.</given-names></name> <name><surname>Kajikawa</surname> <given-names>Y.</given-names></name> <name><surname>Partan</surname> <given-names>S.</given-names></name> <name><surname>Puce</surname> <given-names>A.</given-names></name></person-group> (<year>2008</year>). <article-title>Neuronal oscillations and visual amplification of speech</article-title>. <source>Trends Cogn. Sci</source>. <volume>12</volume>, <fpage>106</fpage>&#x02013;<lpage>113</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2008.01.002</pub-id><pub-id pub-id-type="pmid">18280772</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Senkowski</surname> <given-names>D.</given-names></name> <name><surname>Schneider</surname> <given-names>T. R.</given-names></name> <name><surname>Foxe</surname> <given-names>J. J.</given-names></name> <name><surname>Engel</surname> <given-names>A. K.</given-names></name></person-group> (<year>2008</year>). <article-title>Crossmodal binding through neural coherence: implications for multisensory processing</article-title>. <source>Trends Neurosci</source>. <volume>31</volume>, <fpage>401</fpage>&#x02013;<lpage>409</lpage>. <pub-id pub-id-type="doi">10.1016/j.tins.2008.05.002</pub-id><pub-id pub-id-type="pmid">18602171</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sperdin</surname> <given-names>H. F.</given-names></name> <name><surname>Cappe</surname> <given-names>C.</given-names></name> <name><surname>Foxe</surname> <given-names>J. J.</given-names></name> <name><surname>Murray</surname> <given-names>M. M.</given-names></name></person-group> (<year>2009</year>). <article-title>Early, low-level auditory-somatosensory multisensory interactions impact reaction time speed</article-title>. <source>Front. Integr. Neurosci</source>. <volume>3</volume>:<fpage>2</fpage>. <pub-id pub-id-type="doi">10.3389/neuro.07.002.2009</pub-id><pub-id pub-id-type="pmid">19404410</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stekelenburg</surname> <given-names>J. J.</given-names></name> <name><surname>de Gelder</surname> <given-names>B.</given-names></name></person-group> (<year>2004</year>). <article-title>The neural correlates of perceiving human bodies: an ERP study on the body-inversion effect</article-title>. <source>Neuroreport</source> <volume>15</volume>, <fpage>777</fpage>&#x02013;<lpage>780</lpage>. <pub-id pub-id-type="doi">10.1097/00001756-200404090-00007</pub-id><pub-id pub-id-type="pmid">15073513</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stekelenburg</surname> <given-names>J. J.</given-names></name> <name><surname>Vroomen</surname> <given-names>J.</given-names></name></person-group> (<year>2007</year>). <article-title>Neural correlates of multisensory integration of ecologically valid audiovisual events</article-title>. <source>J. Cogn. Neurosci</source>. <volume>19</volume>, <fpage>1964</fpage>&#x02013;<lpage>1973</lpage>. <pub-id pub-id-type="doi">10.1162/jocn.2007.19.12.1964</pub-id><pub-id pub-id-type="pmid">17892381</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stekelenburg</surname> <given-names>J. J.</given-names></name> <name><surname>Vroomen</surname> <given-names>J.</given-names></name></person-group> (<year>2012</year>). <article-title>Electrophysiological correlates of predictive coding of auditory location in the perception of natural audiovisual events</article-title>. <source>Front. Integr. Neurosci</source>. <volume>6</volume>:<fpage>26</fpage>. <pub-id pub-id-type="doi">10.3389/fnint.2012.00026</pub-id><pub-id pub-id-type="pmid">22666195</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Summerfield</surname> <given-names>C.</given-names></name> <name><surname>Egner</surname> <given-names>T.</given-names></name></person-group> (<year>2009</year>). <article-title>Expectation (and attention) in visual cognition</article-title>. <source>Trends Cogn. Sci</source>. <volume>13</volume>, <fpage>403</fpage>&#x02013;<lpage>409</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2009.06.003</pub-id><pub-id pub-id-type="pmid">19716752</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tallon-Baudry</surname> <given-names>C.</given-names></name> <name><surname>Bertrand</surname> <given-names>O.</given-names></name></person-group> (<year>1999</year>). <article-title>Oscillatory gamma activity in humans and its role in object representation</article-title>. <source>Trends Cogn. Sci</source>. <volume>3</volume>, <fpage>151</fpage>&#x02013;<lpage>162</lpage>. <pub-id pub-id-type="doi">10.1016/S1364-6613(99)01299-1</pub-id><pub-id pub-id-type="pmid">10322469</pub-id></citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thorne</surname> <given-names>J. D.</given-names></name> <name><surname>De Vos</surname> <given-names>M.</given-names></name> <name><surname>Viola</surname> <given-names>F. C.</given-names></name> <name><surname>Debener</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>Cross-modal phase reset predicts auditory task performance in humans</article-title>. <source>J. Neurosci</source>. <volume>31</volume>, <fpage>3853</fpage>&#x02013;<lpage>3861</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.6176-10.2011</pub-id><pub-id pub-id-type="pmid">21389240</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>van Wassenhove</surname> <given-names>V.</given-names></name> <name><surname>Grant</surname> <given-names>K. W.</given-names></name> <name><surname>Poeppel</surname> <given-names>D.</given-names></name></person-group> (<year>2005</year>). <article-title>Visual speech speeds up the neural processing of auditory speech</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A</source>. <volume>102</volume>, <fpage>1181</fpage>&#x02013;<lpage>1186</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.0408949102</pub-id><pub-id pub-id-type="pmid">15647358</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Werheid</surname> <given-names>K.</given-names></name> <name><surname>Alpay</surname> <given-names>G.</given-names></name> <name><surname>Jentzsch</surname> <given-names>I.</given-names></name> <name><surname>Sommer</surname> <given-names>W.</given-names></name></person-group> (<year>2005</year>). <article-title>Priming emotional facial expressions as evidenced by event-related brain potentials</article-title>. <source>Int. J. Psychophysiol</source>. <volume>55</volume>, <fpage>209</fpage>&#x02013;<lpage>219</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijpsycho.2004.07.006</pub-id><pub-id pub-id-type="pmid">15649552</pub-id></citation>
</ref>
</ref-list>
</back>
</article>
