<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="brief-report">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2021.678029</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Brief Research Report</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>The Impact of Temporally Coherent Visual Cues on Speech Perception in Complex Auditory Environments</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yuan</surname> <given-names>Yi</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1180101/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Lleo</surname> <given-names>Yasneli</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1346922/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Daniel</surname> <given-names>Rebecca</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1347025/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>White</surname> <given-names>Alexandra</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/1260442/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Oh</surname> <given-names>Yonghee</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/588057/overview"/>
</contrib>
</contrib-group>
<aff><institution>Department of Speech, Language, and Hearing Sciences, University of Florida</institution>, <addr-line>Gainesville, FL</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Josef P. Rauschecker, Georgetown University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Patrik Alexander Wikman, University of Helsinki, Finland; Ken W. Grant, Walter Reed National Military Medical Center, United States</p></fn>
<corresp id="c001">&#x002A;Correspondence: Yonghee Oh, <email>yoh@ufl.edu</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>07</day>
<month>06</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>15</volume>
<elocation-id>678029</elocation-id>
<history>
<date date-type="received">
<day>08</day>
<month>03</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>05</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2021 Yuan, Lleo, Daniel, White and Oh.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Yuan, Lleo, Daniel, White and Oh</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Speech perception often takes place in noisy environments, where multiple auditory signals compete with one another. Adding visual cues, such as talkers&#x2019; faces or lip movements, to an auditory signal can improve the intelligibility of speech in such suboptimal listening environments; these improvements are referred to as audiovisual benefits. The current study aimed to delineate the signal-to-noise ratio (SNR) conditions under which visual presentations of acoustic amplitude envelopes have their most significant impact on speech perception. Seventeen adults with normal hearing were recruited. Participants were presented with spoken sentences in babble noise, in either auditory-only or auditory-visual conditions, at SNRs of &#x2212;7, &#x2212;5, &#x2212;3, &#x2212;1, and 1 dB. The visual stimulus was a sphere whose size varied in sync with the amplitude envelope of the target speech signals. Participants were asked to transcribe the sentences they heard. A significant improvement in accuracy in the auditory-visual condition over the auditory-only condition was obtained at SNRs of &#x2212;3 and &#x2212;1 dB, but no improvement was observed at the other SNRs. These results show that dynamic temporal visual information can benefit speech perception in noise and that the optimal facilitative effect of the visual amplitude envelope occurs within an intermediate SNR range.</p>
</abstract>
<kwd-group>
<kwd>audiovisual speech recognition</kwd>
<kwd>multisensory gain</kwd>
<kwd>temporal coherence</kwd>
<kwd>amplitude envelope</kwd>
<kwd>SNR</kwd>
</kwd-group>
<counts>
<fig-count count="2"/>
<table-count count="0"/>
<equation-count count="1"/>
<ref-count count="45"/>
<page-count count="7"/>
<word-count count="0"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1">
<title>Introduction</title>
<p>Speech perception is crucial to daily communication (information exchange) for normal-hearing and hearing-impaired people of all ages. The quality of speech perception varies with the listening conditions; for instance, speech intelligibility decreases in demanding listening situations. In their seminal work, <xref ref-type="bibr" rid="B40">Sumby and Pollack (1954)</xref> examined the contribution of visual factors (such as a talker&#x2019;s face) to speech intelligibility across varied speech-to-noise ratios (now commonly referred to as signal-to-noise ratios, SNRs). They concluded that the visual contribution becomes more critical as the SNR decreases. Therefore, it is essential to study human communication in the context of audiovisual speech perception (see <xref ref-type="bibr" rid="B11">Grant and Bernstein, 2019</xref>, for a review). More importantly, because daily-life listening occurs across a wide range of noise environments, studying the audiovisual benefits to speech perception under various noise conditions is critical to fully understanding the contribution of visual inputs to speech perception.</p>
<p>A wealth of behavioral studies has examined the audiovisual benefits to speech perception using different noise levels. One line of research has focused on the speech reception threshold (SRT) benefit conferred by visual cues. <xref ref-type="bibr" rid="B27">MacLeod and Summerfield (1990)</xref> measured SRTs for sentences in white noise low-pass filtered at 10 kHz using an adaptive tracking procedure. They reasoned that noise masks the higher-frequency components of the acoustic signal, to which the visual cues of lip shape and tongue position are complementary. The noise level was fixed at 60 dBA, and the starting SNR for each list was &#x2212;28 dB in the audiovisual (AV) condition and &#x2212;20 dB in the auditory-only (AO) condition. The average SRT benefit, measured at the 50%-correct point, was 6.4 dB. <xref ref-type="bibr" rid="B13">Grant and Seitz (2000)</xref> used a three-up, one-down adaptive tracking procedure that varied the noise intensity to target the 79% point on the psychometric function (<xref ref-type="bibr" rid="B22">Levitt, 1971</xref>), with the target fixed at 50-dB sound pressure level (SPL). The visual benefit calculated with their method averaged 1.6 dB.</p>
<p>Another line of research has explored audiovisual benefits across a range of SNRs, or at selected SNRs, to determine the level at which audiovisual gains are most pronounced. <xref ref-type="bibr" rid="B34">O&#x2019;Neill (1954)</xref> applied four different SNRs (&#x2212;20, &#x2212;10, 0, and 10 dB) and found that visual recognition (AV condition) was greater than nonvisual recognition (AO condition) at all four SNRs. <xref ref-type="bibr" rid="B40">Sumby and Pollack (1954)</xref> applied a range of SNRs from &#x2212;30 to 0 dB and proposed that the visual contribution becomes more critical as the SNR decreases. Similar to <xref ref-type="bibr" rid="B34">O&#x2019;Neill (1954)</xref>, they reasoned that because intelligibility is much lower in the AO condition at low SNRs (for instance, &#x2212;20 dB), the visual cues&#x2019; contributions are reflected in the percentage of correct responses. However, as pointed out in their own description, the stimuli used in this study were a closed set of isolated spondee words, that is, words with two equally stressed syllables such as &#x201C;sunset&#x201D; and &#x201C;toothache.&#x201D; The spondees were presented to the subjects prior to testing and as a checklist during testing. With this procedure, relatively low SNRs would have an inflated impact, because participants could guess the words from a small and consistent candidate pool. To eliminate the drawbacks of closed-set stimuli, <xref ref-type="bibr" rid="B35">Ross et al. (2007a</xref>, <xref ref-type="bibr" rid="B36">b)</xref> applied a much larger stimulus set and found the highest gain at intermediate SNRs. They measured speech perception in normal subjects and schizophrenia patients at SNRs of &#x2212;24, &#x2212;20, &#x2212;16, &#x2212;12, &#x2212;8, &#x2212;4, and 0 dB. They found that the maximum multisensory gain (the difference between AV and AO speech perception) occurred at the &#x2212;12-dB SNR; the audiovisual gain showed an inverted U-shaped curve as a function of SNR. A similar SNR trend was observed in the study of <xref ref-type="bibr" rid="B25">Liu et al. (2013)</xref>, who measured audiovisual benefits with Chinese monosyllabic words in pink noise at SNRs of &#x2212;16, &#x2212;12, &#x2212;8, &#x2212;4, and 0 dB, in both behavioral recognition and event-related potential (ERP) paradigms. Their behavioral data showed that the maximum difference in speech recognition accuracy between the AV and AO conditions occurred at the &#x2212;12-dB SNR, aligned with the largest evoked ERP response, which was observed at the same SNR. Taken together, these studies show that audiovisual benefits are not greatest under the lowest SNR conditions; instead, there is a special zone at more intermediate SNRs (such as &#x2212;12 dB) where audiovisual integration yields substantial benefits.</p>
<p>Regardless of the various SNRs applied, the studies mentioned above all share a common type of visual stimulus: actual talkers&#x2019; faces or lip movements. Lip-reading education prevails in the hearing community (<xref ref-type="bibr" rid="B4">Campbell and Dodd, 1980</xref>). However, studies have shown that some articulatory activities cannot be detected through lip-reading (<xref ref-type="bibr" rid="B41">Summerfield, 1992</xref>, for a review). Given the limited articulation information conveyed by lip movements, it is important to know which visual features (for example, lip movements) truly benefit audiovisual integration in speech perception. According to the motor theory (MT; <xref ref-type="bibr" rid="B24">Liberman et al., 1967</xref>; <xref ref-type="bibr" rid="B23">Liberman and Mattingly, 1985</xref>), speech perception entails recovery of the articulatory gestures or the invariant neuromotor commands. Speech in noise is perceived more accurately when the speaker can be seen, as specific linguistic information is encoded in the lip movements (<xref ref-type="bibr" rid="B8">Erber, 1975</xref>). <xref ref-type="bibr" rid="B9">Erber (1979)</xref> synthesized an oscilloscope pattern that resembled actual lip configurations by tracking F1 in vowels and other signals for lip width. <xref ref-type="bibr" rid="B13">Grant and Seitz (2000)</xref> and <xref ref-type="bibr" rid="B10">Grant (2001)</xref> calculated the correlations between the spatial information of mouth opening, speech amplitude peaks, and different formants. They demonstrated that observing lip and face movements yields more phonemic-level detail and improves the speech detection threshold. <xref ref-type="bibr" rid="B20">Jaekl et al. (2015)</xref> explored the contribution of dynamic configural information to speech perception by applying point-light displays to a motion-captured real talker&#x2019;s face. They suggested that global processing of face-shape changes contributes significantly to the perception of articulatory gestures in speech perception.</p>
<p>In contrast to MT, proponents of the general auditory learning approaches (GA; <xref ref-type="bibr" rid="B5">Diehl and Kluender, 1989</xref>; <xref ref-type="bibr" rid="B6">Diehl et al., 2004</xref>) to speech perception contend that speech perception is performed by mechanisms that handle all environmental stimuli. They argue that perceptual constancy results from the system&#x2019;s general ability to combine multiple imperfect acoustic cues, without recovery of the articulatory gestures. More specifically, from our study&#x2019;s perspective, the essence of lip-reading information is its delivery of the temporal cues of the acoustic signal, a characteristic shared across the visual and auditory modalities (<xref ref-type="bibr" rid="B1">Atilgan et al., 2018</xref>). Whereas some visual stimuli may affect individual phoneme recognition (e.g., the McGurk effect; <xref ref-type="bibr" rid="B29">McGurk and MacDonald, 1976</xref>), in running speech the dynamic movements of the mouth provide an analog of the acoustic amplitude envelope, which conveys the temporal information that serves as the foundation for audiovisual binding across modalities (<xref ref-type="bibr" rid="B28">Maddox et al., 2015</xref>; <xref ref-type="bibr" rid="B3">Bizley et al., 2016</xref>).</p>
<p>In our previous study (<xref ref-type="bibr" rid="B44">Yuan et al., 2020</xref>), we used an abstract visual presentation (a sphere) of the amplitude envelope cues from target sentences to assist speech perception in a fixed &#x2212;3-dB SNR background noise. Significant speech performance improvements were observed with the visual analog of the amplitude envelope, without any actual face or lip movements. Our research&#x2019;s central hypothesis is that dynamic temporal visual information benefits speech perception independent of particular articulation movements. <xref ref-type="bibr" rid="B2">Bernstein et al. (2004)</xref> used several abstract visual representations of speech amplitude envelopes, such as a Lissajous curve and a rectangle. Although their results showed a decrease in speech detection thresholds under AV conditions, they did not find greater audiovisual benefits when comparing abstract visual cues with the actual talker&#x2019;s face. These results partly reflect the limitations of their test materials, which consisted of isolated phonemic combinations. We hypothesize that the tracking of temporal cues in visual signals synced with auditory signals plays a key role in the perception of continuous speech and in speech intelligibility enhancement. Continuous speech tracking relies more on audio-to-visual correlation in the temporal domain, in contrast to single-phoneme recognition tasks (<xref ref-type="bibr" rid="B45">Zion Golumbic et al., 2013</xref>). Therefore, our studies used speech sentences instead of phonemic combinations. In our previous study (<xref ref-type="bibr" rid="B44">Yuan et al., 2020</xref>), eight separately recruited subjects were tested on 30 sentences at SNRs of &#x2212;1, &#x2212;3, and &#x2212;5 dB from both male and female speakers in the AO condition. These behavioral pilot data were used to select appropriate SNRs for testing. The results indicated that an SNR of &#x2212;3 dB for both female (mean = 77.03%, SD = 20.64%) and male (mean = 62.11%, SD = 28.64%) speakers yielded a level of performance that avoided ceiling and floor effects. However, in real-life settings, listeners face fast-changing listening environments with various levels of noise. These findings led us to ask whether our previous results would hold under other SNR conditions. In the current study, we hypothesized that, similar to previous research with word stimuli (<xref ref-type="bibr" rid="B35">Ross et al., 2007a</xref>; <xref ref-type="bibr" rid="B25">Liu et al., 2013</xref>), intermediate SNRs for optimal gain by audiovisual enhancement would also be observed in sentence-level speech perception in noise.</p>
</sec>
<sec id="S2" sec-type="materials|methods">
<title>Materials and Methods</title>
<sec id="S2.SS1">
<title>Subjects</title>
<p>Seventeen adult subjects (14 female, 3 male; average age 21 &#x00B1; 1.6 years) participated in the experiment. All subjects had normal audiometric hearing thresholds (air-conduction thresholds &#x2264; 25 dB hearing level) and were screened for normal cognitive function using the Montreal Cognitive Assessment (MoCA, <xref ref-type="bibr" rid="B33">Nasreddine et al., 2005</xref>), with a minimum score of 26 out of 30 required to qualify (mean MoCA score: 28.76). All subjects were native, monolingual English speakers with normal vision. All experiments were conducted in accordance with the guidelines for the protection of human subjects established by the Institutional Review Board (IRB) of the University of Florida, and the methods employed were approved by that IRB.</p>
</sec>
<sec id="S2.SS2">
<title>Stimuli and Procedures</title>
<p>Auditory stimuli consisted of five lists of speech sentences (10 sentences per list) from the Harvard sentences (<xref ref-type="bibr" rid="B19">Institute of Electrical and Electronic Engineers, 1969</xref>). Each list was recorded by male and female native English speakers (five sentences per speaker in each list; 25 sentences in total for each speaker). Ten sentences were used for the practice section and 40 for testing (20 sentences for the AO condition and 20 for the AV condition). An example speech sentence is shown in <xref ref-type="fig" rid="F1">Figure 1A</xref>. The target sentences were sampled at 44.1 kHz and root-mean-square (RMS) matched in MATLAB (R2019a, MathWorks, Natick, MA, United States) to a fixed 65-dB SPL for presentation. Target sentences were embedded within multi-talker babble noise, with 200 ms of noise added before and after each sentence. Eight-talker babble noise was prerecorded and normalized in MATLAB to intensities of 72, 70, 68, 66, and 64 dB SPL, yielding SNRs of &#x2212;7, &#x2212;5, &#x2212;3, &#x2212;1, and 1 dB, respectively.</p>
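<p>As a minimal Python sketch (not the authors&#x2019; MATLAB code), the level setting and mixing described above can be expressed as follows; the reference RMS and the helper names (set_level, embed_in_noise) are illustrative assumptions.</p>

```python
import numpy as np

def set_level(signal, level_db, ref_rms=1.0):
    """Scale `signal` so its RMS sits level_db dB relative to ref_rms
    (an illustrative stand-in for SPL calibration)."""
    rms = np.sqrt(np.mean(signal ** 2))
    target_rms = ref_rms * 10.0 ** (level_db / 20.0)
    return signal * (target_rms / rms)

def embed_in_noise(target, noise, fs=44100, pad_s=0.2):
    """Mix the target into babble noise with 200 ms of noise-only
    padding before and after the sentence."""
    pad = int(pad_s * fs)
    out = noise[: len(target) + 2 * pad].copy()
    out[pad : pad + len(target)] += target
    return out

# Target fixed at 65 dB SPL; babble noise at 72, 70, 68, 66, and 64 dB SPL.
target_level = 65.0
noise_levels = [72.0, 70.0, 68.0, 66.0, 64.0]
snrs = [target_level - n for n in noise_levels]
print(snrs)  # [-7.0, -5.0, -3.0, -1.0, 1.0]
```

Because the target level is fixed and only the noise level varies, each SNR is simply the difference between the two presentation levels.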
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Schematic representation of the stimuli used in this study. <bold>(A)</bold> The waveform and the extracted temporal envelope of a sample speech sentence, &#x201C;Steam hissed from the broken valve.&#x201D; <bold>(B)</bold> Sphere-shaped visual symbol synchronized with the acoustic speech amplitude envelope at time intervals 400, 950, 1,550, and 1,950 ms, indicated by the dotted vertical lines.</p></caption>
<graphic xlink:href="fnins-15-678029-g001.tif"/>
</fig>
<p>For visual stimuli in the AV condition, instead of videos of the actual talker&#x2019;s face or lips, a visual analog of the amplitude envelope was applied, as in our previous study (<xref ref-type="bibr" rid="B44">Yuan et al., 2020</xref>). Following <xref ref-type="bibr" rid="B43">Yuan et al. (in press)</xref>, amplitude envelopes were first extracted from the wideband target sentences and then passed through a low-pass filter (fourth-order Butterworth) with a fixed cutoff of 10 Hz. This cutoff frequency was found to be the optimal cue for modulating the visual analog, yielding greater AV benefits than the other cutoffs tested in that study (4 and 30 Hz). This parameter is also consistent with the findings of <xref ref-type="bibr" rid="B7">Drullman et al. (1994)</xref>, which indicated that a 4&#x2013;16-Hz modulation rate significantly benefits speech perception. An example envelope (red) extracted from a sentence (blue) is shown in <xref ref-type="fig" rid="F1">Figure 1A</xref>. A sphere was then generated from the filtered amplitude envelope information with a fixed amplitude modulation depth of 75%. Changes in the volume of the sphere were synced with the changes in the acoustic amplitude envelope of the sentences (in isolation); see the schematic diagram of the visual stimuli in <xref ref-type="fig" rid="F1">Figure 1B</xref>. The videos were rendered into 896 &#x00D7; 896-pixel movies at 30 frames/s. The audio and video files were temporally aligned and combined into a single video in the AVS video editor (Online Media Technologies Ltd. software). For the AO condition, a video with a blank background was presented. A fixation stimulus was shown at the beginning of the videos in both conditions to alert participants to pay attention to the upcoming stimuli (see <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref> for AO and AV example stimuli used in this study).</p>
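<p>A minimal Python sketch of this envelope-to-sphere pipeline is given below. The Hilbert-based envelope extraction and the radius mapping are assumptions for illustration (the study extracted envelopes in MATLAB and rendered the sphere separately); only the fourth-order Butterworth low-pass at 10 Hz, the 75% modulation depth, and the 30 frames/s rate come from the text.</p>

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def amplitude_envelope(signal, fs=44100, cutoff_hz=10.0, order=4):
    """Smoothed amplitude envelope: magnitude of the analytic signal,
    low-pass filtered with a fourth-order Butterworth at cutoff_hz."""
    env = np.abs(hilbert(signal))
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, env)

def sphere_radii(envelope, fs=44100, fps=30, base_radius=1.0, mod_depth=0.75):
    """Sample the envelope at the video frame rate and map it onto sphere
    radii with a 75% modulation depth (hypothetical mapping)."""
    frames = envelope[:: fs // fps]
    norm = frames / (np.max(np.abs(frames)) + 1e-12)
    return base_radius * ((1.0 - mod_depth) + mod_depth * norm)
```

For a 1-s signal at 30 frames/s this yields 30 radius values, one per rendered video frame, so the sphere's size change stays in sync with the acoustic envelope.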
<p>Speech perception tasks were conducted in a single-walled, sound-attenuated booth. All testing trials were presented through MATLAB. Audio stimuli were processed through an RME Babyface Pro sound card (RME Audio, Haimhausen, Germany) and presented through a speaker (Rokit 5, KRK Systems, Deerfield Beach, FL, United States) positioned at 0&#x00B0; (azimuth) in front of the listener&#x2019;s head. The loudspeaker was calibrated with a Br&#x00FC;el &#x0026; Kj&#x00E6;r sound level meter (Br&#x00FC;el &#x0026; Kj&#x00E6;r Sound &#x0026; Vibration Measurement A/S, N&#x00E6;rum, Denmark). Visual stimuli were shown on a 24-in. touch screen monitor (P2418HT, Dell, Austin, TX, United States).</p>
<p>For practice trials, subjects were presented with 10 audiovisual stimuli with clear speech to familiarize them with both the visual stimuli and the auditory signals. They were then asked to type the sentence they heard into an input window that appeared once the stimulus presentation finished. Correct answers and feedback were provided after each trial. For the testing session, 40 sentences were presented: 20 in the AO condition and 20 in the AV condition, all in multi-talker babble noise. At each SNR level (&#x2212;7, &#x2212;5, &#x2212;3, &#x2212;1, and 1 dB), four sentences were presented in the AO condition and four in the AV condition, resulting in five SNR blocks and a total of 40 target sentences. The presentation order of the SNR blocks was randomized for each participant, and target sentences from the two conditions were randomly ordered within each block. Participants were asked to attend to the monitor until the stimulus was fully presented and then type what they heard. Each target sentence was presented only once. We emphasized that punctuation and capitalization were not required but that correct spelling was a priority. Scoring was based on complete sentences: the percentage of words accurately identified in each sentence was calculated and cross-checked by two trained research assistants in the lab, following our project-scoring guide. For instance, if a word was missing a phoneme or contained a typo but was still clearly the same word (e.g., photograph versus photography), it was scored as correct.</p>
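<p>The word-level scoring can be sketched as follows. This is exact-match scoring after stripping case and punctuation; the project-scoring-guide tolerance for near-miss typos (e.g., &#x201C;photograph&#x201D; versus &#x201C;photography&#x201D;) would require an additional fuzzy-matching rule and is noted only in a comment.</p>

```python
import string

def score_sentence(response, target):
    """Percentage of target words reproduced in the typed response,
    ignoring case and punctuation (each response word can match once).
    Note: the study's scoring guide also accepted near-miss typos such as
    'photograph' for 'photography'; that tolerance is not implemented here."""
    strip = str.maketrans("", "", string.punctuation)
    pool = response.lower().translate(strip).split()
    target_words = target.lower().translate(strip).split()
    hits = 0
    for word in target_words:
        if word in pool:
            pool.remove(word)  # consume the match so repeats are not double-counted
            hits += 1
    return 100.0 * hits / len(target_words)

# 4 of the 6 target words are reproduced:
print(score_sentence("steam hissed from valve", "Steam hissed from the broken valve."))
```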
</sec>
</sec>
<sec id="S3">
<title>Results</title>
<p><xref ref-type="fig" rid="F2">Figure 2A</xref> shows the average word recognition accuracy as a function of SNR level in the AO and AV conditions. As seen in <xref ref-type="fig" rid="F2">Figure 2A</xref>, the average subject responses in both the AV (red curve) and AO (blue curve) conditions follow an S-shaped perceptual (i.e., psychometric) curve. At the intermediate SNR levels of &#x2212;3 and &#x2212;1 dB, an approximately 20% improvement in mean word recognition accuracy was observed when synchronized visual cues were provided (AO to AV: 43 to 63% at the &#x2212;3-dB SNR and 67 to 87% at the &#x2212;1-dB SNR). No audiovisual benefit was observed at the other SNRs.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p><bold>(A)</bold> Average word recognition accuracy results as a function of various SNR levels. The blue circle and red square symbols represent accuracy results at the audio-only (AO) and audiovisual (AV) conditions, respectively. <bold>(B)</bold> Average AV benefit scores as a function of various SNR levels calculated by Eq. (1) (i.e., <inline-formula><mml:math id="INEQ2"><mml:mfrac><mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>V</mml:mi></mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>O</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>100</mml:mn><mml:mo>-</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>O</mml:mi></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula>). Error bars represent standard deviations of the mean.</p></caption>
<graphic xlink:href="fnins-15-678029-g002.tif"/>
</fig>
<p>A two-way repeated-measures analysis of variance (RM-ANOVA) with factors of condition (AO and AV) and SNR (&#x2212;7, &#x2212;5, &#x2212;3, &#x2212;1, and 1 dB) was employed, with sentence recognition accuracy as the dependent variable. The results showed significant main effects of condition (<italic>F</italic><sub>1,16</sub> = 52.9, <italic>p</italic> &#x003C; 0.001, <italic>&#x03B7;</italic><sup>2</sup> = 0.768) and SNR (<italic>F</italic><sub>4,64</sub> = 297.1, <italic>p</italic> &#x003C; 0.001, <italic>&#x03B7;</italic><sup>2</sup> = 0.949) and a significant interaction between condition and SNR (<italic>F</italic><sub>4,64</sub> = 14.7, <italic>p</italic> &#x003C; 0.001, <italic>&#x03B7;</italic><sup>2</sup> = 0.479). <italic>Post hoc</italic> pairwise comparisons with Bonferroni correction were performed to follow up the interaction between condition and SNR. At the intermediate SNR levels of &#x2212;3 and &#x2212;1 dB, there were significant accuracy benefits in the AV condition compared with the AO condition (<italic>p</italic> &#x003C; 0.001 for both). No significant difference was observed at the other SNR levels.</p>
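<p>The post hoc step can be sketched as paired <italic>t</italic>-tests at each SNR with a Bonferroni-adjusted alpha; the data below are randomly generated placeholders for illustration, not the study&#x2019;s measurements.</p>

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
snrs = [-7, -5, -3, -1, 1]
n_subjects = 17

# Placeholder percent-correct scores: one value per subject per condition per SNR.
ao = {snr: rng.uniform(0.0, 100.0, n_subjects) for snr in snrs}
av = {snr: rng.uniform(0.0, 100.0, n_subjects) for snr in snrs}

alpha = 0.05 / len(snrs)  # Bonferroni correction over the five SNR levels
for snr in snrs:
    t, p = ttest_rel(av[snr], ao[snr])  # paired AV vs. AO comparison, df = 16
    print(f"SNR {snr:+d} dB: t(16) = {t:.2f}, p = {p:.3f}, significant = {p < alpha}")
```

With five comparisons, the per-test criterion drops to 0.05/5 = 0.01, consistent with the corrected thresholds reported here.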
<p>In particular, the significant interaction between condition and SNR indicates that audiovisual benefits were optimal within a certain SNR range. Performance at the lowest (&#x2212;7 dB) and highest (1 dB) SNRs was attributed to floor and ceiling effects, respectively; the intermediate zone for optimal audiovisual benefits extends from &#x2212;3 to &#x2212;1 dB SNR. To quantify the relative amount of gain from integrating auditory and visual cues, we applied the audiovisual benefit score formula from <xref ref-type="bibr" rid="B40">Sumby and Pollack (1954)</xref>:</p>
<disp-formula id="S3.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mrow>
<mml:mtext>Audiovisual&#x2009;Benefits</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>V</mml:mi></mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>O</mml:mi></mml:mrow></mml:mrow>
<mml:mrow><mml:mn>100</mml:mn><mml:mo>-</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>O</mml:mi></mml:mrow></mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
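<p>Eq. (1) expresses the AV gain as a fraction of the headroom above AO performance, correcting for the ceiling. A minimal sketch, using the mean accuracies reported above for the &#x2212;3-dB SNR:</p>

```python
def audiovisual_benefit(av, ao):
    """Eq. (1): (AV - AO) / (100 - AO), with AV and AO as percent-correct
    scores; undefined when AO performance is already at ceiling (100%)."""
    if ao >= 100.0:
        raise ValueError("benefit undefined when AO performance is at ceiling")
    return (av - ao) / (100.0 - ao)

# Mean accuracies at the -3-dB SNR: AO = 43%, AV = 63%.
print(round(audiovisual_benefit(63.0, 43.0), 3))  # 0.351
```

Negative scores indicate audiovisual interference, as observed at the 1-dB SNR.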
<p><xref ref-type="fig" rid="F2">Figure 2B</xref> displays the audiovisual benefit curve, corrected for the ceiling effect, as a function of SNR level. The results show that significant audiovisual benefits were observed only at the intermediate SNRs of &#x2212;3 and &#x2212;1 dB [<italic>t</italic>(16) &#x003E; 3.1, <italic>p</italic> &#x003C; 0.008 for both cases, one-sample <italic>t</italic>-tests]. Notably, significant audiovisual interference was observed at the highest SNR of 1 dB [<italic>t</italic>(16) = &#x2212;2.9, <italic>p</italic> = 0.01, one-sample <italic>t</italic>-test]. This intriguing observation is discussed in detail below.</p>
</sec>
<sec id="S4">
<title>Discussion</title>
<p>We explored the optimal gain provided by audiovisual integration for speech perception in noise and established an intermediate SNR zone at which audiovisual integration generated the greatest benefits. Our study found that, at the sentence level, the audiovisual integration benefits from temporal coherence across the two modalities are optimal at the &#x2212;3- and &#x2212;1-dB SNRs. Meanwhile, no audiovisual benefit was observed at the &#x2212;7-dB SNR, and significant interference between auditory and visual signals occurred at the 1-dB SNR. Very early studies suggested that visual contributions play a more important role as the SNR decreases (<xref ref-type="bibr" rid="B40">Sumby and Pollack, 1954</xref>). Similar findings were reported in <xref ref-type="bibr" rid="B12">Grant and Braida&#x2019;s (1991)</xref> study. When evaluating the articulation index for AV input, they tested visual-only (VO), AO, and AV conditions using IEEE sentences at various SNRs (approximately &#x2212;11 to +2 dB). Consistent with <xref ref-type="bibr" rid="B40">Sumby and Pollack (1954)</xref>, they found that the absolute contribution of lip-reading is greatest when the auditory signal is most degraded. However, more recent work on this topic has pointed out that maximal audiovisual integration gains occur within a special zone of SNRs. <xref ref-type="bibr" rid="B35">Ross et al. (2007a)</xref> found that the window for maximal integration spanned &#x2212;24 to 0 dB SNR, with the locus of the audiovisual benefit at &#x2212;12 dB SNR. <xref ref-type="bibr" rid="B25">Liu et al. (2013)</xref> found that audiovisual enhancement was greatest in both behavioral and EEG data at &#x2212;12 dB SNR, aligned with the findings of <xref ref-type="bibr" rid="B35">Ross et al. (2007a)</xref>. Taken together with our findings, there is a special range of SNRs at which audiovisual benefits reach their optimal level in both word- and sentence-level speech perception.</p>
<p>More importantly, the current findings show that temporal envelope information delivered through the visual channel can serve as a reliable cue for audiovisual speech perception across various noise levels. Our previous studies revealed that a visual analog temporally synchronized with the acoustic amplitude envelope (i.e., the congruent condition) significantly improved speech intelligibility in noise, whereas incongruent visual stimuli disrupted the foundation for integrating the auditory and visual stimuli (<xref ref-type="bibr" rid="B44">Yuan et al., 2020</xref>). Furthermore, the benefit from the congruent visual analog was optimized with specific temporal envelope characteristics: a 10-Hz modulation rate and 75% modulation depth (<xref ref-type="bibr" rid="B43">Yuan et al., in press</xref>). In sum, the findings from this series of studies support the view that the temporal characteristics of visual inputs play a fundamental role in audiovisual integration for speech perception and that audiovisual benefits vary nonlinearly with noise level.</p>
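<p>For illustration, the envelope parameters reported above (10-Hz modulation rate, 75% modulation depth) can be sketched as a simple sinusoidally modulated envelope. This is our own minimal construction, not the stimulus-generation code used in the cited studies; the function name and sinusoidal form are assumptions.</p>

```python
import numpy as np

def modulated_envelope(duration_s=1.0, fs=1000, rate_hz=10.0, depth=0.75):
    """Sketch of an amplitude envelope with a given modulation rate and depth.

    Illustrative only: a sinusoid oscillating between (1 - depth) and
    (1 + depth), normalized so the peak is 1.0.
    """
    t = np.arange(0, duration_s, 1.0 / fs)
    env = 1.0 + depth * np.sin(2.0 * np.pi * rate_hz * t)
    return t, env / env.max()
```

<p>With depth = 0.75, the normalized envelope swings between 1.0 and 0.25/1.75 (about 0.14), i.e., a deep but non-zero modulation of the visual analog's size or luminance.</p>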
<p><xref ref-type="bibr" rid="B35">Ross et al. (2007a)</xref> noted that their audiovisual speech perception behavioral results were not completely consistent with the interpretation of inverse effectiveness (<xref ref-type="bibr" rid="B31">Meredith and Stein, 1986</xref>); that is, the multisensory gain was not greatest when the unisensory input (here, the auditory signal) was weakest. The underlying mechanisms of multisensory integration have been extensively investigated, and the principles of multisensory integration (spatial, temporal, and inverse effectiveness) are widely accepted (<xref ref-type="bibr" rid="B31">Meredith and Stein, 1986</xref>; <xref ref-type="bibr" rid="B30">Meredith et al., 1992</xref>; <xref ref-type="bibr" rid="B42">Wallace et al., 1993</xref>). According to the principle of inverse effectiveness, multisensory interaction at the cellular level can be superadditive when each unisensory input alone elicits a relatively weak neural discharge (<xref ref-type="bibr" rid="B31">Meredith and Stein, 1986</xref>). Studies of audiovisual interactions in early evoked brain activity have followed the principle of inverse effectiveness (<xref ref-type="bibr" rid="B37">Senkowski et al., 2011</xref>); in that study, the volume of nonspeech auditory stimuli or the luminance of visual stimuli was modulated under unisensory and bisensory conditions. However, unlike in such electroencephalography (EEG) studies, it would be arbitrary to simply generalize the principles of multisensory integration from the single-cell response level to human behavioral studies (<xref ref-type="bibr" rid="B38">Stein and Stanford, 2008</xref>; <xref ref-type="bibr" rid="B16">Holmes, 2009</xref>; <xref ref-type="bibr" rid="B39">Stein et al., 2009</xref>), especially with speech perception tasks. In our present study, no audiovisual benefit was observed at the lowest SNR (&#x2212;7 dB). It is therefore reasonable to suggest that a minimal level of auditory input is required before speech perception can be most effectively enhanced by visual input (<xref ref-type="bibr" rid="B35">Ross et al., 2007a</xref>). It should be noted that our study tested only four sentences (20 words) per SNR condition; this small number of sentences may lead to an inaccurate characterization of audiovisual gain. <xref ref-type="bibr" rid="B35">Ross et al. (2007a)</xref> explored this issue with various gain functions [gain in percent, (AV-AO) &#x00D7; 100/AV; gain corrected for the ceiling effect, (AV-AO)/(100-AO); gain in dB] and found that all three functions characterize audiovisual gain well even with a small sample size (25 words per noise level). Future work will need to examine the effects of the number of stimuli and the type of audiovisual gain function in audiovisual speech perception with analog visual cues.</p>
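<p>The first two gain functions above are straightforward to compute from AO and AV percent-correct scores. A minimal sketch follows (the dB gain is omitted because it requires threshold measurements rather than percent scores; function names are our own):</p>

```python
def gain_percent(av, ao):
    """Gain in percent: (AV - AO) * 100 / AV, with AV and AO as
    percent-correct scores in the audiovisual and audio-only conditions."""
    return (av - ao) * 100.0 / av

def gain_ceiling_corrected(av, ao):
    """Gain corrected for the ceiling effect: (AV - AO) / (100 - AO).
    Normalizes the benefit by the room left for improvement in AO."""
    return (av - ao) / (100.0 - ao)
```

<p>For example, with AV = 80% and AO = 60% correct, the percent gain is 25 while the ceiling-corrected gain is 0.5; the latter rewards improvement more when AO performance is already near ceiling.</p>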
<p>Speech perception is a continuous, context-sensitive perceptual categorization (<xref ref-type="bibr" rid="B18">Holt and Lotto, 2010</xref>) rather than a set of binary categorical responses (<xref ref-type="bibr" rid="B26">Ma et al., 2009</xref>). Nonspeech sounds similar in spectral or temporal characteristics to speech signals have been reported to influence speech categorization (<xref ref-type="bibr" rid="B17">Holt, 2005</xref>), demonstrating that general auditory processes are involved in relating speech signals to their contexts. Accordingly, <xref ref-type="bibr" rid="B3">Bizley et al. (2016)</xref> proposed an early stage of multisensory integration that may form a physiological substrate for the bottom-up grouping of auditory and visual stimuli into audiovisual objects. In our results, significant interference across the auditory and visual modalities occurred at the highest SNR (+1 dB), meaning that no audiovisual benefit was observed at this level. There are two possible explanations for this phenomenon. First, when environmental noise is of sufficient magnitude to mask the speech signal, visually delivered amplitude envelope information would contribute a co-modulation masking release function (<xref ref-type="bibr" rid="B14">Hall and Grose, 1988</xref>, <xref ref-type="bibr" rid="B15">1990</xref>; <xref ref-type="bibr" rid="B32">Moore et al., 1990</xref>) that helps release the masked target auditory signal from the noise. The visual analog of the amplitude envelope is itself a cue complementary to the auditory cues: the auditory event is the primary perceptual source, as proposed in the GA theory (<xref ref-type="bibr" rid="B5">Diehl and Kluender, 1989</xref>), and the visual channel provides extra assistance by transferring the same, or a subset of, the amplitude envelope information of the target signal. The target signal is thereby enhanced and released from the background noise. We defined this function as bimodal co-modulation masking self-release (BCMSR; <xref ref-type="bibr" rid="B43">Yuan et al., in press</xref>). If the auditory signal is intelligible and unambiguous by itself (for instance, when the target signal is 1 dB louder than the background noise), the co-modulation across visual and auditory signals would not yield significant enhancement. Second, higher cognitive processes may be essential for top-down attention shifting in visual&#x2013;tactile interaction (<xref ref-type="bibr" rid="B21">Kanayama et al., 2012</xref>) or audiovisual interaction (<xref ref-type="bibr" rid="B3">Bizley et al., 2016</xref>). When the target auditory signal is sufficiently clear, subjects might shift their attention from co-modulating the visual presentation of the amplitude envelope with the (already highly intelligible) target auditory signal to the background noise (multi-talker babble in the present study). This attention shifting may cause significant interference with audiovisual benefits at higher SNRs (see <xref ref-type="bibr" rid="B28">Maddox et al., 2015</xref>; <xref ref-type="bibr" rid="B3">Bizley et al., 2016</xref>, for reviews of divided-attention tasks in multisensory integration).</p>
</sec>
<sec id="S5">
<title>Conclusion</title>
<p>Speech perception frequently occurs in nonoptimal listening conditions, and understanding speech in noisy environments is challenging for everyone. As a multisensory integration process, audiovisual integration in speech perception requires salient temporal cues to enhance both speech detection and tracking. The amplitude envelope, a reliable source of temporal cues, can be delivered through different sensory modalities when auditory ability is compromised. Integration across sensory modalities also requires certain SNR levels to yield its optimal benefits: the SNR can be neither too low, because a minimal level of auditory input is required before speech perception can be most effectively enhanced by visual input, nor too high, because top-down modulation from higher cognitive processes may shift attention from the target to the background noise. Further research will focus on testing with more individualized SNR conditions. In conclusion, the temporal cue is a critical visual characteristic facilitating speech perception in noise. In adverse hearing environments, listeners can benefit from dynamic temporal visual information that is correlated with the auditory speech signal but does not contain any information about articulatory gestures.</p>
</sec>
<sec id="S6">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>; further inquiries can be directed to the corresponding author/s.</p>
</sec>
<sec id="S7">
<title>Ethics Statement</title>
<p>The studies involving human participants were reviewed and approved by Institutional Review Board of University of Florida. The patients/participants provided their written informed consent to participate in this study.</p>
</sec>
<sec id="S8">
<title>Author Contributions</title>
<p>YY and YO designed the experiments, analyzed the data, wrote the article, and discussed the results at all stages. YY, YL, RD, and AW performed the experiments. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="financial-disclosure">
<p><bold>Funding.</bold> This work was supported by an internal funding source from the College of Public Health &#x0026; Health Professions of the University of Florida.</p>
</fn>
</fn-group>
<ack>
<p>The authors would like to thank Laurie Gauger for her valuable comments which greatly improved the quality of the manuscript.</p>
</ack>
<sec id="S11" sec-type="supplementary material"><title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fnins.2021.678029/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fnins.2021.678029/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.ZIP" id="SM1" mimetype="application/zip" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Atilgan</surname> <given-names>H.</given-names></name> <name><surname>Town</surname> <given-names>S. M.</given-names></name> <name><surname>Wood</surname> <given-names>K. C.</given-names></name> <name><surname>Jones</surname> <given-names>G. P.</given-names></name> <name><surname>Maddox</surname> <given-names>R. K.</given-names></name> <name><surname>Lee</surname> <given-names>A. K. C.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Integration of visual information in auditory cortex promotes auditory scene analysis through multisensory binding.</article-title> <source><italic>Neuron</italic></source> <volume>97</volume> <fpage>640</fpage>&#x2013;<lpage>655.e4</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2017.12.034</pub-id> <pub-id pub-id-type="pmid">29395914</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bernstein</surname> <given-names>L. E.</given-names></name> <name><surname>Auer</surname> <given-names>E. T.</given-names> <suffix>Jr.</suffix></name> <name><surname>Takayanagi</surname> <given-names>S.</given-names></name></person-group> (<year>2004</year>). <article-title>Auditory speech detection in noise enhanced by lipreading.</article-title> <source><italic>Speech Commun.</italic></source> <volume>44</volume> <fpage>5</fpage>&#x2013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1016/j.specom.2004.10.011</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bizley</surname> <given-names>J. K.</given-names></name> <name><surname>Maddox</surname> <given-names>R. K.</given-names></name> <name><surname>Lee</surname> <given-names>A. K. C.</given-names></name></person-group> (<year>2016</year>). <article-title>Defining auditory-visual objects?: behavioral tests and physiological mechanisms.</article-title> <source><italic>Trends Neurosci.</italic></source> <volume>39</volume> <fpage>74</fpage>&#x2013;<lpage>85</lpage>. <pub-id pub-id-type="doi">10.1016/j.tins.2015.12.007</pub-id> <pub-id pub-id-type="pmid">26775728</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Campbell</surname> <given-names>R.</given-names></name> <name><surname>Dodd</surname> <given-names>B.</given-names></name></person-group> (<year>1980</year>). <article-title>Hearing by eye.</article-title> <source><italic>Q. J. Exp. Psychol.</italic></source> <volume>32</volume> <fpage>85</fpage>&#x2013;<lpage>99</lpage>.</citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Diehl</surname> <given-names>R. L.</given-names></name> <name><surname>Kluender</surname> <given-names>K. R.</given-names></name></person-group> (<year>1989</year>). <article-title>On the objects of speech perception.</article-title> <source><italic>Ecol. Psychol.</italic></source> <volume>1</volume> <fpage>121</fpage>&#x2013;<lpage>144</lpage>. <pub-id pub-id-type="doi">10.1207/s15326969eco0102_2</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Diehl</surname> <given-names>R. L.</given-names></name> <name><surname>Lotto</surname> <given-names>A. J.</given-names></name> <name><surname>Holt</surname> <given-names>L. L.</given-names></name></person-group> (<year>2004</year>). <article-title>Speech perception.</article-title> <source><italic>Annu. Rev. Psychol</italic></source> <volume>55</volume> <fpage>149</fpage>&#x2013;<lpage>179</lpage>. <pub-id pub-id-type="doi">10.1146/annurev.psych.55.090902.142028</pub-id> <pub-id pub-id-type="pmid">14744213</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Drullman</surname> <given-names>R.</given-names></name> <name><surname>Festen</surname> <given-names>J. M.</given-names></name> <name><surname>Plomp</surname> <given-names>R.</given-names></name></person-group> (<year>1994</year>). <article-title>Effect of temporal envelope smearing on speech reception.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>95</volume> <fpage>1053</fpage>&#x2013;<lpage>1064</lpage>. <pub-id pub-id-type="doi">10.1121/1.408467</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Erber</surname> <given-names>N. P.</given-names></name></person-group> (<year>1975</year>). <article-title>Auditory-visual perception of speech.</article-title> <source><italic>J. Speech Hear. Disord.</italic></source> <volume>40</volume> <fpage>481</fpage>&#x2013;<lpage>492</lpage>. <pub-id pub-id-type="doi">10.1044/jshd.4004.481</pub-id> <pub-id pub-id-type="pmid">1234963</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Erber</surname> <given-names>N. P.</given-names></name></person-group> (<year>1979</year>). <article-title>Real-time synthesis of optical lip shapes from vowel sounds.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>66</volume> <fpage>1542</fpage>&#x2013;<lpage>1544</lpage>. <pub-id pub-id-type="doi">10.1121/1.383511</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grant</surname> <given-names>K. W.</given-names></name></person-group> (<year>2001</year>). <article-title>The effect of speechreading on masked detection thresholds for filtered speech.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>109</volume> <fpage>2272</fpage>&#x2013;<lpage>2275</lpage>. <pub-id pub-id-type="doi">10.1121/1.1362687</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grant</surname> <given-names>K. W.</given-names></name> <name><surname>Bernstein</surname> <given-names>J. G. W.</given-names></name></person-group> (<year>2019</year>). &#x201C;<article-title>Toward a model of auditory-visual speech intelligibility</article-title>,&#x201D; in <source><italic>Multisensory Processes</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Lee</surname> <given-names>A. K. C.</given-names></name> <name><surname>Wallace</surname> <given-names>M. T.</given-names></name> <name><surname>Coffin</surname> <given-names>A. B.</given-names></name> <name><surname>Popper</surname> <given-names>A. N.</given-names></name> <name><surname>Fay</surname> <given-names>R. R.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>33</fpage>&#x2013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-10461-0_3</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grant</surname> <given-names>K. W.</given-names></name> <name><surname>Braida</surname> <given-names>L. D.</given-names></name></person-group> (<year>1991</year>). <article-title>Evaluating the articulation index for auditory&#x2013;visual input.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>89</volume> <fpage>2952</fpage>&#x2013;<lpage>2960</lpage>. <pub-id pub-id-type="doi">10.1121/1.400733</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grant</surname> <given-names>K. W.</given-names></name> <name><surname>Seitz</surname> <given-names>P. F.</given-names></name></person-group> (<year>2000</year>). <article-title>The use of visible speech cues for improving auditory detection of spoken sentences.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>108</volume>:<issue>1197</issue>. <pub-id pub-id-type="doi">10.1121/1.1288668</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hall</surname> <given-names>J. W.</given-names></name> <name><surname>Grose</surname> <given-names>J. H.</given-names></name></person-group> (<year>1988</year>). <article-title>Comodulation masking release: evidence for multiple cues.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>84</volume> <fpage>1669</fpage>&#x2013;<lpage>1675</lpage>. <pub-id pub-id-type="doi">10.1121/1.397182</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hall</surname> <given-names>J. W.</given-names></name> <name><surname>Grose</surname> <given-names>J. H.</given-names></name></person-group> (<year>1990</year>). <article-title>Comodulation masking release and auditory grouping.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>88</volume> <fpage>119</fpage>&#x2013;<lpage>125</lpage>. <pub-id pub-id-type="doi">10.1121/1.399957</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Holmes</surname> <given-names>N. P.</given-names></name></person-group> (<year>2009</year>). <article-title>The principle of inverse effectiveness in multisensory integration: some statistical considerations.</article-title> <source><italic>Brain Topogr.</italic></source> <volume>21</volume> <fpage>168</fpage>&#x2013;<lpage>176</lpage>. <pub-id pub-id-type="doi">10.1007/s10548-009-0097-2</pub-id> <pub-id pub-id-type="pmid">19404728</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Holt</surname> <given-names>L. L.</given-names></name></person-group> (<year>2005</year>). <article-title>Temporally nonadjacent nonlinguistic sounds affect speech categorization.</article-title> <source><italic>Psychol. Sci.</italic></source> <volume>16</volume> <fpage>305</fpage>&#x2013;<lpage>312</lpage>. <pub-id pub-id-type="doi">10.1111/j.0956-7976.2005.01532.x</pub-id> <pub-id pub-id-type="pmid">15828978</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Holt</surname> <given-names>L. L.</given-names></name> <name><surname>Lotto</surname> <given-names>A. J.</given-names></name></person-group> (<year>2010</year>). <article-title>Speech perception as categorization.</article-title> <source><italic>Atten. Percept. Psychophys.</italic></source> <volume>72</volume> <fpage>1218</fpage>&#x2013;<lpage>1227</lpage>. <pub-id pub-id-type="doi">10.3758/app.72.5.1218</pub-id> <pub-id pub-id-type="pmid">20601702</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><collab>Institute of Electrical and Electronic Engineers.</collab> (<year>1969</year>). <article-title>IEEE recommended practice for speech quality measures.</article-title> <source><italic>IEEE</italic></source> <volume>297</volume> <fpage>1</fpage>&#x2013;<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1109/IEEESTD.1969.7405210</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jaekl</surname> <given-names>P.</given-names></name> <name><surname>Pesquita</surname> <given-names>A.</given-names></name> <name><surname>Alsius</surname> <given-names>A.</given-names></name> <name><surname>Munhall</surname> <given-names>K.</given-names></name> <name><surname>Soto-Faraco</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>The contribution of dynamic visual cues to audiovisual speech perception.</article-title> <source><italic>Neuropsychologia</italic></source> <volume>75</volume> <fpage>402</fpage>&#x2013;<lpage>410</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuropsychologia.2015.06.025</pub-id> <pub-id pub-id-type="pmid">26100561</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kanayama</surname> <given-names>N.</given-names></name> <name><surname>Tam&#x00E8;</surname> <given-names>L.</given-names></name> <name><surname>Ohira</surname> <given-names>H.</given-names></name> <name><surname>Pavani</surname> <given-names>F.</given-names></name></person-group> (<year>2012</year>). <article-title>Top down influence on visuo-tactile interaction modulates neural oscillatory responses.</article-title> <source><italic>Neuroimage</italic></source> <volume>59</volume> <fpage>3406</fpage>&#x2013;<lpage>3417</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2011.11.076</pub-id> <pub-id pub-id-type="pmid">22173297</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Levitt</surname> <given-names>H.</given-names></name></person-group> (<year>1971</year>). <article-title>Transformed up-down methods in psychoacoustics.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>49</volume> <fpage>467</fpage>&#x2013;<lpage>477</lpage>. <pub-id pub-id-type="doi">10.1121/1.1912375</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liberman</surname> <given-names>A. M.</given-names></name> <name><surname>Mattingly</surname> <given-names>I. G.</given-names></name></person-group> (<year>1985</year>). <article-title>The motor theory of speech perception revised<sup>&#x2217;</sup>.</article-title> <source><italic>Cognition</italic></source> <volume>21</volume> <fpage>1</fpage>&#x2013;<lpage>36</lpage>. <pub-id pub-id-type="doi">10.1016/0010-0277(85)90021-6</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liberman</surname> <given-names>A. M.</given-names></name> <name><surname>Cooper</surname> <given-names>F. S.</given-names></name> <name><surname>Shankweiler</surname> <given-names>D. P.</given-names></name> <name><surname>Studdert-Kennedy</surname> <given-names>M.</given-names></name></person-group> (<year>1967</year>). <article-title>Perception of the speech code.</article-title> <source><italic>Psychol. Rev.</italic></source> <volume>74</volume>:<issue>431</issue>.</citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>B.</given-names></name> <name><surname>Lin</surname> <given-names>Y.</given-names></name> <name><surname>Gao</surname> <given-names>X.</given-names></name> <name><surname>Dang</surname> <given-names>J.</given-names></name></person-group> (<year>2013</year>). <article-title>Correlation between audio-visual enhancement of speech in different noise environments and SNR: a combined behavioral and electrophysiological study.</article-title> <source><italic>Neuroscience</italic></source> <volume>247</volume> <fpage>145</fpage>&#x2013;<lpage>151</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroscience.2013.05.007</pub-id> <pub-id pub-id-type="pmid">23673276</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>W. J.</given-names></name> <name><surname>Zhou</surname> <given-names>X.</given-names></name> <name><surname>Ross</surname> <given-names>L. A.</given-names></name> <name><surname>Foxe</surname> <given-names>J. J.</given-names></name> <name><surname>Parra</surname> <given-names>L. C.</given-names></name></person-group> (<year>2009</year>). <article-title>Lip-reading aids word recognition most in moderate noise: a Bayesian explanation using high-dimensional feature space.</article-title> <source><italic>PLoS One</italic></source> <volume>4</volume>:<issue>e4638</issue>. <pub-id pub-id-type="doi">10.1371/journal.pone.0004638</pub-id> <pub-id pub-id-type="pmid">19259259</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>MacLeod</surname> <given-names>A.</given-names></name> <name><surname>Summerfield</surname> <given-names>Q.</given-names></name></person-group> (<year>1990</year>). <article-title>A procedure for measuring auditory and audio-visual speech-reception thresholds for sentences in noise: rationale, evaluation, and recommendations for use.</article-title> <source><italic>Br. J. Audiol.</italic></source> <volume>24</volume> <fpage>29</fpage>&#x2013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.3109/03005369009077840</pub-id> <pub-id pub-id-type="pmid">2317599</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Maddox</surname> <given-names>R. K.</given-names></name> <name><surname>Atilgan</surname> <given-names>H.</given-names></name> <name><surname>Bizley</surname> <given-names>J. K.</given-names></name> <name><surname>Lee</surname> <given-names>A. K. C.</given-names></name></person-group> (<year>2015</year>). <article-title>Auditory selective attention is enhanced by a task-irrelevant temporally coherent visual stimulus in human listeners.</article-title> <source><italic>ELife</italic></source> <volume>4</volume>:<issue>e04995</issue>. <pub-id pub-id-type="doi">10.7554/eLife.04995</pub-id> <pub-id pub-id-type="pmid">25654748</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McGurk</surname> <given-names>H.</given-names></name> <name><surname>MacDonald</surname> <given-names>J.</given-names></name></person-group> (<year>1976</year>). <article-title>Hearing lips and seeing voices.</article-title> <source><italic>Nature</italic></source> <volume>264</volume>:<issue>746</issue>. <pub-id pub-id-type="doi">10.1038/264746a0</pub-id> <pub-id pub-id-type="pmid">1012311</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meredith</surname> <given-names>A.</given-names></name> <name><surname>Wallace</surname> <given-names>T.</given-names></name> <name><surname>Stein</surname> <given-names>E.</given-names></name></person-group> (<year>1992</year>). <article-title>Visual, auditory and somatosensory convergence in output neurons of the cat superior colliculus: multisensory properties of the tecto-reticulo-spinal projection.</article-title> <source><italic>Exp. Brain Res.</italic></source> <volume>88</volume> <fpage>181</fpage>&#x2013;<lpage>186</lpage>. <pub-id pub-id-type="doi">10.1007/bf02259139</pub-id> <pub-id pub-id-type="pmid">1541354</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meredith</surname> <given-names>M. A.</given-names></name> <name><surname>Stein</surname> <given-names>B. E.</given-names></name></person-group> (<year>1986</year>). <article-title>Visual, Auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration.</article-title> <source><italic>J. Neurophysiol.</italic></source> <volume>56</volume> <fpage>640</fpage>&#x2013;<lpage>662</lpage>. <pub-id pub-id-type="doi">10.1152/jn.1986.56.3.640</pub-id> <pub-id pub-id-type="pmid">3537225</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moore</surname> <given-names>B. C. J.</given-names></name> <name><surname>Glasberg</surname> <given-names>B. R.</given-names></name> <name><surname>Schooneveldt</surname> <given-names>G. P.</given-names></name></person-group> (<year>1990</year>). <article-title>Across-channel masking and comodulation masking release.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>87</volume> <fpage>1683</fpage>&#x2013;<lpage>1694</lpage>. <pub-id pub-id-type="doi">10.1121/1.399416</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nasreddine</surname> <given-names>Z. S.</given-names></name> <name><surname>Phillips</surname> <given-names>N. A.</given-names></name> <name><surname>B&#x00E9;dirian</surname> <given-names>V.</given-names></name> <name><surname>Charbonneau</surname> <given-names>S.</given-names></name> <name><surname>Whitehead</surname> <given-names>V.</given-names></name> <name><surname>Collin</surname> <given-names>I.</given-names></name><etal/></person-group> (<year>2005</year>). <article-title>The montreal cognitive assessment, MoCA: a brief screening tool for mild cognitive impairment.</article-title> <source><italic>J. Am. Geriatr. Soc.</italic></source> <volume>53</volume> <fpage>695</fpage>&#x2013;<lpage>699</lpage>. <pub-id pub-id-type="doi">10.1111/j.1532-5415.2005.53221.x</pub-id> <pub-id pub-id-type="pmid">15817019</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x2019;Neill</surname> <given-names>J. J.</given-names></name></person-group> (<year>1954</year>). <article-title>Contributions of the visual components of oral symbols to speech comprehension.</article-title> <source><italic>J. Speech Hear. Disord.</italic></source> <volume>19</volume> <fpage>429</fpage>&#x2013;<lpage>439</lpage>. <pub-id pub-id-type="doi">10.1044/jshd.1904.429</pub-id> <pub-id pub-id-type="pmid">13222457</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ross</surname> <given-names>L. A.</given-names></name> <name><surname>Saint-Amour</surname> <given-names>D.</given-names></name> <name><surname>Leavitt</surname> <given-names>V. M.</given-names></name> <name><surname>Javitt</surname> <given-names>D. C.</given-names></name> <name><surname>Foxe</surname> <given-names>J. J.</given-names></name></person-group> (<year>2007a</year>). <article-title>Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments.</article-title> <source><italic>Cereb. Cortex</italic></source> <volume>17</volume> <fpage>1147</fpage>&#x2013;<lpage>1153</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bhl024</pub-id> <pub-id pub-id-type="pmid">16785256</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ross</surname> <given-names>L. A.</given-names></name> <name><surname>Saint-Amour</surname> <given-names>D.</given-names></name> <name><surname>Leavitt</surname> <given-names>V. M.</given-names></name> <name><surname>Molholm</surname> <given-names>S.</given-names></name> <name><surname>Javitt</surname> <given-names>D. C.</given-names></name> <name><surname>Foxe</surname> <given-names>J. J.</given-names></name></person-group> (<year>2007b</year>). <article-title>Impaired multisensory processing in schizophrenia: deficits in the visual enhancement of speech comprehension under noisy environmental conditions.</article-title> <source><italic>Schizophr. Res.</italic></source> <volume>97</volume> <fpage>173</fpage>&#x2013;<lpage>183</lpage>. <pub-id pub-id-type="doi">10.1016/j.schres.2007.08.008</pub-id> <pub-id pub-id-type="pmid">17928202</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Senkowski</surname> <given-names>D.</given-names></name> <name><surname>Saint-Amour</surname> <given-names>D.</given-names></name> <name><surname>H&#x00F6;fle</surname> <given-names>M.</given-names></name> <name><surname>Foxe</surname> <given-names>J. J.</given-names></name></person-group> (<year>2011</year>). <article-title>Multisensory interactions in early evoked brain activity follow the principle of inverse effectiveness.</article-title> <source><italic>Neuroimage</italic></source> <volume>56</volume> <fpage>2200</fpage>&#x2013;<lpage>2208</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2011.03.075</pub-id> <pub-id pub-id-type="pmid">21497200</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stein</surname> <given-names>B. E.</given-names></name> <name><surname>Stanford</surname> <given-names>T. R.</given-names></name></person-group> (<year>2008</year>). <article-title>Multisensory integration: current issues from the perspective of the single neuron.</article-title> <source><italic>Nat. Rev. Neurosci.</italic></source> <volume>9</volume> <fpage>255</fpage>&#x2013;<lpage>266</lpage>. <pub-id pub-id-type="doi">10.1038/nrn2377</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stein</surname> <given-names>B. E.</given-names></name> <name><surname>Stanford</surname> <given-names>T. R.</given-names></name> <name><surname>Ramachandran</surname> <given-names>R.</given-names></name> <name><surname>Perrault</surname> <given-names>T. J.</given-names></name> <name><surname>Rowland</surname> <given-names>B. A.</given-names></name></person-group> (<year>2009</year>). <article-title>Challenges in quantifying multisensory integration: alternative criteria, models, and inverse effectiveness.</article-title> <source><italic>Exp. Brain Res.</italic></source> <volume>198</volume> <fpage>113</fpage>&#x2013;<lpage>126</lpage>. <pub-id pub-id-type="doi">10.1007/s00221-009-1880-8</pub-id> <pub-id pub-id-type="pmid">19551377</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sumby</surname> <given-names>W. H.</given-names></name> <name><surname>Pollack</surname> <given-names>I.</given-names></name></person-group> (<year>1954</year>). <article-title>Visual contribution to speech intelligibility in noise.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>26</volume> <fpage>212</fpage>&#x2013;<lpage>215</lpage>. <pub-id pub-id-type="doi">10.1121/1.1907309</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Summerfield</surname> <given-names>Q.</given-names></name></person-group> (<year>1992</year>). <article-title>Audio-visual speech perception, lipreading and artificial stimulation.</article-title> <source><italic>Philos. Trans. Biol. Sci.</italic></source> <volume>335</volume> <fpage>71</fpage>&#x2013;<lpage>78</lpage>. <pub-id pub-id-type="doi">10.1016/b978-0-12-460440-7.50010-7</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wallace</surname> <given-names>M. T.</given-names></name> <name><surname>Meredith</surname> <given-names>M. A.</given-names></name> <name><surname>Stein</surname> <given-names>B. E.</given-names></name></person-group> (<year>1993</year>). <article-title>Converging influences from visual, auditory, and somatosensory cortices onto output neurons of the superior colliculus.</article-title> <source><italic>J. Neurophysiol.</italic></source> <volume>69</volume> <fpage>1797</fpage>&#x2013;<lpage>1809</lpage>. <pub-id pub-id-type="doi">10.1152/jn.1993.69.6.1797</pub-id> <pub-id pub-id-type="pmid">8350124</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yuan</surname> <given-names>Y.</given-names></name> <name><surname>Meyers</surname> <given-names>K.</given-names></name> <name><surname>Borges</surname> <given-names>K.</given-names></name> <name><surname>Lleo</surname> <given-names>Y.</given-names></name> <name><surname>Fiorentino</surname> <given-names>K.</given-names></name> <name><surname>Oh</surname> <given-names>Y.</given-names></name></person-group> (<year>in press</year>). <article-title>Effects of visual speech envelope on audiovisual speech perception in multi-talker listening environments.</article-title> <source><italic>J. Speech Lang. Hear. Res.</italic></source></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yuan</surname> <given-names>Y.</given-names></name> <name><surname>Wayland</surname> <given-names>R.</given-names></name> <name><surname>Oh</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>Visual analog of the acoustic amplitude envelope benefits speech perception in noise.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>147</volume> <fpage>EL246</fpage>&#x2013;<lpage>EL251</lpage>. <pub-id pub-id-type="doi">10.1121/10.0000737</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zion Golumbic</surname> <given-names>E. M.</given-names></name> <name><surname>Ding</surname> <given-names>N.</given-names></name> <name><surname>Bickel</surname> <given-names>S.</given-names></name> <name><surname>Lakatos</surname> <given-names>P.</given-names></name> <name><surname>Schevon</surname> <given-names>C. A.</given-names></name> <name><surname>McKhann</surname> <given-names>G. M.</given-names></name><etal/></person-group> (<year>2013</year>). <article-title>Mechanisms underlying selective neuronal tracking of attended speech at a &#x201C;cocktail party.&#x201D;</article-title> <source><italic>Neuron</italic></source> <volume>77</volume> <fpage>980</fpage>&#x2013;<lpage>991</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2012.12.037</pub-id> <pub-id pub-id-type="pmid">23473326</pub-id></citation></ref>
</ref-list>
</back>
</article>