<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2021.760611</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A Speech-Level&#x2013;Based Segmented Model to Decode the Dynamic Auditory Attention States in the Competing Speaker Scenes</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Lei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/743545/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Yihan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>Zhixing</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Wu</surname> <given-names>Ed X.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/58493/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Chen</surname> <given-names>Fei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/561603/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Electrical and Electronic Engineering, Southern University of Science and Technology</institution>, <addr-line>Shenzhen</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Electrical and Electronic Engineering, The University of Hong Kong</institution>, <addr-line>Pokfulam</addr-line>, <country>Hong Kong SAR, China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Yi Du, Institute of Psychology, Chinese Academy of Sciences (CAS), China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Junfeng Li, Chinese Academy of Sciences (CAS), China; Behtash Babadi, University of Maryland, College Park, United States</p></fn>
<corresp id="c001">&#x002A;Correspondence: Fei Chen, <email>fchen@sustech.edu.cn</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>10</day>
<month>02</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>15</volume>
<elocation-id>760611</elocation-id>
<history>
<date date-type="received">
<day>18</day>
<month>08</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>12</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Wang, Wang, Liu, Wu and Chen.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Wang, Wang, Liu, Wu and Chen</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>In competing speaker environments, human listeners need to focus or switch their auditory attention according to dynamic intentions. Reliable cortical tracking of the speech envelope is an effective feature for decoding the target speech from the neural signals. Moreover, previous studies revealed that the root mean square (RMS)&#x2013;level&#x2013;based speech segmentation made a great contribution to the target speech perception with the modulation of sustained auditory attention. This study further investigated the effect of the RMS-level&#x2013;based speech segmentation on the auditory attention decoding (AAD) performance with both sustained and switched attention in the competing speaker auditory scenes. Objective biomarkers derived from the cortical activities were also developed to index the dynamic auditory attention states. In the current study, subjects were asked to concentrate or switch their attention between two competing speaker streams. The neural responses to the higher- and lower-RMS-level speech segments were analyzed <italic>via</italic> the linear temporal response function (TRF) before and after the attention switching from one speaker stream to the other. Furthermore, the AAD performance decoded by the unified TRF decoding model was compared to that by the speech-RMS-level&#x2013;based segmented decoding model with the dynamic change of the auditory attention states. The results showed that the weight of the typical TRF component at a time lag of approximately 100 ms was sensitive to the switching of the auditory attention. Compared to the unified AAD model, the segmented AAD model improved attention decoding performance under both the sustained and switched auditory attention modulations in a wide range of signal-to-masker ratios (SMRs). In the competing speaker scenes, the TRF weight and AAD accuracy could be used as effective indicators to detect the changes of the auditory attention. 
In addition, over a wide range of SMRs (i.e., from 6 to &#x2013;6 dB in this study), the segmented AAD model showed robust decoding performance even with short decision windows, suggesting that this speech-RMS-level&#x2013;based model has the potential to decode dynamic attention states in realistic auditory scenarios.</p>
</abstract>
<kwd-group>
<kwd>auditory attention decoding</kwd>
<kwd>speech-RMS-level segments</kwd>
<kwd>auditory attention switching</kwd>
<kwd>temporal response function</kwd>
<kwd>EEG signals</kwd>
</kwd-group>
<counts>
<fig-count count="4"/>
<table-count count="1"/>
<equation-count count="4"/>
<ref-count count="62"/>
<page-count count="15"/>
<word-count count="11948"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>Introduction</title>
<p>In a competing speaker environment, the target speech perception relies on the modulation of selective auditory attention. A large number of behavioral and neuroimaging studies have investigated the human abilities to selectively track the particular speech stream with sustained auditory attention (e.g., <xref ref-type="bibr" rid="B9">Cherry, 1953</xref>; <xref ref-type="bibr" rid="B48">Shamma and Micheyl, 2010</xref>; <xref ref-type="bibr" rid="B51">Szab&#x00F3; et al., 2016</xref>). Nevertheless, the dynamic change of the auditory attention states often occurs in the real-life environments, which requires the auditory system to reorganize the relevant information of specific auditory objects and reallocate attention resources when the focus of attention switches between different speaker streams (e.g., <xref ref-type="bibr" rid="B23">Fritz et al., 2007</xref>, <xref ref-type="bibr" rid="B22">2013</xref>; <xref ref-type="bibr" rid="B1">Ahveninen et al., 2013</xref>). Some studies also suggested that, in the dynamic auditory scenes, the salient speech features played an important role in the target speech perception through the bottom-up auditory pathways (<xref ref-type="bibr" rid="B34">Kaya and Elhilali, 2014</xref>; <xref ref-type="bibr" rid="B49">Shuai and Elhilali, 2014</xref>). However, it remains unknown whether the dynamic change of the auditory attention states can be reliably decoded from the cortical signals when subjects focus their attention on natural sentences in complex auditory scenes. In addition, the neural mechanisms underlying the sensitive tracking of the target speech stream in complex auditory scenes remain to be uncovered.</p>
<p>Several methods have been proposed to detect selective auditory attention on the basis of the typical electroencephalograph (EEG) features with diverse experimental tasks (e.g., <xref ref-type="bibr" rid="B43">N&#x00E4;&#x00E4;t&#x00E4;nen et al., 1992</xref>; <xref ref-type="bibr" rid="B10">Choi et al., 2013</xref>; <xref ref-type="bibr" rid="B38">Larson and Lee, 2014</xref>; <xref ref-type="bibr" rid="B26">Geravanchizadeh and Roushan, 2021</xref>). In earlier electrophysiological studies, the dynamic states of the auditory attention were captured by comparing the morphology of event-related potential (ERP) components (e.g., the P1&#x2013;N1&#x2013;P2 complex, P300) elicited by the acoustic properties within different auditory stimuli (e.g., <xref ref-type="bibr" rid="B46">Polich et al., 1986</xref>; <xref ref-type="bibr" rid="B53">Tse et al., 2004</xref>; <xref ref-type="bibr" rid="B10">Choi et al., 2013</xref>). Although such ERP-based measurements were extensively used in the brain&#x2013;computer interface speller system (e.g., <xref ref-type="bibr" rid="B20">Donchin et al., 2000</xref>; <xref ref-type="bibr" rid="B32">Hoffmann et al., 2008</xref>), it was an inappropriate method for detecting the dynamic attention changes in the continuous natural speech streams. Recently, some researchers further developed proper experimental paradigms and analytical methods to explore the dynamic switching of the auditory attention under the multi-talker conditions using the EEG signals (e.g., <xref ref-type="bibr" rid="B39">Lee et al., 2014</xref>; <xref ref-type="bibr" rid="B16">Deng et al., 2019</xref>; <xref ref-type="bibr" rid="B24">Geirnaert et al., 2020</xref>; <xref ref-type="bibr" rid="B28">Getzmann et al., 2020</xref>). 
Specifically, two typical characteristics of EEG signals, i.e., the stronger N2 subcomponent and the lateralization of posterior alpha power, were significantly correlated with the spatial auditory attention switching (e.g., <xref ref-type="bibr" rid="B16">Deng et al., 2019</xref>; <xref ref-type="bibr" rid="B28">Getzmann et al., 2020</xref>). Nevertheless, these ERP-based features required cortical responses to be averaged over multiple experimental trials to obtain high-quality time-locked characteristics. Hence, because of the time-consuming process of extracting attention-related features, these ERP-based methods had limited applicability in realistic auditory scenes. Many studies also used common spatial patterns and effective connectivity to decode the dynamic attention states in single-trial EEG signals when subjects performed the dichotic listening tasks (e.g., <xref ref-type="bibr" rid="B24">Geirnaert et al., 2020</xref>; <xref ref-type="bibr" rid="B25">Geravanchizadeh and Gavgani, 2020</xref>). The spatial differences among speakers evoked distinct brain activity patterns and such features provided crucial cues to decode the selective auditory attention. However, in the absence of spatial cues, little was known about the effect of dynamic attention modulation on the target speech perception in multi-speaker conditions.</p>
<p>The recent understanding of the selective auditory attention in the cocktail party problem and the advances of electrophysiological technologies make it possible to decode the auditory attention from EEG signals in the complex auditory scenarios. In the natural continuous speech streams, the extensively used auditory attention decoding (AAD) methods were based on the mapping functions between the speech envelope and the corresponding EEG responses <italic>via</italic> linear and non-linear computational models (e.g., <xref ref-type="bibr" rid="B19">Ding and Simon, 2012b</xref>; <xref ref-type="bibr" rid="B44">O&#x2019;Sullivan et al., 2015</xref>; <xref ref-type="bibr" rid="B13">Crosse et al., 2016</xref>; <xref ref-type="bibr" rid="B11">Ciccarelli et al., 2019</xref>; <xref ref-type="bibr" rid="B14">Das et al., 2020</xref>; <xref ref-type="bibr" rid="B26">Geravanchizadeh and Roushan, 2021</xref>). Specifically, the linear decoder models, such as the temporal response function (TRF), were widely used to decode auditory attention with reasonable accuracy under a wide range of signal-to-masker ratios (SMRs) (<xref ref-type="bibr" rid="B13">Crosse et al., 2016</xref>). Generally, the estimation procedure of linear models was simpler and faster than that of non-linear models. The linear models also provided the interpretable relations between the continuous auditory stimulus and the corresponding EEG responses (e.g., <xref ref-type="bibr" rid="B19">Ding and Simon, 2012b</xref>; <xref ref-type="bibr" rid="B44">O&#x2019;Sullivan et al., 2015</xref>). The non-linear decoding models using deep neural networks (DNNs) can achieve higher AAD accuracies compared to the linear AAD approaches even with short decoding window lengths (e.g., <xref ref-type="bibr" rid="B11">Ciccarelli et al., 2019</xref>; <xref ref-type="bibr" rid="B14">Das et al., 2020</xref>). 
Nevertheless, it was still difficult to interpret the underlying mechanisms for the decoding results by the DNN-based models. Besides, most non-linear decoding models concentrated on feature extraction from EEG signals but ignored the features carried by speech temporal envelopes. Briefly, these effective AAD methods have successfully decoded the auditory attention when subjects kept their attention to a specific target stream throughout the experimental procedure. Several magnetoencephalography and EEG studies also indicated that the AAD methods could track the dynamic changes of attentional states when the competing speakers were presented at the same or different spatial locations (e.g., <xref ref-type="bibr" rid="B2">Akram et al., 2016</xref>; <xref ref-type="bibr" rid="B41">Miran et al., 2018</xref>, <xref ref-type="bibr" rid="B42">2020</xref>; <xref ref-type="bibr" rid="B52">Teoh and Lalor, 2019</xref>). Nevertheless, it remains unclear how the neural responses are affected by the dynamic change of attention states and which speech features make great contributions to capturing changes in auditory attention states (i.e., before or after the auditory attention switching) in the absence of the spatial cues between the competing speakers under different SMR conditions.</p>
<p>In general, selective auditory attention can realize successful perception of the target auditory object by activating the target-related information and inhibiting the irrelevant information (<xref ref-type="bibr" rid="B23">Fritz et al., 2007</xref>; <xref ref-type="bibr" rid="B48">Shamma and Micheyl, 2010</xref>; <xref ref-type="bibr" rid="B51">Szab&#x00F3; et al., 2016</xref>). The target speech perception in noise depends on the robust representation regions of the target signal and the regions that are least affected by the competing speaker stream (<xref ref-type="bibr" rid="B12">Cooke, 2006</xref>; <xref ref-type="bibr" rid="B40">Li and Loizou, 2007</xref>). Specifically, in the competing speaker environments, the salient auditory cues and silent gaps of the auditory stimuli play an important role in target speech perception (e.g., <xref ref-type="bibr" rid="B40">Li and Loizou, 2007</xref>; <xref ref-type="bibr" rid="B55">Vestergaard et al., 2011</xref>; <xref ref-type="bibr" rid="B47">Seibold et al., 2018</xref>). The speech temporal information at low frequency containing the syllable rhythms can also facilitate target speech perception in noisy conditions (e.g., <xref ref-type="bibr" rid="B29">Greenberg et al., 2003</xref>; <xref ref-type="bibr" rid="B55">Vestergaard et al., 2011</xref>). As indicated in the investigations from previous studies (e.g., <xref ref-type="bibr" rid="B33">Kates and Arehart, 2005</xref>; <xref ref-type="bibr" rid="B7">Chen and Loizou, 2012</xref>; <xref ref-type="bibr" rid="B8">Chen and Wong, 2013</xref>), speech envelopes not only revealed the change of relative root mean square (RMS) intensity but also conveyed the phonetic distribution of the whole sentences. 
The analysis of different speech segments on the basis of relative RMS intensity provided an effective way to understand the attentional modulation of target speech perception in the competing speaker environments (<xref ref-type="bibr" rid="B6">Chen and Loizou, 2011</xref>; <xref ref-type="bibr" rid="B58">Wang et al., 2020a</xref>,<xref ref-type="bibr" rid="B59">b</xref>). According to previous studies, the higher- and lower-RMS-level speech segments could be extracted with a threshold of &#x2013;10 dB relative to the overall RMS level of the speech signal (e.g., <xref ref-type="bibr" rid="B33">Kates and Arehart, 2005</xref>; <xref ref-type="bibr" rid="B8">Chen and Wong, 2013</xref>). Higher-RMS-level speech segments contained the voiced parts of the sentences (i.e., most of the vowels and vowel&#x2013;consonant transitions), whereas most silent gaps and weak consonants were located in lower-RMS-level speech segments (<xref ref-type="bibr" rid="B6">Chen and Loizou, 2011</xref>; <xref ref-type="bibr" rid="B8">Chen and Wong, 2013</xref>). Previous studies also demonstrated that higher- and lower-RMS-level&#x2013;based speech segments had different effects on the encoding and decoding of the target speech from the corresponding EEG signals (<xref ref-type="bibr" rid="B57">Wang et al., 2019</xref>, <xref ref-type="bibr" rid="B58">2020a</xref>,<xref ref-type="bibr" rid="B59">b</xref>). Moreover, in cases where the listeners were required to maintain their attention on the target speech stream, the AAD sensitivity and accuracy could be improved by using the time-variant segmented model to decode different types of RMS-level&#x2013;based speech segments (<xref ref-type="bibr" rid="B56">Wang, 2021</xref>). Accordingly, it is valuable to further explore whether the speech-RMS-level&#x2013;based segmented AAD model could reliably track the dynamic change of the auditory attention states in the competing speaker scenes. 
The contribution of different RMS-level&#x2013;based speech segments to attention decoding needs to be studied in auditory attention switching tasks, so as to expand the potential application of the neurofeedback-based AAD system in realistic auditory scenarios.</p>
<p>In the present study, we hypothesized that effective biomarkers can be extracted from the cortical responses to index the dynamic auditory attention states in the competing speaker scenes with a wide range of SMRs. Furthermore, RMS-level&#x2013;dependent speech segmentation would have a significant influence on the decoding performance of selective auditory attention. Hence, the speech-RMS-level&#x2013;based segmented model could have the potential to improve the AAD accuracy and sensitivity with both the sustained and switched auditory attention modulations. In addition, the auditory attention states and the relative SMR levels could jointly affect the AAD abilities in the competing speaker scenes.</p>
</sec>
<sec id="S2" sec-type="materials|methods">
<title>Materials and Methods</title>
<sec id="S2.SS1">
<title>Participants</title>
<p>Sixteen participants (10 males and 6 females) aged between 16 and 27 years took part in this experiment. All participants had normal hearing, with pure-tone thresholds less than 25 dB at 125&#x2013;8,000 Hz. All subjects were native speakers of Mandarin Chinese and provided written informed consent before their participation. The Institution&#x2019;s Ethical Review Board of Southern University of Science and Technology approved the experimental procedures.</p>
</sec>
<sec id="S2.SS2">
<title>Stimuli and Experimental Procedure</title>
<p>The stimuli used in this work were extracted from two Chinese stories narrated by a female Mandarin speaker and a male Mandarin speaker. These stories were divided into approximately 60-s segments. Each experimental trial contained a 60-s speech fragment. The silent gaps within each 60-s fragment were less than 300 ms to avoid unexpected auditory attention shifts. To test the neural responses with the switching of attention, subjects were required to shift their attention from the male speaker to the female speaker at the midpoint of each 60-s segment. Hence, the auditory attention switching divided the whole trial into two different sections (i.e., the first half and the latter half). Specifically, each trial contained a 30-s speech fragment with the attention to the male speaker in the first half, followed by a silent gap with random duration (1&#x223C;2 s), and a 30-s speech fragment with the attention to the female speaker in the latter half. <xref ref-type="fig" rid="F1">Figure 1A</xref> displays the detailed experimental procedure. The male-to-female ratio (MFR) was fixed in each condition, and there were three MFR conditions (i.e., 6, 0, and &#x2013;6 dB) in this study. More specifically, for the conditions at 6- and &#x2013;6-dB MFR levels, the SMR level was changed with the switching of attention from the male to the female speaker stream, whereas the SMR level was unchanged before and after the switching of the auditory attention for the 0-dB MFR condition. The detailed experimental settings about the three MFR conditions are shown in <xref ref-type="fig" rid="F1">Figure 1B</xref>. During the whole experiment, visual instructions were displayed on the screen to control the experimental procedure. The visual instructions were presented on the screen in white against a black background. In each trial, a white cross was first displayed in the middle of the screen without auditory stimuli. 
Then, the character &#x201C;male&#x201D; appeared on the screen to remind the listener to focus on the male speaker stream. Subsequently, the instruction on the screen was changed to &#x201C;female&#x201D; to remind the listener to switch his/her attention to the female speech stream. To avoid the influence of visual changes on the neural responses, the auditory stimuli in the second stage were played 1&#x223C;2 s after the change of visual instruction. Each trial was played once to each subject. Five trials were included in each block. At the end of each block, the participant answered three four-choice questions about the target speech streams. Only blocks in which all questions were answered correctly were retained for further analyses. Two blocks (i.e., 10 trials) were obtained for each condition.</p>
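<p>The relationship between the fixed MFR and the SMR experienced by the listener can be made explicit with a small sketch. It assumes the usual definition of the SMR as the level of the attended (target) stream relative to the unattended (masker) stream, so that the SMR equals the MFR while the male speaker is attended and the negated MFR after the switch to the female speaker. The function name smr_db is hypothetical and is not part of the original analysis.</p>
<preformat preformat-type="code"><![CDATA[
```python
def smr_db(mfr_db, attended):
    """Return the signal-to-masker ratio (dB) of the attended stream.

    mfr_db   : male-to-female level ratio in dB (fixed within a condition)
    attended : "male" or "female" (the stream the listener attends to)
    """
    if attended == "male":
        # Target is the male stream, masker is the female stream.
        return float(mfr_db)
    if attended == "female":
        # Target and masker swap roles, so the ratio changes sign.
        return 0.0 - float(mfr_db)
    raise ValueError("attended must be 'male' or 'female'")


# SMR before (male attended) and after (female attended) the switch
# for the three MFR conditions described in the text.
smr_by_condition = {
    mfr: (smr_db(mfr, "male"), smr_db(mfr, "female")) for mfr in (6, 0, -6)
}
```
]]></preformat>
<p>Under this assumption, only the 0-dB MFR condition keeps the SMR unchanged across the attention switch, which matches the description of the three conditions.</p>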
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p><bold>(A)</bold> An illustration of the experimental procedure. <bold>(B)</bold> Three conditions used in this study. The three conditions fixed the MFR levels at 6, 0, and &#x2013;6 dB, and the SMR level was changed between the first half and the latter half with the switching of the auditory attention from the male to the female speaker stream. The icons used in <bold>(A)</bold> were obtained from <ext-link ext-link-type="uri" xlink:href="https://thenounproject.com/">https://thenounproject.com/</ext-link>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-760611-g001.tif"/>
</fig>
<p>The experiment was performed in a double-walled acoustically shielded room. Mixed auditory stimuli were presented bilaterally <italic>via</italic> earphones at 65-dB sound pressure level. The whole experimental procedure was controlled by the software E-Prime 2. This experiment used 62 electrodes to record the scalp EEG signals at a 500-Hz sampling rate. Two external reference electrodes were placed at the left and right mastoids. An online reference electrode was attached at the nose tip, and the electrooculography signals were recorded by two electrodes located below and above the left eye. The impedance of all EEG electrodes was kept less than 5 k&#x03A9;. During the experiment, all participants were required to reduce body movements.</p>
</sec>
<sec id="S2.SS3">
<title>Data Analyses</title>
<sec id="S2.SS3.SSS1">
<title>Electroencephalograph Signals and Auditory Stimuli Preprocessing</title>
<p>The preprocessing of the EEG signals was conducted with the EEGLAB toolbox (<xref ref-type="bibr" rid="B15">Delorme and Makeig, 2004</xref>). First, a high-pass filter with a cutoff frequency of 0.5 Hz was applied using the windowed-sinc finite impulse response (FIR) filter function in the EEGLAB toolbox. Independent component analysis was then implemented to remove typical artifacts (e.g., eye movements) using the ICLabel toolbox (<xref ref-type="bibr" rid="B45">Pion-Tonachini et al., 2019</xref>). On average, three independent components were removed for each subject. The EEG signals were then filtered at low-frequency bands because the cortical responses at these low frequencies could reliably track the speech envelopes (e.g., <xref ref-type="bibr" rid="B17">Di Liberto et al., 2015</xref>; <xref ref-type="bibr" rid="B44">O&#x2019;Sullivan et al., 2015</xref>; <xref ref-type="bibr" rid="B57">Wang et al., 2019</xref>). Specifically, the EEG signals were high-pass filtered with a zero-phase FIR filter at a cutoff frequency of 2 Hz and low-pass filtered with a zero-phase FIR filter at a cutoff frequency of 8 Hz.</p>
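<p>As an illustration of the 2&#x2013;8-Hz zero-phase band-pass step, a minimal NumPy sketch is given here. It is not the EEGLAB implementation: the function names (bandpass_fir, filter_zero_phase), the Hamming window, and the kernel length of 501 taps are assumptions chosen only for illustration. The sketch relies on the fact that a symmetric, odd-length FIR kernel applied by centered convolution introduces no phase delay.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def bandpass_fir(fs, low_hz, high_hz, numtaps=501):
    """Hamming-windowed sinc band-pass kernel (symmetric, odd length)."""
    if numtaps % 2 == 0:
        raise ValueError("numtaps must be odd for a zero-delay kernel")
    # Sample indices centered on zero, e.g. -250 ... 250 for 501 taps.
    n = np.arange(numtaps) - (numtaps - 1) / 2.0

    def ideal_lowpass(cutoff_hz):
        fc = cutoff_hz / fs  # cutoff in cycles per sample
        return 2.0 * fc * np.sinc(2.0 * fc * n)

    # Band-pass = low-pass at the upper edge minus low-pass at the lower edge.
    kernel = ideal_lowpass(high_hz) - ideal_lowpass(low_hz)
    return kernel * np.hamming(numtaps)


def filter_zero_phase(x, kernel):
    """Centered convolution; samples near the edges are affected by zero padding."""
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="same")


# Example: a 5-Hz component survives, while DC and a 30-Hz component are removed.
fs = 100.0
t = np.arange(2000) / fs
x = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 30 * t) + 1.0
y = filter_zero_phase(x, bandpass_fir(fs, 2.0, 8.0))
```
]]></preformat>
<p>In practice the first and last half-kernel lengths of the output should be discarded or the signal padded, because the centered convolution assumes zeros outside the recording.</p>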
<p>Speech envelopes were extracted as the primary feature to calculate the cortical tracking ability (e.g., <xref ref-type="bibr" rid="B44">O&#x2019;Sullivan et al., 2015</xref>; <xref ref-type="bibr" rid="B13">Crosse et al., 2016</xref>; <xref ref-type="bibr" rid="B14">Das et al., 2020</xref>). This study further investigated the effects of RMS-level&#x2013;based segmentation on the phase-locking performance between cortical responses and speech envelopes at low frequencies. First, speech signals were divided into the higher- and lower-RMS-level&#x2013;based segments on the basis of the threshold of &#x2013;10 dB relative to the overall RMS level of the whole utterance. The detailed segmentation procedures are described in <xref ref-type="bibr" rid="B33">Kates and Arehart (2005)</xref> and <xref ref-type="bibr" rid="B56">Wang (2021)</xref>. <xref ref-type="fig" rid="F2">Figure 2A</xref> shows the RMS level of a continuous utterance and higher- and lower-RMS-level segments within this sentence. This segmentation threshold (i.e., &#x2013;10 dB relative to the RMS level of the whole sentence) was determined according to the distribution of perceptual information in different RMS-level&#x2013;based speech segments, which was originally proposed in <xref ref-type="bibr" rid="B33">Kates and Arehart (2005)</xref> and extensively studied in many behavioral and electrophysiological experiments (e.g., <xref ref-type="bibr" rid="B33">Kates and Arehart, 2005</xref>; <xref ref-type="bibr" rid="B6">Chen and Loizou, 2011</xref>, <xref ref-type="bibr" rid="B7">2012</xref>; <xref ref-type="bibr" rid="B8">Chen and Wong, 2013</xref>; <xref ref-type="bibr" rid="B57">Wang et al., 2019</xref>, <xref ref-type="bibr" rid="B58">2020a</xref>,<xref ref-type="bibr" rid="B59">b</xref>; <xref ref-type="bibr" rid="B56">Wang, 2021</xref>). 
Previous studies have found that higher-RMS-level speech segments mainly contained the vowels and transitions between vowels and consonants, whereas lower-RMS-level speech segments carried the weak consonants and silent gaps of the continuous utterance (<xref ref-type="bibr" rid="B6">Chen and Loizou, 2011</xref>, <xref ref-type="bibr" rid="B7">2012</xref>; <xref ref-type="bibr" rid="B8">Chen and Wong, 2013</xref>). In Mandarin sentences, most voiced parts of the whole sentence were in higher-RMS-level speech segments, which contained the vital speech intelligibility information (<xref ref-type="bibr" rid="B6">Chen and Loizou, 2011</xref>; <xref ref-type="bibr" rid="B59">Wang et al., 2020b</xref>). Some syllabic onsets and the silences of the continuous Mandarin sentences were primarily contained in lower-RMS-level speech segments, which carried the dynamic temporal structure of target speech in noisy conditions (<xref ref-type="bibr" rid="B21">Fogerty and Kewley-Port, 2009</xref>; <xref ref-type="bibr" rid="B30">Hamilton et al., 2018</xref>). Subsequently, speech envelopes were calculated using the Hilbert transform in higher- and lower-RMS-level speech segments, respectively. Because the envelope onsets made great contributions to the neural-speech tracking performance (e.g., <xref ref-type="bibr" rid="B30">Hamilton et al., 2018</xref>), speech envelopes were then half-wave rectified and the first-order derivative was calculated to extract the increased envelope fluctuations (i.e., the positive derivative values). Then, speech envelopes were resampled to the EEG sampling rate (i.e., 500 Hz) and band-pass filtered from 2 to 8 Hz using a zero-phase FIR filter. To reduce the processing time, the processed EEG and speech signals were then downsampled to a sampling rate of 100 Hz.</p>
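<p>The segmentation rule and the onset-emphasizing step can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the code used in the study: the function names (rms_level_segments, onset_envelope) and the 16-ms frame length are assumptions, while the threshold of 10 dB below the overall RMS level and the use of positive first-order derivative values follow the description in the text.</p>
<preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def rms_level_segments(signal, fs, frame_ms=16.0, threshold_db=-10.0):
    """Label each frame as higher- (True) or lower-RMS-level (False).

    Returns (is_higher, rel_db), where rel_db is the frame RMS level in dB
    relative to the RMS level of the whole signal. frame_ms is an assumed
    analysis frame length.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = max(1, int(round(fs * frame_ms / 1000.0)))
    n_frames = len(signal) // frame_len
    # Drop the incomplete trailing frame and reshape to (frames, samples).
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    tiny = np.finfo(float).tiny  # avoids log10(0) for silent frames
    overall_rms = np.sqrt(np.mean(signal ** 2)) + tiny
    frame_rms = np.sqrt(np.mean(frames ** 2, axis=1)) + tiny
    rel_db = 20.0 * np.log10(frame_rms / overall_rms)
    return rel_db >= threshold_db, rel_db


def onset_envelope(envelope):
    """Keep only rising parts of an envelope (positive first differences)."""
    envelope = np.asarray(envelope, dtype=float)
    diff = np.diff(envelope, prepend=envelope[0])
    return np.maximum(diff, 0.0)


# Toy example: a loud half followed by a much quieter half.
fs = 1000
toy = np.concatenate([np.ones(800), 0.01 * np.ones(800)])
is_higher, rel_db = rms_level_segments(toy, fs)
```
]]></preformat>
<p>In this toy signal the loud frames lie about 3 dB above the overall RMS level and the quiet frames about 37 dB below it, so the two halves fall on opposite sides of the threshold.</p>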
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p><bold>(A)</bold> The root mean square (RMS) level of a short fragment in the continuous speech stimulus. The dashed line indicates the threshold (&#x2013;10 dB relative RMS level) to classify higher- and lower-RMS-level speech segments within the continuous utterance. The right figure shows the higher- and lower-RMS-level segments in the temporal series of a sentence. <bold>(B)</bold> The TRF responses calculated between speech envelopes and the corresponding EEG signals in 6&#x2013;, 0&#x2013;, and &#x2013;6-dB MFR conditions. The TRF responses between higher-RMS-level speech segments and EEG signals are displayed with the black solid line in the first half (i.e., before attention switching) and with the black dashed line in the latter half (i.e., after attention switching) of the whole trial. The right solid line in the first half (i.e., before attention switching) and the right dashed line in the latter half (i.e., after attention switching) of the whole trial represent the TRF responses derived from lower-RMS-level speech segments and the corresponding EEG signals. <bold>(C)</bold> The amplitudes of the three typical TRF components with the lower-RMS-level segments (right lines) and higher-RMS-level segments (black lines) in &#x2013;6 dB (dotted lines with the rectangular sign), 0 dB (dashed lines with the square sign), and 6 dB (solid lines with the circle sign) before and after the switching of the auditory attention from the male speaker (the first half) to the female speaker (the latter half).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-760611-g002.tif"/>
</fig>
</sec>
<sec id="S2.SS3.SSS2">
<title>Forward Temporal Response Function Models and Neural Response Predictions</title>
<p>The relationships between speech envelopes and the corresponding EEG activities were analyzed with the linear TRF model using the mTRF toolbox (<xref ref-type="bibr" rid="B13">Crosse et al., 2016</xref>). The forward TRF was used to map the cortical responses elicited by the continuous speech stimuli. In this study, how cortical activity encoded different segments in the target speech (i.e., higher- and lower-RMS-level speech segments) and attentional switching (i.e., attention switching from one speaker to the other) was analyzed through TRF responses under various MFR conditions (i.e., 6, 0, and &#x2013;6 dB). Specifically, the linear transformation of the stimulus envelopes <italic>S(t)</italic> to the corresponding cortical responses <italic>R(t)</italic> can be represented by the linear regression model TRF, as</p>
<disp-formula id="S2.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mo>&#x002A;</mml:mo>
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where &#x002A; indicates the convolution operator. The TRF can be calculated as</p>
<disp-formula id="S2.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
</mml:msup>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x03BB;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mtext>I</mml:mtext>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2062;</mml:mo>
<mml:msup>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
</mml:msup>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>and the ridge regression is used to prevent overfitting, where <italic>I</italic> is the identity matrix and &#x03BB; represents the ridge parameter. The ridge parameter was determined by the minimum mean-square error between the predicted and original neural signals using leave-one-out cross-validation. The weights in the TRF model indicate the neural responses relative to the auditory stimulus onsets, and time lags between &#x2013;100 and 800 ms were used in this work to show the TRF responses under different experimental conditions. The processing steps followed previous studies (e.g., <xref ref-type="bibr" rid="B17">Di Liberto et al., 2015</xref>; <xref ref-type="bibr" rid="B59">Wang et al., 2020b</xref>), and detailed descriptions can be found in <xref ref-type="bibr" rid="B13">Crosse et al. (2016)</xref>. The TRF components show response patterns similar to those in ERPs at specific time lags (e.g., <xref ref-type="bibr" rid="B37">Lalor et al., 2009</xref>; <xref ref-type="bibr" rid="B36">Kong et al., 2014</xref>; <xref ref-type="bibr" rid="B17">Di Liberto et al., 2015</xref>). The TRF weights reflect the strength of the linear relationship between the speech envelope and the corresponding neural response. The TRF polarity represents the relationship between the direction of the cortical current and the direction of the speech envelope fluctuations (<xref ref-type="bibr" rid="B19">Ding and Simon, 2012b</xref>). In this study, the TRF weights averaged across all electrodes were statistically analyzed in three typical components, i.e., the first positive component (80&#x223C;150 ms), the first negative component (170&#x223C;240 ms), and the second positive component (250&#x223C;350 ms), for higher- and lower-RMS-level speech segments before and after the attention switching between two speaker streams in the 6&#x2013;, 0&#x2013;, and &#x2013;6-dB MFR conditions.</p>
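<p>As an illustration of Eq. 2, the following minimal sketch estimates single-channel TRF weights by ridge regression on a time-lagged stimulus matrix. It is not the mTRF toolbox implementation: the function names (<italic>lagged_matrix</italic>, <italic>solve</italic>, <italic>ridge_trf</italic>), the synthetic envelope, the lags expressed in samples, and the two ridge values are all assumptions made for the example.</p>
<preformat>
```python
import random


def lagged_matrix(stimulus, lags):
    """Row t holds the stimulus at samples t - lag for each lag (zero outside the recording)."""
    n = len(stimulus)
    return [[stimulus[t - lag] if (t - lag >= 0 and n > t - lag) else 0.0
             for lag in lags] for t in range(n)]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_trf(stimulus, response, lags, lam):
    """Single-channel TRF weights following Eq. 2: (S^T S + lam I)^-1 S^T R."""
    s = lagged_matrix(stimulus, lags)
    k = len(lags)
    sts = [[sum(row[i] * row[j] for row in s) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    stx = [sum(row[i] * r for row, r in zip(s, response)) for i in range(k)]
    return solve(sts, stx)


# Illustrative data only: a synthetic envelope and a noiseless response built
# from known weights (assumptions for this sketch, not values from the study).
rng = random.Random(0)
envelope = [rng.gauss(0.0, 1.0) for _ in range(200)]
true_weights = [0.5, -0.25, 0.1]
demo_lags = [0, 1, 2]  # in samples; the sampling rate is left unspecified
response = [sum(w * x for w, x in zip(true_weights, row))
            for row in lagged_matrix(envelope, demo_lags)]
unregularized = ridge_trf(envelope, response, demo_lags, 0.0)
regularized = ridge_trf(envelope, response, demo_lags, 1000.0)
```
</preformat>
<p>With a ridge parameter of zero the sketch recovers the generating weights of the noiseless toy response, whereas a large ridge parameter shrinks the weights toward zero, which is the trade-off that the cross-validation step is meant to balance.</p>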
</sec>
<sec id="S2.SS3.SSS3">
<title>Higher- and Lower-Root Mean Square-Level Speech Segments Classification</title>
<p>Higher- and lower-RMS-level segments of the target speech streams can be classified with the corresponding EEG signals, according to the different neural response patterns to these speech segments in clean and noisy environments (e.g., <xref ref-type="bibr" rid="B57">Wang et al., 2019</xref>, <xref ref-type="bibr" rid="B58">2020a</xref>). A subject-specific support vector machine (SVM) classifier was used to classify higher- and lower-RMS-level speech segments on the basis of the cross-correlations between speech envelopes and neural responses. In the training procedure, binary speech labels were generated to represent higher- and lower-RMS-level segments of the clean target speech. Then, the feature vector of each channel was composed of the maximum cross-correlation values between the EEG signals and the relevant speech envelopes in each short frame. Specifically, the EEG signals and speech envelopes were divided into 400-ms short frames with a 20% overlapping ratio because the cortical activity mainly responded to the auditory stimulus in the time lag interval from 0 to 400 ms, as shown in <xref ref-type="fig" rid="F2">Figure 2B</xref> and in the related results of previous studies (e.g., <xref ref-type="bibr" rid="B59">Wang et al., 2020b</xref>; <xref ref-type="bibr" rid="B56">Wang, 2021</xref>). For each subject, the SVM classifier with a Gaussian radial kernel function was trained to predict higher- and lower-RMS-level segments of the target speech stream on the basis of the corresponding EEG signals using a leave-one-out cross-validation approach. During the testing phase, the analyzed features were derived from the maximum cross-correlation coefficients between the EEG signals and the auditory envelopes from mixed speech sources. The trained SVM model and the calculated feature vectors were used to predict higher- and lower-RMS-level segments within the continuous auditory stimuli.
The classification accuracies were calculated as the percentage of correctly identified labels relative to the labels of the target speech source before and after the attentional shifts under different SMR conditions. The SVM classification was computed with the functions in the Statistics and Machine Learning Toolbox Release 2017b of MATLAB (MathWorks Inc., United States).</p>
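<p>The frame-wise feature extraction described above can be sketched as follows. This is a simplified, single-channel illustration rather than the MATLAB pipeline used in the study: the sampling rate of 100 Hz, the maximum lag of five samples, the synthetic signals, and all function names are assumptions introduced for the example, and the SVM training step itself is not shown.</p>
<preformat>
```python
import math
import random


def frames(signal, frame_len, overlap):
    """Split a sequence into frames of frame_len samples with a fractional overlap."""
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def max_cross_correlation(eeg, envelope, max_lag):
    """Largest correlation between the envelope and the EEG delayed by 0..max_lag samples."""
    best = -1.0
    for lag in range(max_lag + 1):
        n = len(envelope) - lag
        if n >= 2:
            best = max(best, pearson(eeg[lag:], envelope[:n]))
    return best


def frame_features(eeg, envelope, frame_len, overlap, max_lag):
    """One maximum cross-correlation value per frame for a single EEG channel."""
    return [max_cross_correlation(e, s, max_lag)
            for e, s in zip(frames(eeg, frame_len, overlap),
                            frames(envelope, frame_len, overlap))]


# Illustrative setup (assumed values): 100-Hz sampling, 400-ms frames, 20% overlap.
fs = 100
frame_len = fs * 400 // 1000
rng = random.Random(1)
demo_envelope = [rng.random() for _ in range(200)]
delay = 3  # the toy "EEG" is the envelope delayed by three samples
demo_eeg = [0.0] * delay + demo_envelope[:-delay]
features = frame_features(demo_eeg, demo_envelope, frame_len, 0.2, max_lag=5)
```
</preformat>
<p>In this toy case every frame yields a feature close to 1 because the simulated response is an exact delayed copy of the envelope; with real EEG the values would be much smaller, and one such feature per channel and frame would form the input to the classifier.</p>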
</sec>
<sec id="S2.SS3.SSS4">
<title>Backward Temporal Response Function Methods and Speech Reconstruction</title>
<p>Backward linear TRF models have been widely used to decode auditory attention in competing-speaker environments. The envelope of the target speech (i.e., the male speaker stream in the first half and the female speaker stream in the latter half) was reconstructed from the spatiotemporal filters <italic>g</italic>(&#x03C4;,<italic>n</italic>) and the EEG responses <italic>r</italic>(<italic>t</italic>,<italic>n</italic>) at each electrode channel <italic>n</italic> over a range of time lags &#x03C4;. The reconstructed speech envelope <inline-formula><mml:math id="INEQ3"><mml:mrow><mml:mover accent="true"><mml:mi>s</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> can be calculated in discrete time as</p>
<disp-formula id="S2.E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>s</mml:mi>
<mml:mo stretchy="false">^</mml:mo>
</mml:mover>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mi>n</mml:mi>
</mml:munder>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mi mathvariant="normal">&#x03C4;</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">&#x03C4;</mml:mi>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>g</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="normal">&#x03C4;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The linear mapping function <italic>g</italic>(&#x03C4;,<italic>n</italic>) was estimated by ridge regression to avoid overfitting and ill-posed problems, and the detailed procedure of ridge regression followed previous studies (e.g., <xref ref-type="bibr" rid="B13">Crosse et al., 2016</xref>). The leave-one-out cross-validation approach was implemented to optimize the regularization parameter across subjects and conditions. Regularization parameters of 2<sup>0</sup>, 2<sup>2</sup>, &#x2026;, 2<sup>12</sup> were each used to reconstruct the auditory stimulus. The optimal regularization parameter was determined as 2<sup>6</sup> because this value yielded the highest averaged correlation coefficient between the actual and reconstructed speech envelopes across the training trials. The range of time lags was consistent with that containing the major responses in the forward TRF, i.e., from 0 to 400 ms post-stimulus in this study.</p>
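<p>A minimal sketch of the reconstruction in Eq. 3 and of the regularization grid is given below. It assumes that a decoder has already been trained; the function names, the toy one-channel data, and the correlation scores passed to the selection helper are hypothetical and serve only to make the indexing explicit.</p>
<preformat>
```python
def reconstruct_envelope(eeg, decoder, lags):
    """Apply Eq. 3: sum over channels n and lags tau of r(t + tau, n) * g(tau, n).

    eeg[n][t] is channel n at sample t and decoder[n][k] is the weight for lags[k].
    Lags are non-negative sample offsets; samples beyond the recording count as zero.
    """
    n_samples = len(eeg[0])
    reconstruction = []
    for t in range(n_samples):
        total = 0.0
        for channel, weights in zip(eeg, decoder):
            for lag, g in zip(lags, weights):
                if n_samples > t + lag:
                    total += channel[t + lag] * g
        reconstruction.append(total)
    return reconstruction


def pick_lambda(mean_corr_by_lambda):
    """Return the regularization value with the highest mean reconstruction correlation."""
    return max(mean_corr_by_lambda, key=mean_corr_by_lambda.get)


# Regularization grid 2^0, 2^2, ..., 2^12 as described in the text.
lambda_grid = [2.0 ** k for k in range(0, 13, 2)]

# Toy example (assumed data): one channel that equals the envelope delayed by
# two samples, and a decoder that puts all of its weight on the two-sample lag.
demo_envelope = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8]
demo_eeg = [[0.0, 0.0] + demo_envelope[:-2]]
demo_lags = [0, 1, 2]
demo_decoder = [[0.0, 0.0, 1.0]]
demo_reconstruction = reconstruct_envelope(demo_eeg, demo_decoder, demo_lags)

# Hypothetical cross-validation scores, used only to exercise pick_lambda.
demo_scores = {1.0: 0.05, 64.0: 0.12, 4096.0: 0.08}
demo_best = pick_lambda(demo_scores)
```
</preformat>
<p>Because the EEG lags the stimulus, the decoder reads EEG samples at <italic>t</italic> + &#x03C4; to estimate the envelope at <italic>t</italic>; in the toy example the first four envelope samples are recovered exactly and the last two are zero because the required EEG samples lie beyond the end of the recording.</p>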
<p>After the leave-one-out cross-validation, the unified decoding model (<italic>D</italic><sub><italic>unified</italic></sub>) was used to predict the speech envelopes before and after attentional switching under different MFR conditions in the testing procedure. On the basis of the different effects of higher- and lower-RMS-level speech segments on cortical envelope tracking of target speech streams, a segmented linear decoding model (<italic>D</italic><sub><italic>segmented</italic></sub>) was proposed to reconstruct speech envelopes separately in higher- and lower-RMS-level segments (<xref ref-type="bibr" rid="B56">Wang, 2021</xref>). The decoder model of higher-RMS-level speech segments was generated from the EEG signals and auditory stimulus that only included higher-RMS-level segments. Similarly, lower-RMS-level speech segments and the corresponding EEG signals were used to train the specific model to decode lower-RMS-level speech segments. The training and validation procedures of these two decoders were the same as those used in <italic>D</italic><sub><italic>unified</italic></sub>. In the testing procedure, the previously trained SVM classifier was used to predict higher- and lower-RMS-level speech segments on the basis of the mixed speech and EEG responses. The speech envelopes were then reconstructed by the segmented decoders according to the boundaries of higher- and lower-RMS-level speech segments. Finally, the reconstructed speech envelopes using <italic>D</italic><sub><italic>segmented</italic></sub> were generated by concatenating the predicted envelopes from the different decoders. Subsequently, the AAD performance was determined by comparing the correlation coefficients between the reconstructed speech envelopes and the original envelopes of the target speech streams (<italic>r</italic><sub><italic>tar</italic></sub>) or the ignored speech streams (<italic>r</italic><sub><italic>ign</italic></sub>).</p>
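<p>The stitching and decision steps of the segmented approach can be sketched as below. This is a simplified, sample-wise illustration under stated assumptions: the two decoder outputs and the predicted segment labels are taken as given, the helper names are hypothetical, and the toy envelopes are not data from the study.</p>
<preformat>
```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def stitch_segments(high_pred, low_pred, is_high):
    """Take each sample from the higher-RMS decoder where is_high is True,
    otherwise from the lower-RMS decoder."""
    return [h if flag else l for h, l, flag in zip(high_pred, low_pred, is_high)]


def attention_decision(reconstructed, target_env, ignored_env):
    """Return (correct, r_tar, r_ign); a trial counts as correct when r_tar > r_ign."""
    r_tar = pearson(reconstructed, target_env)
    r_ign = pearson(reconstructed, ignored_env)
    return r_tar > r_ign, r_tar, r_ign


# Toy envelopes and decoder outputs (assumed values for illustration only).
target = [0.1, 0.8, 0.3, 0.9, 0.2, 0.7]
ignored = [0.6, 0.2, 0.9, 0.1, 0.5, 0.3]
predicted_is_high = [True, True, False, False, True, False]
high_decoder_output = [0.1, 0.8, 0.0, 0.0, 0.2, 0.0]
low_decoder_output = [0.0, 0.0, 0.3, 0.9, 0.0, 0.7]

stitched = stitch_segments(high_decoder_output, low_decoder_output, predicted_is_high)
correct, r_tar, r_ign = attention_decision(stitched, target, ignored)
```
</preformat>
<p>In practice the segment labels come from the EEG-based classifier, so labeling errors propagate into the stitched envelope; this dependence is one reason the classification accuracy is reported separately.</p>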
</sec>
<sec id="S2.SS3.SSS5">
<title>Performance of Auditory Attention Decoding</title>
<p>AAD accuracy was computed as the percentage of correctly identified trials (i.e., <italic>r</italic><sub><italic>tar</italic></sub> &#x003E; <italic>r</italic><sub><italic>ign</italic></sub>) in each condition. The AAD accuracies derived from <italic>D</italic><sub><italic>segmented</italic></sub> and <italic>D</italic><sub><italic>unified</italic></sub> were analyzed to show the effect of the attention switching between speakers under different MFR levels. The AAD accuracy could serve as an indicator of dynamic changes in the auditory attention state. In addition, to further test the sensitivity and reliability of the AAD systems, AAD accuracies were calculated with short to long decision window lengths (i.e., 1, 2, 5, 20, and 30 s) in different conditions. The Wolpaw information transfer rate (ITR) was used to assess the transmitted bits per time unit (<xref ref-type="bibr" rid="B61">Wolpaw and Ramoser, 1998</xref>). It is a metric that jointly evaluates the decoding accuracy and the decision time length of the AAD systems under different conditions. In this study, ITR was expressed in bits per minute for five different decision window lengths &#x03C4; (1, 2, 5, 20, and 30 s) with the AAD accuracy <italic>p</italic> of the classification task. ITR was calculated as</p>
<disp-formula id="S2.E4">
<label>(4)</label>
<mml:math id="M4">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi mathvariant="normal">&#x03C4;</mml:mi>
</mml:mfrac>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>log</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>log</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The effects of different decoding models, attention switching, and different MFR conditions on the ITR values were further statistically analyzed with the non-parametric Kruskal&#x2013;Wallis test.</p>
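<p>For reference, Eq. 4 and the accuracy definition can be written as a short sketch. The conversion to bits per minute by a factor of 60 assumes that the window length is given in seconds, and the function names and example values are illustrative rather than taken from the analysis code of this study.</p>
<preformat>
```python
import math


def aad_accuracy(correlation_pairs):
    """Percentage of trials in which r_tar exceeds r_ign.

    correlation_pairs is a list of (r_tar, r_ign) tuples, one per decision window.
    """
    hits = sum(1 for r_tar, r_ign in correlation_pairs if r_tar > r_ign)
    return 100.0 * hits / len(correlation_pairs)


def wolpaw_itr_two_class(p, window_s):
    """Two-class Wolpaw ITR in bits per minute (Eq. 4).

    p is the accuracy as a proportion in [0, 1]; window_s is the decision
    window length in seconds (assumed unit, hence the factor of 60).
    """
    def plogp(x):
        # Convention: 0 * log2(0) is taken as 0.
        return x * math.log2(x) if x > 0 else 0.0

    bits_per_decision = 1.0 + plogp(p) + plogp(1.0 - p)
    return bits_per_decision * 60.0 / window_s


# Illustrative values only.
demo_pairs = [(0.12, 0.03), (0.08, 0.10), (0.15, 0.02), (0.09, 0.01)]
demo_accuracy = aad_accuracy(demo_pairs)           # 3 of 4 trials correct
demo_itr = wolpaw_itr_two_class(demo_accuracy / 100.0, 30.0)
```
</preformat>
<p>The formula makes the trade-off explicit: chance-level accuracy (<italic>p</italic> = 0.5) carries no information regardless of window length, whereas for a fixed accuracy a shorter decision window yields a proportionally higher ITR.</p>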
</sec>
</sec>
</sec>
<sec id="S3" sec-type="results">
<title>Results</title>
<sec id="S3.SS1">
<title>Temporal Response Function Responses and Neural Encoding Performance</title>
<p>Repeated measures analysis of variance (ANOVA) was used to analyze the effects of auditory attentional switching, RMS-level&#x2013;based speech segments, and the different SMR levels on TRF responses. Analyses of the magnitude of TRF responses in typical components were conducted with a 2 (attentional states: before vs. after attention switching) &#x00D7; 2 (speech feature: higher- vs. lower-RMS-level segments) &#x00D7; 3 (MFR level: &#x2013;6 dB vs. 0 dB vs. 6 dB) within-subject repeated measures ANOVA. The Greenhouse&#x2013;Geisser correction was applied to adjust the degrees of freedom when sphericity was violated, and the <italic>post hoc</italic> analysis was implemented with the Bonferroni correction to adjust <italic>P</italic>-values for multiple comparisons. Compared to the ignored speech stream, the target speech stream could elicit reliable and typical TRF components under various SMR conditions (e.g., <xref ref-type="bibr" rid="B36">Kong et al., 2014</xref>; <xref ref-type="bibr" rid="B44">O&#x2019;Sullivan et al., 2015</xref>). Many studies also indicated that the TRF response obtained from the target speech streams contained biomarkers that could estimate the switching of the auditory attention states (e.g., <xref ref-type="bibr" rid="B2">Akram et al., 2016</xref>; <xref ref-type="bibr" rid="B42">Miran et al., 2020</xref>). Hence, this study showed and analyzed the typical TRF components elicited by the target speech streams in different conditions (see <xref ref-type="fig" rid="F2">Figure 2B</xref>). TRF weights were statistically analyzed across three typical components within a specific window across all scalp electrodes (see <xref ref-type="fig" rid="F2">Figure 2C</xref>).</p>
<p>For the first positive deflection, the average amplitude of the TRF weight was calculated from 80 to 150-ms time lags. ANOVA results revealed main effects of RMS-level&#x2013;based segments [<italic>F</italic><sub>(1, 15)</sub> = 16.77, <italic>P</italic> = 0.01, &#x03B7;<inline-formula><mml:math id="INEQ6"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.53] and attention switching [<italic>F</italic><sub>(1, 15)</sub> = 22.43, <italic>P</italic> &#x003C; 0.001, &#x03B7;<inline-formula><mml:math id="INEQ7"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.60], with a significant interaction between these two factors [<italic>F</italic><sub>(1, 15)</sub> = 14.25, <italic>P</italic> = 0.002, &#x03B7;<inline-formula><mml:math id="INEQ8"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.49]. These results suggested that the first positive components of the TRF response were larger with lower-RMS-level speech segments than with higher-RMS-level speech segments, and that the TRF amplitudes in the first positive deflection decreased after the switching of the auditory attention from one speaker stream to the other. There was no significant three-way interaction of speech segments, attention switching, and MFR levels [<italic>F</italic><sub>(2, 14)</sub> = 0.58, <italic>P</italic> = 0.57, &#x03B7;<inline-formula><mml:math id="INEQ9"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.08].
Neither the speech segment by MFR level interaction [<italic>F</italic><sub>(2, 30)</sub> = 3.00, <italic>P</italic> = 0.08, &#x03B7;<inline-formula><mml:math id="INEQ10"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.30] nor the attention switching by MFR level interaction [<italic>F</italic><sub>(2, 30)</sub> = 0.80, <italic>P</italic> = 0.47, &#x03B7;<inline-formula><mml:math id="INEQ11"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.10] had a significant effect on the amplitude of the first positive deflection.</p>
<p>For the second positive deflection, the analysis window was set between 250 and 350 ms to compute the average amplitudes. The ANOVA results showed main effects of RMS-level&#x2013;based segments [<italic>F</italic><sub>(1, 15)</sub> = 12.41, <italic>P</italic> = 0.003, &#x03B7;<inline-formula><mml:math id="INEQ12"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.45] and MFR levels [<italic>F</italic><sub>(2, 30)</sub> = 10.29, <italic>P</italic> = 0.002, &#x03B7;<inline-formula><mml:math id="INEQ13"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.60], indicating that the TRF amplitude of the second positive component was significantly larger with the lower-RMS-level segments than with higher-RMS-level segments, and that this TRF weight was reduced as the MFR level decreased. There was no main effect of attentional switching [<italic>F</italic><sub>(1, 15)</sub> = 0.97, <italic>P</italic> = 0.34, &#x03B7;<inline-formula><mml:math id="INEQ14"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.06], suggesting that the TRF response around the 300-ms time lag was not significantly affected by the switching of attention in the competing speaker auditory scenes. No significant interactions were found among RMS-level&#x2013;based speech segments, attention switching, and MFR level (all <italic>P</italic> &#x003E; 0.05).</p>
<p>For the first negative deflection, the average TRF weight was computed within 170&#x223C;240 ms. The only significant main effect was that of the RMS-level&#x2013;based speech segments [<italic>F</italic><sub>(1, 15)</sub> = 13.79, <italic>P</italic> = 0.002, &#x03B7;<inline-formula><mml:math id="INEQ15"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.48], showing larger TRF responses in lower-RMS-level speech segments than in higher-RMS-level speech segments. The attentional switching and MFR levels showed no main effects on the TRF amplitude of the first negative component (all <italic>P</italic> &#x003E; 0.05). There were no significant three-way or two-way interactions of the three factors, i.e., RMS-level&#x2013;based speech segments, attention switching, and MFR level (all <italic>P</italic> &#x003E; 0.05).</p>
</sec>
<sec id="S3.SS2">
<title>Classification of Higher- and Lower-Root Mean Square-Level Speech Segments</title>
<p>On the basis of the different neural patterns for higher- and lower-RMS-level speech segments of the target speech perception under noisy environments, the current study utilized the corresponding cortical responses to predict the higher- and lower-RMS-level speech segments of the auditory speech stimuli. <xref ref-type="fig" rid="F2">Figure 2A</xref> displays the RMS level of a whole sentence, and the dashed line indicates the RMS threshold used to determine higher- and lower-RMS-level segments. Averaged across all sentences used in this experiment, higher- and lower-RMS-level segments accounted for 51.22 and 48.78% of the duration of the whole utterances, respectively, which was consistent with the previous findings that the higher- and lower-RMS-level segments had similar durations within continuous sentences (<xref ref-type="bibr" rid="B6">Chen and Loizou, 2011</xref>; <xref ref-type="bibr" rid="B56">Wang, 2021</xref>). The higher-RMS-level speech segments comprised 57.81, 69.43, and 59.66% of the duration of the mixed speech under the 6&#x2013;, 0&#x2013;, and &#x2013;6-dB MFR conditions, respectively. The classification results for higher- and lower-RMS-level speech segments were calculated on the short time frames using the trained SVM classifier. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the classification accuracy and F1-score of higher- and lower-RMS-level speech segments before and after the attentional switching from the male to the female speaker stream under different MFR levels. The effects of attention switching and MFR level on the SVM classification results were examined with the non-parametric Kruskal&#x2013;Wallis test. There were significant effects of attention switching and MFR level on the classification accuracy of different speech segments (all <italic>P</italic> &#x003C; 0.001).
Specifically, the classification accuracy was decreased after the switching of the auditory attention from the male speaker to the female speaker with the 6-dB MFR (the first half: mean = 82.50, standard error = 0.46; the latter half: mean = 72.47, standard error = 0.46), the 0-dB MFR (the first half: mean = 81.73, standard error = 1.10; the latter half: mean = 78.13, standard error = 0.48), and the &#x2013;6-dB MFR (the first half: mean = 79.37, standard error = 0.36; the latter half: mean = 74.73, standard error = 0.50). These results indicated that the classification accuracy was significantly affected by the auditory attentional switching with a wide range of MFR conditions (i.e., from 6 to &#x2013;6 dB). The F1-scores in the first 30 s were higher than those in the latter half with the effect of attention switching under the 6-dB MFR (the first half: mean = 86.34, standard error = 0.34; the latter half: mean = 80.34, standard error = 0.43) and the 0-dB MFR (the first half: mean = 87.19, standard error = 0.72; the latter half: mean = 81.68, standard error = 0.43). No significant differences of the F1-score were shown before and after the attention switching between two speaker streams under the &#x2013;6-dB MFR [(&#x03C7;<sup>2</sup> = 1.20, <italic>P</italic> = 0.27); the first half: mean = 85.46, standard error = 0.29; the latter half: mean = 84.87, standard error = 0.36]. Both classification accuracy and F1-score were reduced with the decreased SMR levels in the first half and the latter half (all <italic>P</italic> &#x003C; 0.01), suggesting that the relative SMR level was a critical factor to influence the classification performance of higher- and lower-RMS-level speech segments from the EEG signals.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p><bold>(A)</bold> The accuracies for the classification of higher- and lower-RMS-level speech segments using the SVM classifier under 6-dB (black signs), 0-dB (dark gray signs), and &#x2013;6-dB (light gray signs) MFR conditions before and after the switching of the auditory attention at the midpoint of the 60-s trials. <bold>(B)</bold> The F1-scores for the classification of higher- and lower-RMS-level speech segments under 6-dB (black signs), 0-dB (dark gray signs), and &#x2013;6-dB (light gray signs) MFR conditions in the first half and the latter half. The bars represent the max and min values of each condition, &#x002A;&#x002A;&#x002A; indicates <italic>P</italic> &#x003C; 0.001, &#x002A;&#x002A; indicates <italic>P</italic> &#x003C; 0.01, and n.s. indicates <italic>P</italic> &#x003E; 0.05.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-760611-g003.tif"/>
</fig>
</sec>
<sec id="S3.SS3">
<title>Auditory Attention Decoding Performance</title>
<sec id="S3.SS3.SSS1">
<title>Correlation Coefficients Between Actual and Predicted Speech Envelopes</title>
<p><xref ref-type="fig" rid="F4">Figure 4A</xref> shows the correlation coefficients between the reconstructed speech envelopes and the original envelopes of the target or ignored speech before and after the attention switching under the 6&#x2013;, 0&#x2013;, and &#x2013;6-dB MFR conditions using <italic>D</italic><sub><italic>unified</italic></sub> and <italic>D</italic><sub><italic>segmented</italic></sub>, respectively. The decoding window length of 30 s was used to calculate the <italic>r</italic><sub><italic>tar</italic></sub> and <italic>r</italic><sub><italic>ign</italic></sub> values in <xref ref-type="fig" rid="F4">Figure 4A</xref>, and the relative value of <italic>r</italic><sub><italic>tar</italic></sub> and <italic>r</italic><sub><italic>ign</italic></sub> was the basis for determining the attentional direction in the competing speaker scenes. The ANOVA showed a main effect of the type of reconstructed speech stream, with <italic>r</italic><sub><italic>tar</italic></sub> significantly larger than <italic>r</italic><sub><italic>ign</italic></sub> [<italic>F</italic><sub>(1, 15)</sub> = 93.35, <italic>P</italic> &#x003C; 0.001, &#x03B7;<inline-formula><mml:math id="INEQ17"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.86] under all experimental conditions in this study. A three-way ANOVA was also performed to test the effects of decoding model, MFR level, and attentional switching on <italic>r</italic><sub><italic>tar</italic></sub> and <italic>r</italic><sub><italic>ign</italic></sub> values, respectively. Neither the three-way interaction nor the decoding model by MFR level interaction was significant for <italic>r</italic><sub><italic>tar</italic></sub> or <italic>r</italic><sub><italic>ign</italic></sub> values (all <italic>P</italic> &#x003E; 0.05).
A significant interaction was shown between MFR level and attention switching for the value of <italic>r</italic><sub><italic>tar</italic></sub> [<italic>F</italic><sub>(2, 30)</sub> = 5.19, <italic>P</italic> = 0.01, &#x03B7;<inline-formula><mml:math id="INEQ18"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.26] and <italic>r</italic><sub><italic>ign</italic></sub> [<italic>F</italic><sub>(2, 30)</sub> = 28.01, <italic>P</italic> &#x003C; 0.001, &#x03B7;<inline-formula><mml:math id="INEQ19"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.65]. The attention switching exhibited a main effect on the value of <italic>r</italic><sub><italic>tar</italic></sub> [<italic>F</italic><sub>(1, 15)</sub> = 43.03, <italic>P</italic> &#x003C; 0.001, &#x03B7;<inline-formula><mml:math id="INEQ20"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.74], but there was no significant main effect of MFR level on the value of <italic>r</italic><sub><italic>tar</italic></sub> [<italic>F</italic><sub>(2, 30)</sub> = 0.70, <italic>P</italic> = 0.50, &#x03B7;<inline-formula><mml:math id="INEQ21"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.05]. For the value of <italic>r</italic><sub><italic>ign</italic></sub>, both attention switching [<italic>F</italic><sub>(1, 15)</sub> = 43.03, <italic>P</italic> &#x003C; 0.001, &#x03B7;<inline-formula><mml:math id="INEQ22"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.74] and MFR level [<italic>F</italic><sub>(2, 30)</sub> = 0.70, <italic>P</italic> = 0.004, &#x03B7;<inline-formula><mml:math id="INEQ23"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.31] showed significant main effects.
<italic>Post hoc</italic> analysis showed that the <italic>r</italic><sub><italic>tar</italic></sub> values in the latter half were significantly smaller than those in the first half in the 6&#x2013;, 0&#x2013;, and &#x2013;6-dB MFR conditions. The changes of the <italic>r</italic><sub><italic>ign</italic></sub> values after attention switching from the male to the female speaker stream depended on the SMR, i.e., there was no significant difference in the 0-dB MFR condition, a decrease of the <italic>r</italic><sub><italic>ign</italic></sub> value when the SMR was reduced (i.e., the 6-dB MFR condition), and an increase of the <italic>r</italic><sub><italic>ign</italic></sub> value when the SMR was increased (i.e., the &#x2013;6-dB MFR condition). These results suggested that the <italic>r</italic><sub><italic>tar</italic></sub> values were robustly modulated by auditory attention, and that the attentional gains controlled the reliable cortical responses to target speech streams regardless of the relative intensity of the competing streams over a wide range of SMR conditions (i.e., 6 to &#x2013;6 dB in this study), whereas the <italic>r</italic><sub><italic>ign</italic></sub> values showed significant effects of the SMR changes with attentional switching.
In addition, the main effect of decoding model was significant for both <italic>r</italic><sub><italic>tar</italic></sub> [<italic>F</italic><sub>(2, 30)</sub> = 5.19, <italic>P</italic> = 0.01, &#x03B7;<inline-formula><mml:math id="INEQ24"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.26] and <italic>r</italic><sub><italic>ign</italic></sub> values [<italic>F</italic><sub>(2, 30)</sub> = 28.01, <italic>P</italic> &#x003C; 0.001, &#x03B7;<inline-formula><mml:math id="INEQ25"><mml:msubsup><mml:mi/><mml:mi>p</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:math></inline-formula> = 0.65], revealing that the RMS-level&#x2013;based <italic>D</italic><sub><italic>segmented</italic></sub> improved the reconstruction performance of speech envelopes relative to <italic>D</italic><sub><italic>unified</italic></sub>.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p><bold>(A)</bold> The correlation coefficients of the reconstructed speech envelopes with the target speech envelope (upper panel) and the ignored speech envelope (lower panel) decoded with the <italic>D</italic><sub><italic>segmented</italic></sub> (red lines) and <italic>D</italic><sub><italic>unified</italic></sub> (black lines) computational models in the 6-dB (solid lines with circle markers), 0-dB (dashed lines with square markers), and &#x2013;6-dB (dotted lines with rectangular markers) MFR conditions in the first half and the latter half. <bold>(B)</bold> The AAD accuracy calculated by <italic>D</italic><sub><italic>segmented</italic></sub> (gray boxes) and <italic>D</italic><sub><italic>unified</italic></sub> (black boxes) with 2&#x2013;, 5&#x2013;, 10&#x2013;, 20&#x2013;, and 30-s decoding window lengths before and after the switching of auditory attention in the 6&#x2013;, 0&#x2013;, and &#x2013;6-dB MFR conditions. The error bars show the standard error in each condition; &#x002A;&#x002A;&#x002A; indicates <italic>P</italic> &#x003C; 0.001, &#x002A;&#x002A; indicates <italic>P</italic> &#x003C; 0.01, &#x002A; indicates <italic>P</italic> &#x003C; 0.05, and n.s. indicates <italic>P</italic> &#x003E; 0.05. <bold>(C)</bold> ITR with the <italic>D</italic><sub><italic>segmented</italic></sub> (red lines) and <italic>D</italic><sub><italic>unified</italic></sub> (black lines) models with 2&#x2013;, 5&#x2013;, 10&#x2013;, 20&#x2013;, and 30-s decoding window lengths before and after the switching of auditory attention in the 6&#x2013;, 0&#x2013;, and &#x2013;6-dB MFR conditions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-15-760611-g004.tif"/>
</fig>
</sec>
<sec id="S3.SS3.SSS2">
<title>Auditory Attention Decoding Accuracy and Sensitivity</title>
<p>To examine the AAD performance of the neuro-steered system with the two decoding algorithms (i.e., <italic>D</italic><sub><italic>unified</italic></sub> and <italic>D</italic><sub><italic>segmented</italic></sub>) before and after attention switched from the male to the female speaker stream, the non-parametric Kruskal&#x2013;Wallis test was used to analyze the AAD accuracy at different decision window lengths (i.e., 2, 5, 10, 20, and 30 s). <xref ref-type="fig" rid="F4">Figure 4B</xref> and <xref ref-type="table" rid="T1">Table 1</xref> show the detailed AAD accuracies under the different experimental conditions. The AAD accuracies using <italic>D</italic><sub><italic>segmented</italic></sub> were significantly higher than those using <italic>D</italic><sub><italic>unified</italic></sub> under all experimental conditions (all <italic>P</italic> &#x003C; 0.05), except when the decoding window length was 2 s in the 0-dB MFR condition after attention switching and in the &#x2013;6-dB MFR condition both before and after attention switching between the two speech streams (<italic>P</italic> &#x003E; 0.05). The AAD accuracy increased significantly as the decision window lengthened, both before and after the auditory attention switching, in all three MFR conditions (all <italic>P</italic> &#x003C; 0.05). In both the 6- and 0-dB MFR conditions, the AAD accuracies were significantly reduced after attention switching with both <italic>D</italic><sub><italic>unified</italic></sub> and <italic>D</italic><sub><italic>segmented</italic></sub> (all <italic>P</italic> &#x003C; 0.05), suggesting that the switching of auditory attention in the competing speaker scenes affected the AAD performance. 
Only a marginal decrease in AAD accuracy was observed after attention switched from the male to the female speaker stream in the &#x2013;6-dB MFR condition across the five decoding window lengths, indicating that the increased SMR level could compensate for the decrease in AAD accuracy after attention switching.</p>
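<p>As a minimal sketch of how a window-wise AAD accuracy of this kind can be computed, the Python fragment below assumes the common correlation-comparison rule: within each decision window, the reconstructed envelope is correlated with both speech envelopes, and the window counts as correctly decoded when the correlation with the target envelope exceeds that with the ignored envelope. The function names (<italic>pearson_r</italic>, <italic>aad_accuracy</italic>), the 64-Hz sampling rate, and the toy signals are illustrative assumptions and are not taken from the analysis pipeline of this study.</p>
<preformat>
```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant signal; use 0
    return cov / math.sqrt(var_x * var_y)


def aad_accuracy(reconstructed, target_env, ignored_env, fs, window_s):
    """Percentage of non-overlapping windows in which r_tar > r_ign."""
    win = int(round(fs * window_s))
    n_windows = len(reconstructed) // win
    if n_windows == 0:
        raise ValueError("signal shorter than one decision window")
    correct = 0
    for k in range(n_windows):
        seg = slice(k * win, (k + 1) * win)
        r_tar = pearson_r(reconstructed[seg], target_env[seg])
        r_ign = pearson_r(reconstructed[seg], ignored_env[seg])
        if r_tar > r_ign:
            correct += 1
    return 100.0 * correct / n_windows


# Toy example (hypothetical signals): the "reconstruction" is mostly the
# target envelope plus a small contribution from the ignored envelope.
fs = 64
n = fs * 60
target = [math.sin(2 * math.pi * 3.0 * t / fs) for t in range(n)]
ignored = [math.sin(2 * math.pi * 5.0 * t / fs + 1.0) for t in range(n)]
recon = [0.8 * a + 0.2 * b for a, b in zip(target, ignored)]

for window_s in (2, 5, 10, 20, 30):
    print(window_s, aad_accuracy(recon, target, ignored, fs, window_s))
```
</preformat>
<p>With real EEG reconstructions, shorter windows contain fewer samples, so the two correlation estimates are noisier and the comparison fails more often, which is one plausible reading of the lower accuracies at the 2-s window.</p>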
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>The averaged AAD accuracies and their standard deviations (in %, mean/standard deviation) decoded by <italic>D</italic><sub><italic>unified</italic></sub> and <italic>D</italic><sub><italic>segmented</italic></sub> using different decoding window lengths (i.e., 2, 5, 10, 20, and 30 s) before and after the switching of attention from the male to the female speaker stream under the 6&#x2013;, 0&#x2013;, and &#x2013;6-dB MFR conditions.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left" colspan="2">Decoding window length</td>
<td valign="top" align="center" colspan="2">MFR = 6 dB<hr/></td>
<td valign="top" align="center" colspan="2">MFR = 0 dB<hr/></td>
<td valign="top" align="center" colspan="2">MFR = &#x2013;6 dB<hr/></td>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="center">The first half</td>
<td valign="top" align="center">The latter half</td>
<td valign="top" align="center">The first half</td>
<td valign="top" align="center">The latter half</td>
<td valign="top" align="center">The first half</td>
<td valign="top" align="center">The latter half</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><italic>D</italic><sub><italic>unified</italic></sub></td>
<td valign="top" align="center">2 s</td>
<td valign="top" align="center">61.17/1.37</td>
<td valign="top" align="center">57.58/1.76</td>
<td valign="top" align="center">62.71/1.62</td>
<td valign="top" align="center">58.63/1.85</td>
<td valign="top" align="center">62.04/0.86</td>
<td valign="top" align="center">57.58/1.84</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">5 s</td>
<td valign="top" align="center">65.63/2.78</td>
<td valign="top" align="center">61.88/3.03</td>
<td valign="top" align="center">69.17/2.26</td>
<td valign="top" align="center">65.21/2.21</td>
<td valign="top" align="center">67.40/1.39</td>
<td valign="top" align="center">61.88/3.21</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">10 s</td>
<td valign="top" align="center">69.58/2.82</td>
<td valign="top" align="center">66.67/3.74</td>
<td valign="top" align="center">77.40/2.52</td>
<td valign="top" align="center">70.21/3.31</td>
<td valign="top" align="center">74.58/1.85</td>
<td valign="top" align="center">66.67/3.95</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">20 s</td>
<td valign="top" align="center">78.57/4.84</td>
<td valign="top" align="center">76.25/5.49</td>
<td valign="top" align="center">87.50/2.86</td>
<td valign="top" align="center">76.88/4.98</td>
<td valign="top" align="center">85.00/2.79</td>
<td valign="top" align="center">76.25/6.18</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">30 s</td>
<td valign="top" align="center">88.75/3.04</td>
<td valign="top" align="center">80.00/6.17</td>
<td valign="top" align="center">86.86/2.76</td>
<td valign="top" align="center">77.50/4.72</td>
<td valign="top" align="center">88.00/3.16</td>
<td valign="top" align="center">80.00/5.30</td>
</tr>
<tr>
<td valign="top" align="left"><italic>D</italic><sub><italic>segmented</italic></sub></td>
<td valign="top" align="center">2 s</td>
<td valign="top" align="center">71.08/0.94</td>
<td valign="top" align="center">64.50/1.87</td>
<td valign="top" align="center">71.92/1.38</td>
<td valign="top" align="center">59.17/0.99</td>
<td valign="top" align="center">63.89/0.86</td>
<td valign="top" align="center">56.83/1.81</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">5 s</td>
<td valign="top" align="center">87.50/0.93</td>
<td valign="top" align="center">73.04/2.20</td>
<td valign="top" align="center">87.29/2.41</td>
<td valign="top" align="center">72.29/1.36</td>
<td valign="top" align="center">77.69/1.39</td>
<td valign="top" align="center">77.91/2.69</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">10 s</td>
<td valign="top" align="center">94.17/1.09</td>
<td valign="top" align="center">76.67/3.12</td>
<td valign="top" align="center">92.08/1.77</td>
<td valign="top" align="center">75.88/1.77</td>
<td valign="top" align="center">80.04/1.85</td>
<td valign="top" align="center">79.17/2.69</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">20 s</td>
<td valign="top" align="center">97.50/1.56</td>
<td valign="top" align="center">87.50/3.90</td>
<td valign="top" align="center">95.00/2.06</td>
<td valign="top" align="center">83.75/2.28</td>
<td valign="top" align="center">93.75/2.66</td>
<td valign="top" align="center">86.25/3.23</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">30 s</td>
<td valign="top" align="center">98.75/1.14</td>
<td valign="top" align="center">90.00/3.06</td>
<td valign="top" align="center">93.75/2.21</td>
<td valign="top" align="center">88.13/3.21</td>
<td valign="top" align="center">91.88/1.71</td>
<td valign="top" align="center">91.25/2.94</td>
</tr>
</tbody>
</table></table-wrap>
<p>The ITRs were also statistically analyzed with the non-parametric Kruskal&#x2013;Wallis test to assess the sensitivity of the AAD system. <xref ref-type="fig" rid="F4">Figure 4C</xref> displays the effects of attention switching, decoding model (<italic>D</italic><sub><italic>unified</italic></sub> and <italic>D</italic><sub><italic>segmented</italic></sub>), and MFR level on the ITRs. The <italic>D</italic><sub><italic>segmented</italic></sub> model yielded higher ITRs than the <italic>D</italic><sub><italic>unified</italic></sub> model before and after the switching of auditory attention at all MFR levels (<italic>P</italic> &#x003C; 0.05), suggesting a significant improvement in AAD accuracy with the speech-RMS-level&#x2013;based decoding model. Significantly higher ITRs were observed in the 6- and 0-dB MFR conditions than in the &#x2013;6-dB MFR condition in the first half (i.e., before the attention switching). <italic>Post hoc</italic> analysis showed that the significant differences occurred at the short decision window lengths (i.e., 2, 5, and 10 s; all <italic>P</italic> &#x003C; 0.01). In the latter half (i.e., after the switching of attention), a significantly higher ITR was found in the 6-dB MFR condition than in the 0- and &#x2013;6-dB conditions with the 2-s decision window (&#x03C7;<sup>2</sup> = 7.02, <italic>P</italic> = 0.03). There were no significant differences in ITRs across the five decision window lengths under any MFR condition using <italic>D</italic><sub><italic>unified</italic></sub> (all <italic>P</italic> &#x003E; 0.05). Regarding the effect of attention switching, the ITRs decreased significantly in the 6- and 0-dB MFR conditions after auditory attention switched between the two competing speakers using <italic>D</italic><sub><italic>segmented</italic></sub> (all <italic>P</italic> &#x003C; 0.05). 
In the &#x2013;6-dB MFR condition, attention switching had no significant effect on the ITR decoded by <italic>D</italic><sub><italic>segmented</italic></sub> (&#x03C7;<sup>2</sup> = 1.33, <italic>P</italic> = 0.25). No significant effects of attention switching were found with the <italic>D</italic><sub><italic>unified</italic></sub> model in any of the three MFR conditions (all <italic>P</italic> &#x003E; 0.05).</p>
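<p>The ITR depends jointly on decoding accuracy and decision window length. As a hedged illustration, the Python sketch below uses the Wolpaw definition that is common in the brain&#x2013;computer interface literature; this definition is assumed here for illustration only, and the definition actually applied in this study may differ. The function name <italic>itr_bits_per_min</italic> and its parameters are hypothetical. The example accuracies are the <italic>D</italic><sub><italic>segmented</italic></sub> values for the first half of the 6-dB MFR condition in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<preformat>
```python
import math


def itr_bits_per_min(accuracy, window_s, n_classes=2):
    """Wolpaw information transfer rate in bits per minute (assumed definition).

    accuracy  : proportion of correct decisions, between 0 and 1
    window_s  : decision window length in seconds
    n_classes : number of candidate speakers (two in a two-speaker scene)
    """
    if accuracy > 1.0 or 0.0 > accuracy:
        raise ValueError("accuracy must lie between 0 and 1")
    p = accuracy
    bits = math.log2(n_classes)
    if p > 0.0:
        bits += p * math.log2(p)
    if 1.0 > p:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_classes - 1))
    # One decision is made per window, so 60 / window_s decisions per minute.
    return bits * 60.0 / window_s


# Accuracies (in %) from Table 1: D_segmented, MFR = 6 dB, first half.
table1_segmented_6db_first = {2: 71.08, 5: 87.50, 10: 94.17, 20: 97.50, 30: 98.75}

for window_s, acc_percent in table1_segmented_6db_first.items():
    print(window_s, round(itr_bits_per_min(acc_percent / 100.0, window_s), 2))
```
</preformat>
<p>Under this assumed definition, a shorter window allows more decisions per minute, so a moderate accuracy at a short window can yield a higher ITR than a near-ceiling accuracy at a long window.</p>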
</sec>
</sec>
</sec>
<sec id="S4" sec-type="discussion">
<title>Discussion</title>
<p>The present study aimed to develop objective biomarkers, based on neural speech tracking, for estimating dynamic auditory attention states in competing speaker auditory scenes. It also explored the effects of RMS-level&#x2013;based speech segmentation and SMR level on the AAD performance under dynamic changes of attention states. This work provided several important and novel findings for better understanding the neural mechanisms of target speech perception in complex auditory scenes. First, the switching of auditory attention from one speaker stream to the other can be detected from the corresponding EEG responses at short time lags (i.e., the first positive TRF deflection at approximately 100 ms). Second, the cortical tracking of the target speech differed between higher- and lower-RMS-level speech segments. On the basis of these different neural responses, the RMS-level&#x2013;based segmented model improved the accuracy and sensitivity of the neuro-steered AAD system. Third, the SMR level and the attentional state (before or after the attentional shift) jointly affected the attention decoding performance in the competing speaker auditory scenes. The AAD accuracy was robust across a wide range of SMR levels and was also sensitive to the switching of auditory attention.</p>
<sec id="S4.SS1">
<title>Effect of Root Mean Square-Level&#x2013;Based Segmentation on Decoding Auditory Attention States</title>
<p>In line with previous findings (e.g., <xref ref-type="bibr" rid="B57">Wang et al., 2019</xref>, <xref ref-type="bibr" rid="B58">2020a</xref>), this study also showed significantly different neural responses to higher- and lower-RMS-level speech segments when subjects concentrated their attention on one of the speaker streams in the competing speaker conditions. Significantly higher TRF weights were found for lower-RMS-level speech segments than for higher-RMS-level speech segments, indicating high correlations between neural responses and speech envelopes in lower-RMS-level segments. These results implied that the total energy of the neural response evoked by lower-RMS-level speech segments was stronger than that evoked by higher-RMS-level speech segments. Not only the relative RMS level but also the speech features carried in higher- and lower-RMS-level speech segments could be contributing factors to target speech perception in noisy environments. More specifically, higher-RMS-level speech segments contained most of the voiced parts of the whole utterance, whereas lower-RMS-level speech segments carried the most rapidly changing components, such as the sections of abrupt increase and decrease within the whole utterance (e.g., <xref ref-type="bibr" rid="B6">Chen and Loizou, 2011</xref>, <xref ref-type="bibr" rid="B7">2012</xref>; <xref ref-type="bibr" rid="B8">Chen and Wong, 2013</xref>). The large TRF responses to lower-RMS-level speech segments were consistent with the previous findings that cortical responses were sensitive to abrupt changes within the auditory stimulus (<xref ref-type="bibr" rid="B5">Chait et al., 2005</xref>; <xref ref-type="bibr" rid="B50">Somervail et al., 2021</xref>).</p>
<p>Furthermore, this study found that the switching of auditory attention had different effects on the cortical responses to higher- and lower-RMS-level speech segments. After the switching of attention from the male to the female speaker stream, a significant decrease of the first positive component in the TRF responses (approximately 100-ms time lag) was observed for both higher- and lower-RMS-level speech segments. These results were consistent with previous findings in ERP studies that the early component (e.g., P100) was related to the attention-dependent modulation (<xref ref-type="bibr" rid="B49">Shuai and Elhilali, 2014</xref>). Although lower-RMS-level speech segments showed stronger TRF weights than higher-RMS-level speech segments for all three typical components, attention switching did not significantly modulate the cortical responses to lower-RMS-level speech segments in the first negative and second positive TRF components. In addition, the TRF weights for lower-RMS-level speech segments were sensitive to changes in the SMR level. These results suggested that the lower-RMS-level segments were easily affected by environmental factors (e.g., the intensity of the competing speech stream) (<xref ref-type="bibr" rid="B3">Billings et al., 2009</xref>). Compared with the cortical responses to lower-RMS-level speech segments, the TRF responses to higher-RMS-level segments were robust to SMR level changes and sensitive to the modulation of auditory attention. Higher-RMS-level speech segments, which included more complex speech cues (e.g., semantic information and language structures), could be primarily influenced by endogenous factors (e.g., selective auditory attention) rather than exogenous variables (e.g., SMR levels) (<xref ref-type="bibr" rid="B27">Getzmann et al., 2017</xref>). 
Briefly, this study demonstrated that, under the dynamic auditory attention states, the auditory system recruited different neural response patterns to track higher- and lower-RMS-level speech segments under different SMR conditions.</p>
<p>The effects of RMS-level&#x2013;based segmentation on the AAD performance were further explored on the basis of the different neural responses to higher- and lower-RMS-level speech segments under dynamic changes of attentional states. According to our previous investigation, the speech-RMS-level&#x2013;based segmented AAD model could improve AAD sensitivity and accuracy when subjects concentrated on a specific speech stream during the whole experiment (<xref ref-type="bibr" rid="B56">Wang, 2021</xref>). This study further demonstrated that the segmented AAD model improved the AAD accuracy not only under sustained attention but also when attention was transferred from one speech stream to the other in a competing speaker environment (see <xref ref-type="fig" rid="F4">Figure 4B</xref>). The better performance of the segmented AAD model could be attributed to the accurate detection of temporal gaps, because the temporal gaps in continuous sentences can facilitate target speech perception in noisy environments (e.g., <xref ref-type="bibr" rid="B40">Li and Loizou, 2007</xref>; <xref ref-type="bibr" rid="B55">Vestergaard et al., 2011</xref>). Many studies have also suggested that the regular structure of temporal gaps within continuous sentences entrains low-frequency neural oscillations to track the target speech streams under selective attention (<xref ref-type="bibr" rid="B31">Hickok and Poeppel, 2007</xref>; <xref ref-type="bibr" rid="B62">Zoefel, 2018</xref>). Correspondingly, lower-RMS-level speech segments contained the temporal gaps (i.e., the silent regions) and weak consonants (e.g., fricatives, stops, and nasals) of a sentence, whereas higher-RMS-level speech segments carried most of the sonorous parts within an utterance (<xref ref-type="bibr" rid="B6">Chen and Loizou, 2011</xref>; <xref ref-type="bibr" rid="B8">Chen and Wong, 2013</xref>). 
Hence, the prior knowledge of speech-RMS-level segmentation provided detailed temporal information about the speech, so that the <italic>D</italic><sub><italic>segmented</italic></sub> method could decode the target speech streams more accurately from neural activities. The AAD accuracy calculated by the <italic>D</italic><sub><italic>segmented</italic></sub> method was affected not only by the reconstruction performance of target speech envelopes but also by the classification performance of higher- and lower-RMS-level segments under different experimental conditions. As displayed in <xref ref-type="fig" rid="F3">Figure 3</xref>, the classification accuracy of higher- and lower-RMS-level speech segments decreased when attention switched from the male to the female speaker stream. When auditory attention was switched between competing speakers, neural resources related to the target auditory object needed to be redistributed through the modulation of selective auditory attention (e.g., <xref ref-type="bibr" rid="B23">Fritz et al., 2007</xref>; <xref ref-type="bibr" rid="B48">Shamma and Micheyl, 2010</xref>). Because the auditory system was required to release the resources related to the previously attended stream and activate the resources belonging to the newly attended auditory object, a weak gain of the attention modulation could occur and lead to poor neural tracking after the switching of attention (e.g., <xref ref-type="bibr" rid="B27">Getzmann et al., 2017</xref>; <xref ref-type="bibr" rid="B41">Miran et al., 2018</xref>). Hence, the AAD accuracy was reduced after auditory attention switched from the male to the female speaker stream. This study indicated that the speech-level&#x2013;based segmented decoding model not only had better AAD performance with sustained auditory attention but also improved the AAD performance after the switching of auditory attention in complex auditory scenes. 
These results provided evidence that the segmented AAD model had the potential to decode auditory attention in real-life applications with the dynamic change of attention states.</p>
</sec>
<sec id="S4.SS2">
<title>Interactions Between Attention Switching and Signal-to-Masker Ratio Levels on the Auditory Attention Decoding System</title>
<p>In a competing speaker environment, the SMR level is an important factor affecting target speech perception, and target speech intelligibility is reduced as the SMR level decreases (<xref ref-type="bibr" rid="B4">Brungart, 2001</xref>; <xref ref-type="bibr" rid="B3">Billings et al., 2009</xref>). Nevertheless, cortical responses show robust phase locking to the target speech envelopes over a large range of SMR levels (e.g., <xref ref-type="bibr" rid="B19">Ding and Simon, 2012b</xref>; <xref ref-type="bibr" rid="B44">O&#x2019;Sullivan et al., 2015</xref>). These reliable cortical responses to the target speech envelope were associated with attentional gain control and the long-term integration of slow temporal modulations in the human auditory cortex (<xref ref-type="bibr" rid="B37">Lalor et al., 2009</xref>; <xref ref-type="bibr" rid="B35">Kerlin et al., 2010</xref>). In line with previous studies (e.g., <xref ref-type="bibr" rid="B19">Ding and Simon, 2012b</xref>; <xref ref-type="bibr" rid="B17">Di Liberto et al., 2015</xref>; <xref ref-type="bibr" rid="B44">O&#x2019;Sullivan et al., 2015</xref>), this study also suggested that the neural responses were reliably synchronized to slow temporal fluctuations of the target speech with sustained attention under different SMR conditions (i.e., from 6 to &#x2013;6 dB). However, the effect of attention switching on the AAD performance under diverse SMR conditions remained unclear. Studies have illustrated the effect of attention switching between co-located competing speakers with equal RMS levels of sound amplitude, suggesting that the TRF response carried effective biomarkers to estimate the auditory attention states (e.g., <xref ref-type="bibr" rid="B2">Akram et al., 2016</xref>; <xref ref-type="bibr" rid="B41">Miran et al., 2018</xref>, <xref ref-type="bibr" rid="B42">2020</xref>). 
On the basis of these findings, the current study further explored the joint effect of attention switching and SMR level on the AAD performance without a spatial difference between speakers. To evaluate the AAD ability with attention switching from moderate to severe SMR conditions, the relative power ratios between the male and female speaker streams were fixed in this study, and thus, the SMR level could change when attention switched from the male to the female speaker stream. Results demonstrated that the cortical responses can be used to decode the switching of auditory attention with increased SMRs (from &#x2013;6 to 6 dB SMR in the &#x2013;6-dB MFR condition), unchanged SMRs (in the 0-dB MFR condition), and decreased SMRs (from 6 to &#x2013;6 dB SMR in the 6-dB MFR condition) within the continuous speech streams. A marginal decrease in AAD accuracy was observed after the switching of auditory attention in all three MFR conditions (see <xref ref-type="table" rid="T1">Table 1</xref> and <xref ref-type="fig" rid="F4">Figure 4B</xref>), which may be associated with the cost of attention switching. Compared with the conditions with decreased and unchanged SMRs after attention switching, the increased SMR could alleviate the decrease in AAD accuracy when attention switched between the two speakers. The AAD accuracy after the switching of auditory attention also showed larger individual differences than that before the switching. These individual differences implied that the AAD performance under dynamic changes of auditory attention states may be related to endogenous factors such as attentional gain control and the ability to predict important cues in the target speech (<xref ref-type="bibr" rid="B35">Kerlin et al., 2010</xref>; <xref ref-type="bibr" rid="B27">Getzmann et al., 2017</xref>), which warrants further investigation in the future.</p>
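<p>Because the male-to-female power ratio was fixed within a trial, the SMR experienced by the listener changes sign when the target switches from the male to the female speaker. The short Python sketch below makes this bookkeeping explicit; the function name <italic>smr_db</italic> and its arguments are hypothetical names introduced only for illustration.</p>
<preformat>
```python
def smr_db(mfr_db, target):
    """SMR (in dB) of the attended stream for a fixed male-to-female ratio."""
    if target == "male":
        return mfr_db    # the male stream is the signal, the female the masker
    if target == "female":
        return -mfr_db   # the roles swap, so the ratio in dB changes sign
    raise ValueError("target must be 'male' or 'female'")


# SMR before (male target) and after (female target) the attention switch.
for mfr in (6, 0, -6):
    print(mfr, smr_db(mfr, "male"), smr_db(mfr, "female"))
```
</preformat>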
</sec>
<sec id="S4.SS3">
<title>Objective Neural Markers of Auditory Attention States</title>
<p>Neuroimaging studies using magnetoencephalography have illustrated that the magnitude of the TRF component at approximately 100-ms lag is a reliable attention marker, because the TRF responses at 100-ms lag to the target speaker were larger than those to the ignored speaker (<xref ref-type="bibr" rid="B18">Ding and Simon, 2012a</xref>; <xref ref-type="bibr" rid="B2">Akram et al., 2016</xref>; <xref ref-type="bibr" rid="B42">Miran et al., 2020</xref>). In this study, the TRF responses obtained from EEG signals also showed a reliable marker, at a latency of approximately 100 ms, that was modulated by the switching of auditory attention. Specifically, compared with the other typical TRF components, the TRF weight at the first positive component showed reliable effects of attention switching for both higher- and lower-RMS-level speech segments over a large range of SMR levels (i.e., from &#x2013;6 to 6 dB) in this study. The observed changes of the TRF component at approximately 100-ms lag with attention switching were in agreement with previous findings in ERP studies that the peak of the P1 component was modulated by purely top-down attention and marked the initiation of a new auditory stream of the ongoing stream (<xref ref-type="bibr" rid="B60">Winkler et al., 2009</xref>; <xref ref-type="bibr" rid="B49">Shuai and Elhilali, 2014</xref>). These results suggested that the encoder model not only reflected the precision of neural tracking of the target speech but also provided an objective biomarker to index the dynamic attention states (e.g., before and after the switching of attention). In addition, the present study revealed a decrease in AAD accuracy after the auditory attention switching (see <xref ref-type="table" rid="T1">Table 1</xref>), suggesting that the fluctuation of AAD accuracy may also be an indicator of the switching of auditory attention in a competing speaker environment. 
The <italic>D</italic><sub><italic>segmented</italic></sub> method showed higher ITRs than the <italic>D</italic><sub><italic>unified</italic></sub> method in the neural-based AAD system, especially with the short decoding window lengths (i.e., 2, 5, and 10 s) in various experimental conditions. The better performance of the segmented model with short decision window lengths suggested that the AAD accuracy derived from the <italic>D</italic><sub><italic>segmented</italic></sub> decoder could also be an effective indicator for evaluating the dynamic change of the auditory attention states.</p>
</sec>
<sec id="S4.SS4">
<title>Limitations of This Work</title>
<p>This study mainly explored the joint effects of the auditory attention states, SMRs, and higher/lower-RMS-level&#x2013;based segments on cortical responses to the target speech streams, and investigated the AAD performance of the speech-level&#x2013;based segmented computational model under different experimental conditions. Hence, other crucial characteristics of the competing speakers were fixed in this experiment. Specifically, this study only examined the switching of auditory attention from the male speaker to the female speaker under different MFR conditions. Nevertheless, cortical responses are influenced by a number of voice characteristics (e.g., fundamental frequency differences between the competing speakers) in complex auditory scenes (e.g., <xref ref-type="bibr" rid="B54">van Canneyt et al., 2021</xref>). Further research should systematically investigate the effects of other features (e.g., speaker gender, number of speakers, and target-to-masker ratios) on the cortical tracking of the target speech streams in complex auditory scenarios with dynamic changes of auditory attention.</p>
</sec>
</sec>
<sec id="S5" sec-type="conclusion">
<title>Conclusion</title>
<p>This study investigated the effects of different RMS-level&#x2013;based speech segments and SMR levels on the cortical tracking of the target speech with sustained and switched auditory attention. The present study also explored effective objective indicators for reflecting dynamic attention states from EEG recordings in competing speaker environments. The novel findings in this study included the following: (a) the TRF response at approximately 100-ms time lag could sensitively index the switching of auditory attention from one speaker stream to the other; (b) higher- and lower-RMS-level speech segments made different and crucial contributions to the cortical tracking of the target speech with both sustained and switched auditory attention. On the basis of the specific neural patterns for the different RMS-level segments, the segmented AAD model, which provided more exact temporal structures of the target speech, improved the AAD performance for dynamic attention states; (c) the segmented AAD model could be used to robustly decode the dynamically changing target speech streams according to listeners&#x2019; intentions under different SMR conditions, even when using a short decoding window length.</p>
<p>In conclusion, TRF responses and AAD accuracies could be considered objective indicators for estimating the auditory attention states even in poor SMR conditions and with short decision window lengths. The RMS-level&#x2013;based segmented AAD model also showed sensitive and reliable decoding performance with attentional switching. The results of this work provided neural evidence for understanding the contributions of different speech features to the cortical response to the target speech under the dynamic modulation of auditory attention. These results also provided potential guidance for the design of AAD algorithms in neurofeedback control systems under complex auditory scenarios.</p>
</sec>
<sec id="S6" sec-type="data-availability">
<title>Data Availability Statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="S7">
<title>Ethics Statement</title>
<p>The studies involving human participants were reviewed and approved by the Institution&#x2019;s Ethical Review Board of Southern University of Science and Technology. The patients/participants provided their written informed consent to participate in this study.</p>
</sec>
<sec id="S8">
<title>Author Contributions</title>
<p>LW contributed to the design and implementation of the experiments, the analysis and interpretation of data, and the writing of the manuscript. YW and ZL performed data acquisition. EW contributed to the revision of the manuscript and final approval of the submitted version. FC contributed to the design of experiments, the interpretation of data, the revision of the manuscript, and the final approval of the submitted version. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="conf1" sec-type="COI-statement">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="pudiscl1" sec-type="disclaimer">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec id="S9" sec-type="funding-information">
<title>Funding</title>
<p>This work was supported by the National Natural Science Foundation of China (Grant No. 61971212), Shenzhen Sustainable Support Program for High-level University (Grant No. 20200925154002001), and High-level University Fund G02236002 of Southern University of Science and Technology.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahveninen</surname> <given-names>J.</given-names></name> <name><surname>Huang</surname> <given-names>S.</given-names></name> <name><surname>Belliveau</surname> <given-names>J. W.</given-names></name> <name><surname>Chang</surname> <given-names>W. T.</given-names></name> <name><surname>H&#x00E4;m&#x00E4;l&#x00E4;inen</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>Dynamic oscillatory processes governing cued orienting and allocation of auditory attention.</article-title> <source><italic>J. Cogn. Neurosci.</italic></source> <volume>25</volume> <fpage>1926</fpage>&#x2013;<lpage>1943</lpage>. <pub-id pub-id-type="doi">10.1162/jocn_a_00452</pub-id> <pub-id pub-id-type="pmid">23915050</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akram</surname> <given-names>S.</given-names></name> <name><surname>Simon</surname> <given-names>J. Z.</given-names></name> <name><surname>Babadi</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). <article-title>Dynamic estimation of the auditory temporal response function from MEG in competing-speaker environments.</article-title> <source><italic>IEEE Trans. Biomed. Eng.</italic></source> <volume>64</volume> <fpage>1896</fpage>&#x2013;<lpage>1905</lpage>. <pub-id pub-id-type="doi">10.1109/TBME.2016.2628884</pub-id> <pub-id pub-id-type="pmid">28113290</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Billings</surname> <given-names>C. J.</given-names></name> <name><surname>Tremblay</surname> <given-names>K. L.</given-names></name> <name><surname>Stecker</surname> <given-names>G. C.</given-names></name> <name><surname>Tolin</surname> <given-names>W. M.</given-names></name></person-group> (<year>2009</year>). <article-title>Human evoked cortical activity to signal-to-noise ratio and absolute signal level.</article-title> <source><italic>Hear. Res.</italic></source> <volume>254</volume> <fpage>15</fpage>&#x2013;<lpage>24</lpage>. <pub-id pub-id-type="doi">10.1016/j.heares.2009.04.002</pub-id> <pub-id pub-id-type="pmid">19364526</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brungart</surname> <given-names>D. S.</given-names></name></person-group> (<year>2001</year>). <article-title>Informational and energetic masking effects in the perception of two simultaneous talkers.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>109</volume> <fpage>1101</fpage>&#x2013;<lpage>1109</lpage>. <pub-id pub-id-type="doi">10.1121/1.1345696</pub-id> <pub-id pub-id-type="pmid">11303924</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chait</surname> <given-names>M.</given-names></name> <name><surname>Poeppel</surname> <given-names>D.</given-names></name> <name><surname>de Cheveign&#x00E9;</surname> <given-names>A.</given-names></name> <name><surname>Simon</surname> <given-names>J. Z.</given-names></name></person-group> (<year>2005</year>). <article-title>Human auditory cortical processing of changes in interaural correlation.</article-title> <source><italic>J. Neurosci.</italic></source> <volume>25</volume> <fpage>8518</fpage>&#x2013;<lpage>8527</lpage>.</citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>F.</given-names></name> <name><surname>Loizou</surname> <given-names>P. C.</given-names></name></person-group> (<year>2011</year>). <article-title>Predicting the intelligibility of vocoded and wideband Mandarin Chinese.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>129</volume> <fpage>3281</fpage>&#x2013;<lpage>3290</lpage>. <pub-id pub-id-type="doi">10.1121/1.3570957</pub-id> <pub-id pub-id-type="pmid">21568429</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>F.</given-names></name> <name><surname>Loizou</surname> <given-names>P. C.</given-names></name></person-group> (<year>2012</year>). <article-title>Contributions of cochlea-scaled entropy and consonant-vowel boundaries to prediction of speech intelligibility in noise.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>131</volume> <fpage>4104</fpage>&#x2013;<lpage>4113</lpage>. <pub-id pub-id-type="doi">10.1121/1.3695401</pub-id> <pub-id pub-id-type="pmid">22559382</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>F.</given-names></name> <name><surname>Wong</surname> <given-names>L. L.</given-names></name></person-group> (<year>2013</year>). &#x201C;<article-title>Contributions of the high-RMS-level segments to the intelligibility of mandarin sentences</article-title>,&#x201D; in <source><italic>2013 IEEE International Conference on Acoustics, Speech and Signal Processing</italic></source>, <publisher-loc>Piscataway</publisher-loc>. <fpage>7810</fpage>&#x2013;<lpage>7814</lpage>.</citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cherry</surname> <given-names>E. C.</given-names></name></person-group> (<year>1953</year>). <article-title>Some experiments on the recognition of speech, with one and with two ears.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>25</volume> <fpage>975</fpage>&#x2013;<lpage>979</lpage>. <pub-id pub-id-type="doi">10.1121/1.1907229</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Choi</surname> <given-names>I.</given-names></name> <name><surname>Rajaram</surname> <given-names>S.</given-names></name> <name><surname>Varghese</surname> <given-names>L. A.</given-names></name> <name><surname>Shinn-Cunningham</surname> <given-names>B. G.</given-names></name></person-group> (<year>2013</year>). <article-title>Quantifying attentional modulation of auditory-evoked cortical responses from single-trial electroencephalography.</article-title> <source><italic>Front. Human Neurosci.</italic></source> <volume>7</volume>:<issue>115</issue>. <pub-id pub-id-type="doi">10.3389/fnhum.2013.00115</pub-id> <pub-id pub-id-type="pmid">23576968</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ciccarelli</surname> <given-names>G.</given-names></name> <name><surname>Lan</surname> <given-names>M.</given-names></name> <name><surname>Perricone</surname> <given-names>J.</given-names></name> <name><surname>Calamia</surname> <given-names>P. T.</given-names></name> <name><surname>Haro</surname> <given-names>S.</given-names></name> <name><surname>O&#x2019;Sullivan</surname> <given-names>J.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>Comparison of two talker attention decoding from EEG with nonlinear neural networks and linear methods.</article-title> <source><italic>Sci. Rep.</italic></source> <volume>9</volume> <fpage>1</fpage>&#x2013;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-019-47795-0</pub-id> <pub-id pub-id-type="pmid">31395905</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cooke</surname> <given-names>M.</given-names></name></person-group> (<year>2006</year>). <article-title>A glimpsing model of speech perception in noise.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>119</volume> <fpage>1562</fpage>&#x2013;<lpage>1573</lpage>. <pub-id pub-id-type="doi">10.1121/1.2166600</pub-id> <pub-id pub-id-type="pmid">16583901</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Crosse</surname> <given-names>M. J.</given-names></name> <name><surname>Di Liberto</surname> <given-names>G. M.</given-names></name> <name><surname>Bednar</surname> <given-names>A.</given-names></name> <name><surname>Lalor</surname> <given-names>E. C.</given-names></name></person-group> (<year>2016</year>). <article-title>The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli.</article-title> <source><italic>Front. Human Neurosci.</italic></source> <volume>10</volume>:<issue>604</issue>. <pub-id pub-id-type="doi">10.3389/fnhum.2016.00604</pub-id> <pub-id pub-id-type="pmid">27965557</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Das</surname> <given-names>N.</given-names></name> <name><surname>Zegers</surname> <given-names>J.</given-names></name> <name><surname>Francart</surname> <given-names>T.</given-names></name> <name><surname>Bertrand</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Linear versus deep learning methods for noisy speech separation for EEG informed attention decoding.</article-title> <source><italic>J. Neural. Eng.</italic></source> <volume>17</volume>:<issue>46039</issue>. <pub-id pub-id-type="doi">10.1088/1741-2552/aba6f8</pub-id> <pub-id pub-id-type="pmid">32679578</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Delorme</surname> <given-names>A.</given-names></name> <name><surname>Makeig</surname> <given-names>S.</given-names></name></person-group> (<year>2004</year>). <article-title>EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis.</article-title> <source><italic>J. Neurosci. Methods</italic></source> <volume>134</volume> <fpage>9</fpage>&#x2013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1016/j.jneumeth.2003.10.009</pub-id> <pub-id pub-id-type="pmid">15102499</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>Y.</given-names></name> <name><surname>Reinhart</surname> <given-names>R. M.</given-names></name> <name><surname>Choi</surname> <given-names>I.</given-names></name> <name><surname>Shinn-Cunningham</surname> <given-names>B. G.</given-names></name></person-group> (<year>2019</year>). <article-title>Causal links between parietal alpha activity and spatial auditory attention.</article-title> <source><italic>Elife</italic></source> <volume>8</volume>:<issue>e51184</issue>. <pub-id pub-id-type="doi">10.7554/eLife.51184</pub-id> <pub-id pub-id-type="pmid">31782732</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Di Liberto</surname> <given-names>G. M.</given-names></name> <name><surname>O&#x2019;Sullivan</surname> <given-names>J. A.</given-names></name> <name><surname>Lalor</surname> <given-names>E. C.</given-names></name></person-group> (<year>2015</year>). <article-title>Low-frequency cortical entrainment to speech reflects phoneme-level processing.</article-title> <source><italic>Curr. Biol.</italic></source> <volume>25</volume> <fpage>2457</fpage>&#x2013;<lpage>2465</lpage>. <pub-id pub-id-type="doi">10.1016/j.cub.2015.08.030</pub-id> <pub-id pub-id-type="pmid">26412129</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>N.</given-names></name> <name><surname>Simon</surname> <given-names>J. Z.</given-names></name></person-group> (<year>2012a</year>). <article-title>Emergence of neural encoding of auditory objects while listening to competing speakers.</article-title> <source><italic>Proc. Nat. Acad. Sci.</italic></source> <volume>109</volume> <fpage>11854</fpage>&#x2013;<lpage>11859</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1205381109</pub-id> <pub-id pub-id-type="pmid">22753470</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>N.</given-names></name> <name><surname>Simon</surname> <given-names>J. Z.</given-names></name></person-group> (<year>2012b</year>). <article-title>Neural coding of continuous speech in auditory cortex during monaural and dichotic listening.</article-title> <source><italic>J. Neurophysiol.</italic></source> <volume>107</volume> <fpage>78</fpage>&#x2013;<lpage>89</lpage>. <pub-id pub-id-type="doi">10.1152/jn.00297.2011</pub-id> <pub-id pub-id-type="pmid">21975452</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Donchin</surname> <given-names>E.</given-names></name> <name><surname>Spencer</surname> <given-names>K. M.</given-names></name> <name><surname>Wijesinghe</surname> <given-names>R.</given-names></name></person-group> (<year>2000</year>). <article-title>The mental prosthesis: assessing the speed of a P300-based brain-computer interface.</article-title> <source><italic>IEEE Trans. Rehabil. Eng.</italic></source> <volume>8</volume> <fpage>174</fpage>&#x2013;<lpage>179</lpage>. <pub-id pub-id-type="doi">10.1109/86.847808</pub-id> <pub-id pub-id-type="pmid">10896179</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fogerty</surname> <given-names>D.</given-names></name> <name><surname>Kewley-Port</surname> <given-names>D.</given-names></name></person-group> (<year>2009</year>). <article-title>Perceptual contributions of the consonant-vowel boundary to sentence intelligibility.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>126</volume> <fpage>847</fpage>&#x2013;<lpage>857</lpage>. <pub-id pub-id-type="doi">10.1121/1.3159302</pub-id> <pub-id pub-id-type="pmid">19640049</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fritz</surname> <given-names>J. B.</given-names></name> <name><surname>David</surname> <given-names>S.</given-names></name> <name><surname>Shamma</surname> <given-names>S.</given-names></name></person-group> (<year>2013</year>). &#x201C;<article-title>Attention and dynamic, task-related receptive field plasticity in adult auditory cortex</article-title>,&#x201D; in <source><italic>Neural correlates of Auditory Cognition</italic></source>, (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>251</fpage>&#x2013;<lpage>291</lpage>.</citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fritz</surname> <given-names>J. B.</given-names></name> <name><surname>Elhilali</surname> <given-names>M.</given-names></name> <name><surname>David</surname> <given-names>S. V.</given-names></name> <name><surname>Shamma</surname> <given-names>S. A.</given-names></name></person-group> (<year>2007</year>). <article-title>Auditory attention&#x2014;focusing the searchlight on sound.</article-title> <source><italic>Curr. Opin. Neurobiol.</italic></source> <volume>17</volume> <fpage>437</fpage>&#x2013;<lpage>455</lpage>. <pub-id pub-id-type="doi">10.1016/j.conb.2007.07.011</pub-id> <pub-id pub-id-type="pmid">17714933</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geirnaert</surname> <given-names>S.</given-names></name> <name><surname>Francart</surname> <given-names>T.</given-names></name> <name><surname>Bertrand</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Fast EEG-based decoding of the directional focus of auditory attention using common spatial patterns.</article-title> <source><italic>IEEE Trans. Biomed. Eng</italic>.</source> <volume>68</volume> <fpage>1557</fpage>&#x2013;<lpage>1568</lpage>. <pub-id pub-id-type="doi">10.1109/TBME.2020.3033446</pub-id> <pub-id pub-id-type="pmid">33095706</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geravanchizadeh</surname> <given-names>M.</given-names></name> <name><surname>Gavgani</surname> <given-names>S. B.</given-names></name></person-group> (<year>2020</year>). <article-title>Selective auditory attention detection based on effective connectivity by single-trial EEG.</article-title> <source><italic>J. Neural Eng</italic>.</source> <volume>17</volume>:<issue>026021</issue>. <pub-id pub-id-type="doi">10.1088/1741-2552/ab7c8d</pub-id> <pub-id pub-id-type="pmid">32131059</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geravanchizadeh</surname> <given-names>M.</given-names></name> <name><surname>Roushan</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>Dynamic selective auditory attention detection using RNN and reinforcement learning.</article-title> <source><italic>Sci. Rep</italic>.</source> <volume>11</volume> <fpage>1</fpage>&#x2013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1038/s41598-021-94876-0</pub-id> <pub-id pub-id-type="pmid">34326401</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Getzmann</surname> <given-names>S.</given-names></name> <name><surname>Jasny</surname> <given-names>J.</given-names></name> <name><surname>Falkenstein</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Switching of auditory attention in &#x201C;cocktail-party&#x201D; listening: ERP evidence of cueing effects in younger and older adults.</article-title> <source><italic>Brain Cogn.</italic></source> <volume>111</volume> <fpage>1</fpage>&#x2013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1016/j.bandc.2016.09.006</pub-id> <pub-id pub-id-type="pmid">27814564</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Getzmann</surname> <given-names>S.</given-names></name> <name><surname>Klatt</surname> <given-names>L. I.</given-names></name> <name><surname>Schneider</surname> <given-names>D.</given-names></name> <name><surname>Begau</surname> <given-names>A.</given-names></name> <name><surname>Wascher</surname> <given-names>E.</given-names></name></person-group> (<year>2020</year>). <article-title>EEG correlates of spatial shifts of attention in a dynamic multi-talker speech perception scenario in younger and older adults.</article-title> <source><italic>Hear. Res.</italic></source> <volume>398</volume>:<issue>108077</issue>. <pub-id pub-id-type="doi">10.1016/j.heares.2020.108077</pub-id> <pub-id pub-id-type="pmid">32987238</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Greenberg</surname> <given-names>S.</given-names></name> <name><surname>Carvey</surname> <given-names>H.</given-names></name> <name><surname>Hitchcock</surname> <given-names>L.</given-names></name> <name><surname>Chang</surname> <given-names>S.</given-names></name></person-group> (<year>2003</year>). <article-title>Temporal properties of spontaneous speech&#x2014;a syllable-centric perspective.</article-title> <source><italic>J. Phonetics</italic></source> <volume>31</volume> <fpage>465</fpage>&#x2013;<lpage>485</lpage>.</citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hamilton</surname> <given-names>L. S.</given-names></name> <name><surname>Edwards</surname> <given-names>E.</given-names></name> <name><surname>Chang</surname> <given-names>E. F.</given-names></name></person-group> (<year>2018</year>). <article-title>A spatial map of onset and sustained responses to speech in the human superior temporal gyrus.</article-title> <source><italic>Curr. Biol.</italic></source> <volume>28</volume> <fpage>1860</fpage>&#x2013;<lpage>1871</lpage>. <pub-id pub-id-type="doi">10.1016/j.cub.2018.04.033</pub-id> <pub-id pub-id-type="pmid">29861132</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hickok</surname> <given-names>G.</given-names></name> <name><surname>Poeppel</surname> <given-names>D.</given-names></name></person-group> (<year>2007</year>). <article-title>The cortical organization of speech processing.</article-title> <source><italic>Nat. Rev. Neurosci.</italic></source> <volume>8</volume> <fpage>393</fpage>&#x2013;<lpage>402</lpage>. <pub-id pub-id-type="doi">10.1038/nrn2113</pub-id> <pub-id pub-id-type="pmid">17431404</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hoffmann</surname> <given-names>U.</given-names></name> <name><surname>Vesin</surname> <given-names>J. M.</given-names></name> <name><surname>Ebrahimi</surname> <given-names>T.</given-names></name> <name><surname>Diserens</surname> <given-names>K.</given-names></name></person-group> (<year>2008</year>). <article-title>An efficient P300-based brain&#x2013;computer interface for disabled subjects.</article-title> <source><italic>J. Neurosci. Methods</italic></source> <volume>167</volume> <fpage>115</fpage>&#x2013;<lpage>125</lpage>. <pub-id pub-id-type="doi">10.1016/j.jneumeth.2007.03.005</pub-id> <pub-id pub-id-type="pmid">17445904</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kates</surname> <given-names>J. M.</given-names></name> <name><surname>Arehart</surname> <given-names>K. H.</given-names></name></person-group> (<year>2005</year>). <article-title>Coherence and the speech intelligibility index.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>117</volume> <fpage>2224</fpage>&#x2013;<lpage>2237</lpage>. <pub-id pub-id-type="doi">10.1121/1.1862575</pub-id> <pub-id pub-id-type="pmid">15898663</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaya</surname> <given-names>E. M.</given-names></name> <name><surname>Elhilali</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Investigating bottom-up auditory attention.</article-title> <source><italic>Front. Human Neurosci.</italic></source> <volume>8</volume>:<issue>327</issue>. <pub-id pub-id-type="doi">10.3389/fnhum.2014.00327</pub-id> <pub-id pub-id-type="pmid">24904367</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kerlin</surname> <given-names>J. R.</given-names></name> <name><surname>Shahin</surname> <given-names>A. J.</given-names></name> <name><surname>Miller</surname> <given-names>L. M.</given-names></name></person-group> (<year>2010</year>). <article-title>Attentional gain control of ongoing cortical speech representations in a &#x201C;cocktail party&#x201D;.</article-title> <source><italic>J. Neurosci.</italic></source> <volume>30</volume> <fpage>620</fpage>&#x2013;<lpage>628</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.3631-09.2010</pub-id> <pub-id pub-id-type="pmid">20071526</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kong</surname> <given-names>Y. Y.</given-names></name> <name><surname>Mullangi</surname> <given-names>A.</given-names></name> <name><surname>Ding</surname> <given-names>N.</given-names></name></person-group> (<year>2014</year>). <article-title>Differential modulation of auditory responses to attended and unattended speech in different listening conditions.</article-title> <source><italic>Hear. Res.</italic></source> <volume>316</volume> <fpage>73</fpage>&#x2013;<lpage>81</lpage>. <pub-id pub-id-type="doi">10.1016/j.heares.2014.07.009</pub-id> <pub-id pub-id-type="pmid">25124153</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lalor</surname> <given-names>E. C.</given-names></name> <name><surname>Power</surname> <given-names>A. J.</given-names></name> <name><surname>Reilly</surname> <given-names>R. B.</given-names></name> <name><surname>Foxe</surname> <given-names>J. J.</given-names></name></person-group> (<year>2009</year>). <article-title>Resolving precise temporal processing properties of the auditory system using continuous stimuli.</article-title> <source><italic>J. Neurophysiol.</italic></source> <volume>102</volume> <fpage>349</fpage>&#x2013;<lpage>359</lpage>. <pub-id pub-id-type="doi">10.1152/jn.90896.2008</pub-id> <pub-id pub-id-type="pmid">19439675</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Larson</surname> <given-names>E.</given-names></name> <name><surname>Lee</surname> <given-names>A. K.</given-names></name></person-group> (<year>2014</year>). <article-title>Switching auditory attention using spatial and non-spatial features recruits different cortical networks.</article-title> <source><italic>NeuroImage</italic></source> <volume>84</volume> <fpage>681</fpage>&#x2013;<lpage>687</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2013.09.061</pub-id> <pub-id pub-id-type="pmid">24096028</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>A. K.</given-names></name> <name><surname>Larson</surname> <given-names>E.</given-names></name> <name><surname>Maddox</surname> <given-names>R. K.</given-names></name> <name><surname>Shinn-Cunningham</surname> <given-names>B. G.</given-names></name></person-group> (<year>2014</year>). <article-title>Using neuroimaging to understand the cortical mechanisms of auditory selective attention.</article-title> <source><italic>Hear. Res.</italic></source> <volume>307</volume> <fpage>111</fpage>&#x2013;<lpage>120</lpage>. <pub-id pub-id-type="doi">10.1016/j.heares.2013.06.010</pub-id> <pub-id pub-id-type="pmid">23850664</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>N.</given-names></name> <name><surname>Loizou</surname> <given-names>P. C.</given-names></name></person-group> (<year>2007</year>). <article-title>Factors influencing glimpsing of speech in noise.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>122</volume> <fpage>1165</fpage>&#x2013;<lpage>1172</lpage>. <pub-id pub-id-type="doi">10.1121/1.2749454</pub-id> <pub-id pub-id-type="pmid">17672662</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miran</surname> <given-names>S.</given-names></name> <name><surname>Akram</surname> <given-names>S.</given-names></name> <name><surname>Sheikhattar</surname> <given-names>A.</given-names></name> <name><surname>Simon</surname> <given-names>J. Z.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name> <name><surname>Babadi</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>Real-time tracking of selective auditory attention from M/EEG: A bayesian filtering approach.</article-title> <source><italic>Front. Neurosci.</italic></source> <volume>12</volume>:<issue>262</issue>. <pub-id pub-id-type="doi">10.3389/fnins.2018.00262</pub-id> <pub-id pub-id-type="pmid">29765298</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miran</surname> <given-names>S.</given-names></name> <name><surname>Presacco</surname> <given-names>A.</given-names></name> <name><surname>Simon</surname> <given-names>J. Z.</given-names></name> <name><surname>Fu</surname> <given-names>M. C.</given-names></name> <name><surname>Marcus</surname> <given-names>S. I.</given-names></name> <name><surname>Babadi</surname> <given-names>B.</given-names></name></person-group> (<year>2020</year>). <article-title>Dynamic estimation of auditory temporal response functions <italic>via</italic> state-space models with gaussian mixture process noise.</article-title> <source><italic>PLoS Comp. Biol.</italic></source> <volume>16</volume>:<issue>e1008172</issue>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1008172</pub-id> <pub-id pub-id-type="pmid">32813712</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>N&#x00E4;&#x00E4;t&#x00E4;nen</surname> <given-names>R.</given-names></name> <name><surname>Teder</surname> <given-names>W.</given-names></name> <name><surname>Alho</surname> <given-names>K.</given-names></name> <name><surname>Lavikainen</surname> <given-names>J.</given-names></name></person-group> (<year>1992</year>). <article-title>Auditory attention and selective input modulation: a topographical ERP study.</article-title> <source><italic>Neuroreport</italic></source> <volume>3</volume> <fpage>493</fpage>&#x2013;<lpage>496</lpage>. <pub-id pub-id-type="doi">10.1097/00001756-199206000-00009</pub-id> <pub-id pub-id-type="pmid">1391755</pub-id></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x2019;Sullivan</surname> <given-names>J. A.</given-names></name> <name><surname>Power</surname> <given-names>A. J.</given-names></name> <name><surname>Mesgarani</surname> <given-names>N.</given-names></name> <name><surname>Rajaram</surname> <given-names>S.</given-names></name> <name><surname>Foxe</surname> <given-names>J. J.</given-names></name> <name><surname>Shinn-Cunningham</surname> <given-names>B. G.</given-names></name><etal/></person-group> (<year>2015</year>). <article-title>Attentional selection in a cocktail party environment can be decoded from single-trial EEG.</article-title> <source><italic>Cereb. Cortex</italic></source> <volume>25</volume> <fpage>1697</fpage>&#x2013;<lpage>1706</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bht355</pub-id> <pub-id pub-id-type="pmid">24429136</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pion-Tonachini</surname> <given-names>L.</given-names></name> <name><surname>Kreutz-Delgado</surname> <given-names>K.</given-names></name> <name><surname>Makeig</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>ICLabel: an automated electroencephalographic independent component classifier, dataset, and website.</article-title> <source><italic>NeuroImage</italic></source> <volume>198</volume> <fpage>181</fpage>&#x2013;<lpage>197</lpage>.</citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Polich</surname> <given-names>J.</given-names></name> <name><surname>Ehlers</surname> <given-names>C. L.</given-names></name> <name><surname>Otis</surname> <given-names>S.</given-names></name> <name><surname>Mandell</surname> <given-names>A. J.</given-names></name> <name><surname>Bloom</surname> <given-names>F. E.</given-names></name></person-group> (<year>1986</year>). <article-title>P300 latency reflects the degree of cognitive decline in dementing illness.</article-title> <source><italic>Electroencephalogr. Clin. Neurophysiol.</italic></source> <volume>63</volume> <fpage>138</fpage>&#x2013;<lpage>144</lpage>. <pub-id pub-id-type="doi">10.1016/0013-4694(86)90007-6</pub-id> <pub-id pub-id-type="pmid">2417814</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Seibold</surname> <given-names>J. C.</given-names></name> <name><surname>Iden</surname> <given-names>S.</given-names></name> <name><surname>Oberem</surname> <given-names>J.</given-names></name> <name><surname>Fels</surname> <given-names>J.</given-names></name> <name><surname>Koch</surname> <given-names>I.</given-names></name></person-group> (<year>2018</year>). <article-title>Intentional preparation of auditory attention-switches: Explicit cueing and sequential switch-predictability.</article-title> <source><italic>Quart. J. Exp. Psychol.</italic></source> <volume>71</volume> <fpage>1382</fpage>&#x2013;<lpage>1395</lpage>. <pub-id pub-id-type="doi">10.1080/17470218.2017.1344867</pub-id> <pub-id pub-id-type="pmid">28631530</pub-id></citation></ref>
<ref id="B48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shamma</surname> <given-names>S. A.</given-names></name> <name><surname>Micheyl</surname> <given-names>C.</given-names></name></person-group> (<year>2010</year>). <article-title>Behind the scenes of auditory perception.</article-title> <source><italic>Curr. Opin. Neurobiol.</italic></source> <volume>20</volume> <fpage>361</fpage>&#x2013;<lpage>366</lpage>. <pub-id pub-id-type="doi">10.1016/j.conb.2010.03.009</pub-id> <pub-id pub-id-type="pmid">20456940</pub-id></citation></ref>
<ref id="B49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shuai</surname> <given-names>L.</given-names></name> <name><surname>Elhilali</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>Task-dependent neural representations of salient events in dynamic auditory scenes.</article-title> <source><italic>Front. Neurosci.</italic></source> <volume>8</volume>:<issue>203</issue>. <pub-id pub-id-type="doi">10.3389/fnins.2014.00203</pub-id> <pub-id pub-id-type="pmid">25100934</pub-id></citation></ref>
<ref id="B50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Somervail</surname> <given-names>R.</given-names></name> <name><surname>Zhang</surname> <given-names>F.</given-names></name> <name><surname>Novembre</surname> <given-names>G.</given-names></name> <name><surname>Bufacchi</surname> <given-names>R. J.</given-names></name> <name><surname>Guo</surname> <given-names>Y.</given-names></name> <name><surname>Crepaldi</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>Waves of change: brain sensitivity to differential, not absolute, stimulus intensity is conserved across humans and rats.</article-title> <source><italic>Cereb. Cortex</italic></source> <volume>31</volume> <fpage>949</fpage>&#x2013;<lpage>960</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bhaa267</pub-id> <pub-id pub-id-type="pmid">33026425</pub-id></citation></ref>
<ref id="B51"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Szab&#x00F3;</surname> <given-names>B. T.</given-names></name> <name><surname>Denham</surname> <given-names>S. L.</given-names></name> <name><surname>Winkler</surname> <given-names>I.</given-names></name></person-group> (<year>2016</year>). <article-title>Computational models of auditory scene analysis: a review.</article-title> <source><italic>Front. Neurosci.</italic></source> <volume>10</volume>:<issue>524</issue>. <pub-id pub-id-type="doi">10.3389/fnins.2016.00524</pub-id> <pub-id pub-id-type="pmid">27895552</pub-id></citation></ref>
<ref id="B52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Teoh</surname> <given-names>E. S.</given-names></name> <name><surname>Lalor</surname> <given-names>E. C.</given-names></name></person-group> (<year>2019</year>). <article-title>EEG decoding of the target speaker in a cocktail party scenario: Considerations regarding dynamic switching of talker location.</article-title> <source><italic>J. Neural Eng.</italic></source> <volume>16</volume>:<issue>036017</issue>. <pub-id pub-id-type="doi">10.1088/1741-2552/ab0cf1</pub-id> <pub-id pub-id-type="pmid">30836345</pub-id></citation></ref>
<ref id="B53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tse</surname> <given-names>P. U.</given-names></name> <name><surname>Intriligator</surname> <given-names>J.</given-names></name> <name><surname>Rivest</surname> <given-names>J.</given-names></name> <name><surname>Cavanagh</surname> <given-names>P.</given-names></name></person-group> (<year>2004</year>). <article-title>Attention and the subjective expansion of time.</article-title> <source><italic>Percept. Psychophys.</italic></source> <volume>66</volume> <fpage>1171</fpage>&#x2013;<lpage>1189</lpage>. <pub-id pub-id-type="doi">10.3758/bf03196844</pub-id> <pub-id pub-id-type="pmid">15751474</pub-id></citation></ref>
<ref id="B54"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>van Canneyt</surname> <given-names>J.</given-names></name> <name><surname>Wouters</surname> <given-names>J.</given-names></name> <name><surname>Francart</surname> <given-names>T.</given-names></name></person-group> (<year>2021</year>). <article-title>Neural tracking of the fundamental frequency of the voice: The effect of voice characteristics.</article-title> <source><italic>Eur. J. Neurosci.</italic></source> <volume>53</volume> <fpage>3640</fpage>&#x2013;<lpage>3653</lpage>. <pub-id pub-id-type="doi">10.1111/ejn.15229</pub-id> <pub-id pub-id-type="pmid">33861480</pub-id></citation></ref>
<ref id="B55"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vestergaard</surname> <given-names>M. D.</given-names></name> <name><surname>Fyson</surname> <given-names>N. R.</given-names></name> <name><surname>Patterson</surname> <given-names>R. D.</given-names></name></person-group> (<year>2011</year>). <article-title>The mutual roles of temporal glimpsing and vocal characteristics in cocktail-party listening.</article-title> <source><italic>J. Acoust. Soc. Am.</italic></source> <volume>130</volume> <fpage>429</fpage>&#x2013;<lpage>439</lpage>. <pub-id pub-id-type="doi">10.1121/1.3596462</pub-id> <pub-id pub-id-type="pmid">21786910</pub-id></citation></ref>
<ref id="B56"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Wu</surname> <given-names>E. X.</given-names></name> <name><surname>Chen</surname> <given-names>F.</given-names></name></person-group> (<year>2021</year>). <article-title>EEG-based auditory attention decoding using speech-level-based segmented computational models.</article-title> <source><italic>J. Neural Eng.</italic></source> <volume>18</volume>:<issue>046066</issue>. <pub-id pub-id-type="doi">10.1088/1741-2552/abfeba</pub-id> <pub-id pub-id-type="pmid">33957606</pub-id></citation></ref>
<ref id="B57"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>E. X.</given-names></name> <name><surname>Chen</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>Cortical auditory responses index the contributions of different RMS-level-dependent segments to speech intelligibility.</article-title> <source><italic>Hear. Res.</italic></source> <volume>383</volume>:<issue>107808</issue>. <pub-id pub-id-type="doi">10.1016/j.heares.2019.107808</pub-id> <pub-id pub-id-type="pmid">31606583</pub-id></citation></ref>
<ref id="B58"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Wu</surname> <given-names>E. X.</given-names></name> <name><surname>Chen</surname> <given-names>F.</given-names></name></person-group> (<year>2020a</year>). &#x201C;<article-title>Contribution of RMS-level-based speech segments to target speech decoding under noisy conditions</article-title>,&#x201D; in <source><italic>Proc. of 21st Annual Conference of the International Speech Communication Association (InterSpeech).</italic></source> <publisher-loc>Shenzhen</publisher-loc>.</citation></ref>
<ref id="B59"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Wu</surname> <given-names>E. X.</given-names></name> <name><surname>Chen</surname> <given-names>F.</given-names></name></person-group> (<year>2020b</year>). <article-title>Robust EEG-based decoding of auditory attention with high-RMS-level speech segments in noisy conditions.</article-title> <source><italic>Front. Human Neurosci.</italic></source> <volume>14</volume>:<issue>557534</issue>. <pub-id pub-id-type="doi">10.3389/fnhum.2020.557534</pub-id> <pub-id pub-id-type="pmid">33132874</pub-id></citation></ref>
<ref id="B60"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Winkler</surname> <given-names>I.</given-names></name> <name><surname>Denham</surname> <given-names>S.</given-names></name> <name><surname>Nelken</surname> <given-names>I.</given-names></name></person-group> (<year>2009</year>). <article-title>Modeling the auditory scene: predictive regularity representations and perceptual objects.</article-title> <source><italic>Trends Cogn. Sci.</italic></source> <volume>13</volume> <fpage>532</fpage>&#x2013;<lpage>540</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2009.09.003</pub-id> <pub-id pub-id-type="pmid">19828357</pub-id></citation></ref>
<ref id="B61"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wolpaw</surname> <given-names>J. R.</given-names></name> <name><surname>Ramoser</surname> <given-names>H.</given-names></name></person-group> (<year>1998</year>). <article-title>EEG-based communication: improved accuracy by response verification.</article-title> <source><italic>IEEE Trans. Rehab. Eng.</italic></source> <volume>6</volume> <fpage>326</fpage>&#x2013;<lpage>333</lpage>. <pub-id pub-id-type="doi">10.1109/86.712231</pub-id> <pub-id pub-id-type="pmid">9749910</pub-id></citation></ref>
<ref id="B62"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zoefel</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>Speech entrainment: rhythmic predictions carried by neural oscillations.</article-title> <source><italic>Curr. Biol.</italic></source> <volume>28</volume> <fpage>1102</fpage>&#x2013;<lpage>1104</lpage>. <pub-id pub-id-type="doi">10.1016/j.cub.2018.07.048</pub-id> <pub-id pub-id-type="pmid">30253150</pub-id></citation></ref>
</ref-list>
</back>
</article>