<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2022.855753</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Modeling the Repetition-Based Recovering of Acoustic and Visual Sources With Dendritic Neurons</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Dellaferrera</surname> <given-names>Giorgia</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<xref ref-type="author-notes" rid="fn002"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1628684/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Asabuki</surname> <given-names>Toshitake</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Fukai</surname> <given-names>Tomoki</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/22174/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Neural Coding and Brain Computing Unit, Okinawa Institute of Science and Technology</institution>, <addr-line>Okinawa</addr-line>, <country>Japan</country></aff>
<aff id="aff2"><sup>2</sup><institution>Institute of Neuroinformatics, University of Zurich and Swiss Federal Institute of Technology Zurich (ETH)</institution>, <addr-line>Zurich</addr-line>, <country>Switzerland</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Emre O. Neftci, University of California, Irvine, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Lyes Khacef, University of Groningen, Netherlands; Dylan Richard Muir, University of Basel, Switzerland; Tom Tetzlaff, Helmholtz Association of German Research Centres (HZ), Germany</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Giorgia Dellaferrera <email>gde&#x00040;zurich.ibm.com</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience</p></fn>
<fn fn-type="present-address" id="fn002"><p>&#x02020;Present address: Giorgia Dellaferrera, Department of Ophthalmology, Children&#x00027;s Hospital, Harvard Medical School, Boston, MA, United States</p></fn></author-notes>
<pub-date pub-type="epub">
<day>28</day>
<month>04</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>855753</elocation-id>
<history>
<date date-type="received">
<day>15</day>
<month>01</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>31</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Dellaferrera, Asabuki and Fukai.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Dellaferrera, Asabuki and Fukai</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>In natural auditory environments, acoustic signals originate from the temporal superimposition of different sound sources. The problem of inferring individual sources from ambiguous mixtures of sounds is known as blind source decomposition. Experiments on humans have demonstrated that the auditory system can identify sound sources as repeating patterns embedded in the acoustic input. Source repetition produces temporal regularities that can be detected and used for segregation. Specifically, listeners can identify sounds occurring more than once across different mixtures, but not sounds heard only in a single mixture. However, whether such a behavior can be computationally modeled has not yet been explored. Here, we propose a biologically inspired computational model to perform blind source separation on sequences of mixtures of acoustic stimuli. Our method relies on a somatodendritic neuron model trained with a Hebbian-like learning rule which was originally conceived to detect spatio-temporal patterns recurring in synaptic inputs. We show that the segregation capabilities of our model are reminiscent of the features of human performance in a variety of experimental settings involving synthesized sounds with naturalistic properties. Furthermore, we extend the study to investigate the properties of segregation on task settings not yet explored with human subjects, namely natural sounds and images. Overall, our work suggests that somatodendritic neuron models offer a promising neuro-inspired learning strategy to account for the characteristics of the brain segregation capabilities as well as to make predictions on yet untested experimental settings.</p></abstract>
<kwd-group>
<kwd>dendritic neurons</kwd>
<kwd>spiking neural networks</kwd>
<kwd>blind source separation</kwd>
<kwd>sound source repetition</kwd>
<kwd>spatio-temporal structure</kwd>
</kwd-group>
<contract-sponsor id="cn001">Japan Society for the Promotion of Science<named-content content-type="fundref-id">10.13039/501100001691</named-content></contract-sponsor>
<counts>
<fig-count count="10"/>
<table-count count="0"/>
<equation-count count="5"/>
<ref-count count="66"/>
<page-count count="18"/>
<word-count count="13562"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Hearing a sound of specific interest in a noisy environment is a fundamental ability of the brain that is necessary for auditory scene analysis. To achieve this, the brain has to unambiguously separate the target auditory signal from other distractor signals. In this vein, a famous example is the &#x0201C;cocktail party effect&#x0201D; (Cherry, <xref ref-type="bibr" rid="B13">1953</xref>), i.e., the ability to distinguish a particular speaker&#x00027;s voice against a multi-talker background (Brown et al., <xref ref-type="bibr" rid="B12">2001</xref>; Mesgarani and Chang, <xref ref-type="bibr" rid="B43">2012</xref>). Many psychophysical and neurobiological studies have been conducted to clarify the psychophysical properties and underlying mechanisms of the segregation of mixed signals (Asari et al., <xref ref-type="bibr" rid="B5">2006</xref>; Bee and Micheyl, <xref ref-type="bibr" rid="B9">2008</xref>; Narayan et al., <xref ref-type="bibr" rid="B46">2008</xref>; McDermott, <xref ref-type="bibr" rid="B40">2009</xref>; McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>; Schmidt and R&#x000F6;mer, <xref ref-type="bibr" rid="B55">2011</xref>; Lewald and Getzmann, <xref ref-type="bibr" rid="B36">2015</xref>; Li et al., <xref ref-type="bibr" rid="B37">2017</xref>; Atilgan et al., <xref ref-type="bibr" rid="B6">2018</xref>), and computational theories and models have also been proposed for this computation (Amari et al., <xref ref-type="bibr" rid="B3">1995</xref>; Bell and Sejnowski, <xref ref-type="bibr" rid="B10">1995</xref>; Sagi et al., <xref ref-type="bibr" rid="B52">2001</xref>; Haykin and Chen, <xref ref-type="bibr" rid="B25">2005</xref>; Elhilali and Shamma, <xref ref-type="bibr" rid="B19">2009</xref>; Thakur et al., <xref ref-type="bibr" rid="B60">2015</xref>; Dong et al., <xref ref-type="bibr" rid="B17">2016</xref>; Kameoka et al., <xref ref-type="bibr" rid="B29">2018</xref>; Karamatli et al., <xref ref-type="bibr" rid="B30">2018</xref>; Sawada et 
al., <xref ref-type="bibr" rid="B54">2019</xref>). However, how the brain attains its remarkable sound segregation remains elusive. Various properties of auditory cues, such as spatial cues in binaural listening (Ding and Simon, <xref ref-type="bibr" rid="B16">2012</xref>) and the temporal coherence of sound stimuli (Teki et al., <xref ref-type="bibr" rid="B59">2013</xref>; Krishnan et al., <xref ref-type="bibr" rid="B33">2014</xref>), are known to facilitate the listener&#x00027;s ability to segregate a particular sound from the background. Auditory signals that reach the ears first undergo frequency-spectrum analysis by the cochlea (Oxenham, <xref ref-type="bibr" rid="B48">2018</xref>). Simultaneous initiation and termination of the component signals and the harmonic structure of the frequency spectra help the brain identify the components of the target sound (Popham et al., <xref ref-type="bibr" rid="B51">2018</xref>). Prior knowledge about the target sound, such as its familiarity to listeners (Elhilali, <xref ref-type="bibr" rid="B18">2013</xref>; Woods and McDermott, <xref ref-type="bibr" rid="B64">2018</xref>), and top-down attention can also improve the listener&#x00027;s ability to detect the sound (Kerlin et al., <xref ref-type="bibr" rid="B31">2010</xref>; Xiang et al., <xref ref-type="bibr" rid="B65">2010</xref>; Ahveninen et al., <xref ref-type="bibr" rid="B1">2011</xref>; Golumbic et al., <xref ref-type="bibr" rid="B23">2013</xref>; O&#x00027;Sullivan et al., <xref ref-type="bibr" rid="B47">2014</xref>; Bronkhorst, <xref ref-type="bibr" rid="B11">2015</xref>). Selective attention combining the auditory (sound) and visual (lip movements, visual cues) modalities has also been suggested to help solve the cocktail party problem (Yu, <xref ref-type="bibr" rid="B66">2020</xref>; Liu et al., <xref ref-type="bibr" rid="B38">2021</xref>). However, many of these cues are subsidiary and not strictly required for hearing the target sound. 
For example, a sound mixture can be separated through monaural hearing (Hawley et al., <xref ref-type="bibr" rid="B24">2004</xref>) or without spatial cues (Middlebrooks and Waters, <xref ref-type="bibr" rid="B44">2020</xref>). Therefore, the crucial mechanisms of sound segregation remain to be explored.</p>
<p>Whether or not biological auditory systems segregate a sound based on principles similar to those invented for artificial systems remains unclear (Bee and Micheyl, <xref ref-type="bibr" rid="B9">2008</xref>; McDermott, <xref ref-type="bibr" rid="B40">2009</xref>). Among such principles, independent component analysis (ICA) (Comon, <xref ref-type="bibr" rid="B15">1994</xref>) and its variants are the conventional mathematical tools used for solving the sound segregation problem, or more generally, the blind source decomposition problem (Amari et al., <xref ref-type="bibr" rid="B3">1995</xref>; Bell and Sejnowski, <xref ref-type="bibr" rid="B10">1995</xref>; Hyv&#x000E4;rinen and Oja, <xref ref-type="bibr" rid="B26">1997</xref>; Haykin and Chen, <xref ref-type="bibr" rid="B25">2005</xref>). Owing to its linear algebraic features, the conventional ICA requires as many input channels (e.g., microphones) as the number of signal sources, which does not appear to be a requirement for sound segregation in biological systems. In this context, however, recent works on single-channel source separation based on techniques such as Non-Negative Matrix Factorization (NNMF) have demonstrated that ICA can be applied with fewer channels than sources (Krause-Solberg and Iske, <xref ref-type="bibr" rid="B32">2015</xref>; Mika et al., <xref ref-type="bibr" rid="B45">2020</xref>). In addition, NNMF has been shown to extract regular spatio-temporal patterns within audio signals and to achieve good performance in applications such as music processing (Smaragdis and Brown, <xref ref-type="bibr" rid="B57">2003</xref>; Cichocki et al., <xref ref-type="bibr" rid="B14">2006</xref>; Santosh and Bharathi, <xref ref-type="bibr" rid="B53">2017</xref>; L&#x000F3;pez-Serrano et al., <xref ref-type="bibr" rid="B39">2019</xref>). 
An alternative possibility is that human listeners detect latent recurring patterns in the spectro-temporal structure of sound mixtures to separate individual sound sources (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). This was indicated by the finding that listeners could identify a target sound when it was repeated across different mixtures in combination with various other sounds, but could not do so when the sound was presented in a single mixture.</p>
<p>This finding provides an important clue about the computational principles of sound source separation in biological systems. Here, we demonstrate that a computational model implementing a pattern-detection mechanism accounts for the characteristic features of human performance observed in various task settings. To this end, we constructed a simplified model of biological auditory systems by using a two-compartment neuron model recently proposed for learning regularly or irregularly repeated patterns in input spike trains (Asabuki and Fukai, <xref ref-type="bibr" rid="B4">2020</xref>). Importantly, this learning occurs in an unsupervised fashion based on the principle of minimizing regularized information loss, showing that the essential computation of sound source segregation can emerge at the single-neuron level without teaching signals. Furthermore, it was previously suggested that a similar repetition-based learning mechanism may also work for the segregation of visual objects (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). To put this suggestion on firm computational ground, we extended our framework to make predictions on visual images.</p>
</sec>
<sec sec-type="results" id="s2">
<title>2. Results</title>
<sec>
<title>2.1. Learning of Repeated Input Patterns by a Two-Compartment Neuron Model</title>
<p>We used a two-compartment spiking neuron model which learns recurring temporal features in synaptic input, as proposed in Asabuki and Fukai (<xref ref-type="bibr" rid="B4">2020</xref>). In short, the dendritic compartment attempts to predict the response of the soma to a given synaptic input. To this end, the neuron model minimizes information loss within a recent period when the somatic activity is replaced with its model generated by the dendrite. Mathematically, the learning rule minimizes the Kullback&#x02013;Leibler (KL) divergence between the probability distributions of somatic and dendritic activities. The dendritic membrane potential of a two-compartment neuron obeys <inline-formula><mml:math id="M1"><mml:mi>v</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:math></inline-formula> where <italic>w</italic><sub><italic>j</italic></sub> and <italic>e</italic><sub><italic>j</italic></sub> stand for the synaptic weight and the unit postsynaptic potential of the <italic>j</italic>-th presynaptic input, respectively. The somatic activity evolves as</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M2"><mml:mrow><mml:mover accent='true'><mml:mi>u</mml:mi><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>&#x003C4;</mml:mi></mml:mfrac><mml:mi>u</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mi>D</mml:mi></mml:msub><mml:mo stretchy='false'>[</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>u</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>k</mml:mi></mml:munder><mml:mrow><mml:msub><mml:mi>G</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:mstyle><mml:msup><mml:mi>&#x003D5;</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>/</mml:mo><mml:msub><mml:mi>&#x003D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where the last term describes lateral inhibition with modifiable synaptic weights <italic>G</italic><sub><italic>k</italic></sub> (&#x02265;0), as shown later. The soma generates a Poisson spike train with the instantaneous firing rate &#x003D5;<sup><italic>som</italic></sup>(<italic>u</italic>(<italic>t</italic>)), where <inline-formula><mml:math id="M3"><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:math></inline-formula> and the parameters &#x003B2; and &#x003B8; are modified in an activity-dependent manner in terms of the mean and variance of the membrane potential over a sufficiently long period <italic>t</italic><sub>0</sub>. 
To extract the repeated patterns from temporal input, the model compresses the high dimensional data carried by the input sequence onto a low dimensional manifold of neural dynamics. This is performed by modifying the weights of dendritic synapses to minimize the time-averaged mismatch between the somatic and dendritic activities over a certain interval [0,T]. In a stationary state, the somatic membrane potential <italic>u</italic><sub><italic>i</italic></sub>(<italic>t</italic>) can be described as an attenuated version <inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> of the dendritic membrane potential. At each time point, we compare the attenuated dendritic membrane potential with the somatic membrane potential, on the level of the two Poissonian spike distributions with rates <inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>u</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M6"><mml:mi>&#x003D5;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, respectively, which would be generated if both soma and dendrite were able to emit spikes independently. In practice, the neuron model minimizes the following cost function for synaptic weights <italic>w</italic>, which represents the averaged KL-divergence between somatic activity and dendritic activity, and in which we explicitly represent the dependency of <italic>u</italic><sub><italic>i</italic></sub> and <inline-formula><mml:math id="M7"><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> on X:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M8"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mi>E</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:mrow><mml:msub><mml:mo>&#x0222B;</mml:mo><mml:mrow><mml:msub><mml:mi>&#x003A9;</mml:mi><mml:mi>X</mml:mi></mml:msub></mml:mrow></mml:msub><mml:mi>d</mml:mi></mml:mrow></mml:mstyle><mml:mi>X</mml:mi><mml:msup><mml:mi>P</mml:mi><mml:mo>*</mml:mo></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mstyle displaystyle='true'><mml:mrow><mml:msubsup><mml:mo>&#x0222B;</mml:mo><mml:mn>0</mml:mn><mml:mi>T</mml:mi></mml:msubsup><mml:mi>d</mml:mi></mml:mrow></mml:mstyle><mml:mi>t</mml:mi><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>i</mml:mi></mml:munder><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>K</mml:mi><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:mo stretchy='false'>[</mml:mo><mml:msubsup><mml:mi>&#x003D5;</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>u</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy="false">&#x0007C;&#x0007C;</mml:mo><mml:msup><mml:mi>&#x003D5;</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo 
stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mo>*</mml:mo></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo>;</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>X</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>with <italic>P</italic><sup>&#x0002A;</sup>(<bold>X</bold>) and &#x003A9;<sub><italic>X</italic></sub> being the true distribution of input spike trains and the entire space spanned by them, and <inline-formula><mml:math id="M9"><mml:mrow><mml:msup><mml:mi>&#x003D5;</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x003D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:msup><mml:mi>e</mml:mi><mml:mrow><mml:msub><mml:mi>&#x003B2;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>x</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B8;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>. To search for the optimal weight matrix, the cost function <italic>E</italic>(<italic>w</italic>) is minimized through gradient descent: &#x00394;<italic>w</italic><sub><italic>ij</italic></sub>&#x0221D;&#x02212;&#x02202;<italic>E</italic>/&#x02202;<italic>w</italic><sub><italic>ij</italic></sub>. Introducing the regularization term &#x02212;&#x003B3;<bold>w</bold><sub><italic>i</italic></sub> and a noise component &#x003BE;<sub><italic>i</italic></sub> with its intensity <italic>g</italic> gives the following learning rule (for the derivation see Asabuki and Fukai, <xref ref-type="bibr" rid="B4">2020</xref>):</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M10"><mml:mrow><mml:msub><mml:mrow><mml:mover><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mo>.</mml:mo></mml:mover></mml:mrow><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x0007B;</mml:mo><mml:mi>&#x003C8;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mo>*</mml:mo></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:mo>&#x0007B;</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>&#x003D5;</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mi>g</mml:mi><mml:msub><mml:mi>&#x003BE;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msup><mml:mi>&#x003D5;</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mo>*</mml:mo></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x0007D;</mml:mo><mml:mo>/</mml:mo><mml:msub><mml:mi>&#x003D5;</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy='false'>]</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>e</mml:mi></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>t</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:msub><mml:mstyle mathvariant='bold' 
mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x0007D;</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>
<p>where <bold>w</bold><sub><italic>i</italic></sub> &#x0003D; [<italic>w</italic><sub><italic>i</italic>1</sub>, ..., <italic>w</italic><sub><italic>iN</italic><sub><italic>in</italic></sub></sub>], <bold>e</bold>(<italic>t</italic>) &#x0003D; [<italic>e</italic><sub>1</sub>, ..., <italic>e</italic><sub><italic>N</italic><sub><italic>in</italic></sub></sub>], &#x003BE;<sub><italic>i</italic></sub> obeys a normal distribution, <inline-formula><mml:math id="M11"><mml:mi>&#x003C8;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:mfrac><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, &#x003D5;<sup><italic>som</italic></sup> and &#x003D5;<sup><italic>dend</italic></sup> follow Poisson distributions, &#x003B7; is the learning rate, and</p>
<disp-formula id="E4"><mml:math id="M12"><mml:mrow><mml:mi>f</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mn>0</mml:mn></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>x</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mtext>0</mml:mtext><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mi>x</mml:mi></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:msub><mml:mi>&#x003D5;</mml:mi><mml:mtext>0</mml:mtext></mml:msub><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:msub><mml:mi>&#x003D5;</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>x</mml:mi><mml:mo>&#x02265;</mml:mo><mml:msub><mml:mi>&#x003D5;</mml:mi><mml:mtext>0</mml:mtext></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>Finally, if a pair of presynaptic and postsynaptic spikes occur at the times <italic>t</italic><sub><italic>pre</italic></sub> and <italic>t</italic><sub><italic>post</italic></sub>, respectively, lateral inhibitory connections between two-compartment neurons <italic>i</italic> and <italic>j</italic> are modified through a symmetric anti-Hebbian STDP as</p>
<disp-formula id="E5"><label>(4)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo>&#x00394;</mml:mo><mml:msub><mml:mi>G</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x003C4;</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>t</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mi>&#x003C4;</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
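As a minimal numerical sketch of the symmetric anti-Hebbian STDP window in Equation (4): the window depends only on the absolute spike-time difference, consistent with the symmetry stated in the text; the amplitudes C_p, C_d and time constants tau_p, tau_d below are illustrative placeholders, not the values used in this study.

```python
import numpy as np

def stdp_update(t_pre, t_post, C_p=1.0, C_d=0.5, tau_p=20.0, tau_d=40.0):
    """Weight change of a lateral inhibitory connection G_ij for one
    pre/post spike pair (Equation 4). Only |t_pre - t_post| enters the
    window, so the rule is symmetric in the two spike times."""
    dt = abs(t_pre - t_post)
    return C_p * np.exp(-dt / tau_p) - C_d * np.exp(-dt / tau_d)
```

With C_p &gt; C_d and tau_p &lt; tau_d, near-coincident spikes strengthen the inhibitory weight while widely separated pairs weaken it, which drives output neurons toward decorrelated, feature-selective responses.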
<p>See Section 4 and <xref ref-type="supplementary-material" rid="SM1">Supplementary Note</xref> for additional details. The prediction is learnable when input spike sequences from presynaptic neurons are non-random and contain recurring temporal patterns. In such a case, the minimization of information loss induces a consistency check between the dendrite and soma, eventually forcing both compartments to respond selectively to one of the patterns. Mathematically, the somatic response serves as a teaching signal that supervises synaptic learning in the dendrite. Biologically, backpropagating action potentials may provide the supervising signal (Larkum et al., <xref ref-type="bibr" rid="B35">1999</xref>; Larkum, <xref ref-type="bibr" rid="B34">2013</xref>).</p>
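As a concrete illustration, the plasticity kernel of the equation above can be sketched in Python. Since the text states the rule is symmetric, we use the absolute spike-time difference; all constants here are illustrative rather than the values used in the simulations.

```python
import numpy as np

def stdp_update(t_pre, t_post, C_p=1.0, C_d=0.5, tau_p=20.0, tau_d=40.0):
    """Symmetric anti-Hebbian STDP kernel: the change of the lateral
    inhibitory conductance depends only on the magnitude of the
    spike-time difference (illustrative constants, in ms)."""
    dt = abs(t_pre - t_post)  # symmetry: the sign of the interval is ignored
    return C_p * np.exp(-dt / tau_p) - C_d * np.exp(-dt / tau_d)
```

By construction, swapping the presynaptic and postsynaptic spike times leaves the update unchanged, and the update decays toward zero for large spike-time differences.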
<p>We constructed an artificial neural network based on the somatodendritic consistency check model and trained it to perform the task of repetition-based source recovery. The network consisted of two layers of neurons. The input layer encoded the spectrogram of acoustic stimuli into spike trains of Poisson neurons. For each sound, the spike train was generated over a sequence of 400 time steps, where each time step corresponds to a &#x0201C;fire&#x0201D; or &#x0201C;non-fire&#x0201D; event. The output layer was a competitive network of the two-compartment models that received synaptic input from the input layer and learned recurring patterns in the input (<xref ref-type="fig" rid="F1">Figure 1</xref>). We designed the output layer and the learning process similarly to the network used previously (Asabuki and Fukai, <xref ref-type="bibr" rid="B4">2020</xref>) for blind signal separation (BSS) of mixtures of multiple mutually correlated signals. In particular, lateral inhibitory connections between the output neurons underwent spike-timing-dependent plasticity to self-organize an array of feature-selective output neurons (Section 4). In the spike encoding stage, the spectrogram is flattened into a one-dimensional array where the intensity of each element is proportional to the Poisson firing probability of the associated input neuron. This operation disconnects the signal&#x00027;s temporal features from the temporal dynamics of the neurons. Although this signal manipulation is not biologically plausible and introduces additional latency, as the whole sample needs to be buffered, it allows the input layer to encode all the time points of the audio signal simultaneously. Thanks to this strategy, the length of the input spike trains does not depend on the duration of the audio signal, and a sufficiently large population of input neurons can encode arbitrarily long sounds, possibly with some redundancy in the encoding for short sounds. 
We remark that, while the somatodendritic mismatch learning rule was conceived to capture temporal information in an online fashion, in our framework it is applied to a flattened spectrogram, that is, to a static pattern. Furthermore, to relate the signal intensity to the encoding firing rate, we normalized the spectrogram values to the interval [0,1]. This strategy is suited to our aim of reproducing the experiments with synthetic sounds and custom naturalistic stimuli. In a real-world application, however, any instantaneous outlier in signal intensity would suppress the other temporal features of the input signal. Nonetheless, the normalization is performed independently for each mixture, so if the outlier affects a masker sound rather than a target, and the target is presented in at least two other mixtures, we expect the normalization not to impair the network&#x00027;s ability to identify sounds presented in different mixtures.</p>
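The encoding stage described above (normalization to [0,1], flattening, and Poisson spike generation over 400 time steps) can be sketched as follows; the function name and the toy spectrogram are illustrative, not taken from the paper's code.

```python
import numpy as np

def encode_spectrogram(spec, n_steps=400, seed=None):
    """Flatten a spectrogram and encode it as Poisson spike trains.

    Each spectrogram cell becomes one input neuron whose per-time-step
    firing probability equals its normalized intensity."""
    rng = np.random.default_rng(seed)
    flat = spec.astype(float).ravel()
    # Normalize intensities to [0, 1] so they act as firing probabilities.
    flat = (flat - flat.min()) / (flat.max() - flat.min() + 1e-12)
    # One Bernoulli ("fire"/"non-fire") draw per neuron per time step.
    return rng.random((n_steps, flat.size)) < flat

spec = np.abs(np.random.default_rng(0).normal(size=(32, 16)))  # toy spectrogram
spikes = encode_spectrogram(spec, n_steps=400, seed=1)
```

Note that the number of input neurons equals the number of spectrogram cells, while the spike-train length (400 steps) is fixed regardless of the sound's duration, as discussed above.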
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Network architecture. The input signal is pre-processed into a two-dimensional image (i.e., the spectrogram) with values normalized in the range [0,1]. The image is flattened into a one-dimensional array where the intensity of each element is proportional to the Poisson firing probability of the associated input neuron. The neurons in the input layer are connected to those in the output layer through either full connectivity or random connectivity with connection probability <italic>p</italic> = 0.3. The output neurons are trained following the artificial dendritic neuron learning scheme (Asabuki and Fukai, <xref ref-type="bibr" rid="B4">2020</xref>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0001.tif"/>
</fig>
</sec>
<sec>
<title>2.2. Synthesized and Natural Auditory Stimuli</title>
<p>We examined whether the results of our computational model are consistent with the outcomes of the experiments on human listeners with artificially synthesized sounds described previously (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). To provide a meaningful comparison with the human responses, we adopted simulation settings as close as possible to those of the experiments, both in terms of dataset generation and performance evaluation (Section 4). In McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>), synthetic sounds are generated by first measuring the correlations between pairs of spectrogram cells of natural sounds (spoken words and animal vocalizations). Such correlations are then averaged across different pairs to obtain temporal correlation functions. The correlation functions, in turn, are used to generate covariance matrices, in which each element is the covariance between two spectrogram cells. Finally, spectrograms are drawn from the resulting Gaussian distribution and applied to samples of white noise, leading to the synthesis of novel sounds. In our experiments we synthesized the sounds using the toolbox provided at <ext-link ext-link-type="uri" xlink:href="https://mcdermottlab.mit.edu/downloads.html">https://mcdermottlab.mit.edu/downloads.html</ext-link>. In the human experiments, a dataset of novel sounds was generated so that listeners&#x00027; performance in sound source segregation was not influenced by familiarity with previously experienced sounds. To closely reproduce the experiment, we created a database of synthesized sounds according to the same method as described in McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>) (Section 4). The synthesized stimuli retained similarity to real-world sounds except that they lacked grouping cues related to temporal onsets and harmonic spectral structures. 
Furthermore, unlike human listeners, our neural network was built and trained from scratch and had no previous knowledge of natural sounds that could bias the task execution. We exploited this advantage to investigate whether and how the sound segregation performance was affected by the presence of grouping cues in real sounds. To this end, we also built a database composed of natural sounds (Section 4).</p>
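The Gaussian sampling step of the synthesis procedure can be sketched schematically. Here an exponential correlation function stands in for the empirically measured correlations of natural sounds; the actual stimuli were generated with the McDermott lab toolbox cited above, so everything in this snippet is an illustrative stand-in.

```python
import numpy as np

def synthesize_sound(n_cells=64, corr_scale=5.0, seed=None):
    """Draw a synthetic (flattened) spectrogram from a zero-mean Gaussian
    whose covariance captures correlations between spectrogram cells.

    The exponential kernel below is an illustrative substitute for the
    correlation functions measured from natural sounds."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_cells)
    # Covariance decays with the distance between spectrogram cells.
    cov = np.exp(-np.abs(idx[:, None] - idx[None, :]) / corr_scale)
    return rng.multivariate_normal(np.zeros(n_cells), cov)

synth = synthesize_sound(seed=0)
```

In the full pipeline the sampled spectrogram is then applied to white noise to produce an audible novel sound.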
<p>To build the sequence of input stimuli, we randomly chose a set of sounds from the database of synthesized or natural sounds and generated various mixtures by superimposing them&#x02014;i.e., we summed the spectrograms of the original sounds element-wise and then normalized the sum to the interval [0,1]. We refer to the main sound, which is always part of mixtures, as the <italic>target</italic>, and to all the other sounds, which were either presented as mixing sounds with the target (i.e., masker sounds) or presented alone, as <italic>distractors</italic>. The target sound is shown in red in the training protocols. Following the protocol in McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>), we concatenated the mixtures of target and distractors into input sequences. For certain experiments, we also included unmixed distractor sounds. We presented the network with the input sequence for a fixed number of repetitions. As each input signal&#x02014;both unmixed sounds and mixtures&#x02014;is flattened into one input vector, each input signal is one element of the input sequence. During the input presentation, the network&#x00027;s parameters evolved following the learning rule described in Asabuki and Fukai (<xref ref-type="bibr" rid="B4">2020</xref>). Then, we examined the ability of the trained network to identify the target sound by using probe sounds, which were either the target or a distractor sound composing the mixtures presented during training (<italic>correct probe</italic>) or a different sound (<italic>incorrect probe</italic>). Incorrect probes for synthesized target sounds were generated similarly to the procedure described in McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>). 
Specifically, we synthesized the incorrect probe using the same covariance structure as the target sound, and then set a randomly selected time slice of the incorrect probe (1/8 of the sound&#x00027;s duration) to be equal to a time slice of the target of the same duration. Examples of target sounds, distractor sounds and incorrect probes are shown in <xref ref-type="fig" rid="F2">Figures 2A&#x02013;C</xref>, respectively. A further beneficial aspect of our model is the possibility of freezing plasticity during the inference stage, so that the synaptic connections do not change during the probe presentation. This allows us to investigate whether the trained network can identify not only the target but also the masker sounds.</p>
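The mixture construction (element-wise sum, then normalization to [0,1]) and the incorrect-probe construction (copying a random 1/8 time slice of the target) can be sketched as follows; the (frequency, time) array layout and all names are our illustrative assumptions.

```python
import numpy as np

def mix_and_normalize(spec_a, spec_b):
    """Superimpose two sounds: element-wise sum of the spectrograms,
    then rescale the result to [0, 1]."""
    mix = spec_a + spec_b
    return (mix - mix.min()) / (mix.max() - mix.min() + 1e-12)

def make_incorrect_probe(target, candidate, seed=None):
    """Build an incorrect probe: take a sound drawn from the same
    distribution as the target, then overwrite a random time slice
    (1/8 of the duration) with the corresponding slice of the target."""
    rng = np.random.default_rng(seed)
    probe = candidate.copy()
    n_t = target.shape[1]              # assumed layout: (frequency, time)
    width = max(1, n_t // 8)
    start = int(rng.integers(0, n_t - width + 1))
    probe[:, start:start + width] = target[:, start:start + width]
    return probe

a = np.random.default_rng(0).random((16, 40))  # toy target
b = np.random.default_rng(1).random((16, 40))  # toy masker / candidate probe
mix = mix_and_normalize(a, b)
probe = make_incorrect_probe(a, b, seed=2)
```

The copied slice corresponds to the vertical stripe visible in Figure 2C.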
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Synthesized sounds&#x02014;target and associated distractor. <bold>(A)</bold> Spectrogram of one target sound. <bold>(B)</bold> Step 1 to build the spectrogram of an incorrect probe related to the target in <bold>(A)</bold>: a sound is randomly selected from the same Gaussian distribution generating the target. <bold>(C)</bold> Step 2 to build the incorrect probe: after the sampling, a randomly selected time slice equal to 1/8 of the sound duration is set to be equal to the target. In the figure, the temporal slice is the vertical stripe around time 0.5 s.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0002.tif"/>
</fig>
</sec>
<sec>
<title>2.3. Learning of Mixture Sounds in the Network Model</title>
<p>Our network model contained various hyperparameters, such as the number of output neurons, the number of mixtures, and the connectivity pattern. A grid search was performed to find the best combination of hyperparameters. <xref ref-type="fig" rid="F3">Figures 3A,B</xref> report the learning curves obtained on synthesized and natural sounds, respectively, for random initial weights and different combinations of hyperparameters. For both types of sounds, synaptic weights changed rapidly in the initial phase of learning. The changes were somewhat faster for synthesized sounds than for natural sounds, but the learning curves behaved similarly for both sound types. The number of output neurons had little effect on the learning curves, whereas different connectivity patterns and different numbers of mixtures produced distinct behaviors. Because familiarity with sounds enhances auditory perception in humans (Jacobsen et al., <xref ref-type="bibr" rid="B28">2005</xref>), we investigated whether pretraining with a sequence containing target and distractors improves learning in our model, for various lengths of pretraining. Neither the training speed nor the final accuracy was significantly improved by the pretraining (<xref ref-type="fig" rid="F3">Figures 3C&#x02013;E</xref>). This suggests that the model was &#x0201C;forgetting&#x0201D; the pretraining stage and learning the mixture sounds from scratch, not exploiting any familiarity with previously seen sounds. We suspect that this behavior is related to the well-known limitation of ANNs, namely the lack of continual learning (French, <xref ref-type="bibr" rid="B20">1999</xref>), rather than to a specific feature of our model. Furthermore, we cannot compare the learning curves of the model and the psychophysical data, since the model was trained for multiple epochs, while the human listeners were presented with the training sequence only once and then tested on the probe immediately after.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Learning curves. <bold>(A)</bold> Average synaptic weight change for the experiments carried out on the synthesized sounds, the network being initialized with random values. <bold>(B)</bold> Average synaptic weight change for the experiments carried out on the natural sounds, the network being initialized with random values. <bold>(C)</bold> Average synaptic weight change for the experiments carried out on the synthesized sounds, the network being pretrained on the target set presented for 100 epochs. <bold>(D)</bold> Average synaptic weight change for the experiments carried out on the synthesized sounds, the network being pretrained on the target set presented for 200 epochs. <bold>(E)</bold> Average synaptic weight change for the experiments carried out on the synthesized sounds, the network being pretrained on the target set presented for 300 epochs. The solid line and the shaded area represent the mean and standard deviation over 3 independent runs, respectively. Without pretraining, varying the number of output neurons yields no significant change, while with pretraining the weight change curve saturates at a lower value when a larger number of neurons is used, as shown by the blue (<italic>N</italic> = 4) and green (<italic>N</italic> = 12) curves. Furthermore, the figures show that the slope of the learning curve is steeper both when a larger number of training mixtures is presented (yellow curves) and when only 30% of the connections are kept (red curves). The weight change is computed by storing the weight values every 2,000 time steps (i.e., &#x0201C;fire&#x0201D; or &#x0201C;non-fire&#x0201D; events) and computing the standard deviation over the last 100 recorded values. The standard deviation is then averaged across all connections from input to output neurons. Therefore, each point on the curve reports the average weight change over the past 2,000 &#x000D7; 100 time steps. 
Note that each sound/mixture is presented for 400 time steps. Finally, the x-axis shows the number of repetitions of the training mixture sequence (2,000 for synthetic sounds and 1,500 for naturalistic sounds).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0003.tif"/>
</fig>
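The weight-change metric described in the caption of Figure 3 (snapshots of the input-to-output weights taken every 2,000 time steps, per-connection standard deviation over the last 100 snapshots, averaged across connections) can be sketched as follows; variable and function names are ours.

```python
import numpy as np

def average_weight_change(weight_history, window=100):
    """Weight-change metric: given weight snapshots of shape
    (n_snapshots, n_connections), take the standard deviation of each
    connection over the last `window` snapshots, then average across
    all input-to-output connections."""
    recent = np.asarray(weight_history)[-window:]
    return recent.std(axis=0).mean()

# Constant weights give zero change; fluctuating weights give a positive value.
flat_hist = np.ones((150, 8))
noisy_hist = np.random.default_rng(0).normal(size=(150, 8))
```

A learning curve that saturates thus indicates that the weights have stopped fluctuating between snapshots.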
<p>To reliably compare the performance of our model with human listeners, we designed an assessment strategy similar to that adopted in the experiment. In McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>), listeners were presented with mixtures of sounds followed by a probe, which could be either a correct probe (i.e., the target sound present in the training mixtures) or an incorrect probe (i.e., a sound unseen during training). The subjects had to say whether they believed the probe was present in the training mixture by using one of the four responses &#x0201C;sure no,&#x0201D; &#x0201C;no,&#x0201D; &#x0201C;yes,&#x0201D; and &#x0201C;sure yes.&#x0201D; The responses were used to build a receiver operating characteristic (ROC) curve as described in Wickens (<xref ref-type="bibr" rid="B63">2002</xref>), and the area under the curve (AUC) was used as the performance measure, with AUC = 0.5 and 1 corresponding to chance and perfect performance, respectively. In our algorithm, we mimicked this reporting protocol by using the likelihood as a measure of performance. To this end, first, for each tested probe, we projected the response of the N output neurons (<xref ref-type="fig" rid="F4">Figures 4A,D</xref>) onto a two-dimensional PCA plane. We defined the PCA space based on the response to the correct probes and later projected onto it the datapoints related to the incorrect probes (<xref ref-type="fig" rid="F4">Figures 4B,E</xref>). We remark that other clustering approaches, such as K-means and self-organizing maps, could be used instead of PCA without reducing the output dimension. Second, we clustered the datapoints related to the correct probes through a Gaussian Mixture Model (GMM) with as many classes as the number of correct probes (<xref ref-type="fig" rid="F4">Figures 4C,F</xref>). Third, for each datapoint we computed the likelihood that it belonged to one of the clusters. 
The target labels are set to 1 and 0 for datapoints related to correct and incorrect probes, respectively. We highlight that the labels introduced in this post-processing phase are not specific to each sound, but rather depend on the role of the sound in the tasks: if sound X is presented during training as a target or masker sound, it is associated with label 1, while if, in another simulation, the same sound X is used to build an incorrect probe (not used during training), then it is associated with label 0. We binned the likelihood range into four intervals corresponding, in ascending order, to the four responses &#x0201C;sure no,&#x0201D; &#x0201C;no,&#x0201D; &#x0201C;yes,&#x0201D; and &#x0201C;sure yes.&#x0201D; Finally, based on the four responses, we built the receiver operating characteristic (ROC) curve: the datapoints falling in the interval (i) <italic>L</italic>&#x0003E;0 (sure yes) were assigned the probability value <italic>p</italic> = 1.0, those in (ii) &#x02212;5 &#x0003C; <italic>L</italic> &#x0003C; 0 (yes) <italic>p</italic> = 0.66, those in (iii) &#x02212;15 &#x0003C; <italic>L</italic> &#x0003C; &#x02212;5 (no) <italic>p</italic> = 0.33, and those in (iv) <italic>L</italic> &#x0003C; &#x02212;15 (sure no) <italic>p</italic> = 0.0. The AUC of the ROC is used as the &#x0201C;accuracy&#x0201D; metric to evaluate the performance of the model. For additional details see Section 4. We are now ready to examine the performance of the model in a series of experiments. We show examples of the different behavior of the network trained on a single mixture (<xref ref-type="fig" rid="F4">Figures 4A&#x02013;C</xref>) or four mixtures (<xref ref-type="fig" rid="F4">Figures 4D&#x02013;F</xref>). As expected, the ability of the model to learn and distinguish the targets from the distractors depended crucially on the number of mixtures.</p>
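The evaluation pipeline above (PCA fitted on the correct-probe responses, GMM clustering, and binning of log-likelihoods into the four behavioral responses) can be sketched with scikit-learn. The bin edges (0, &#x02212;5, &#x02212;15) follow the text; the toy responses and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def probe_probabilities(correct_resp, incorrect_resp, n_correct_probes):
    """Sketch of the evaluation: fit a 2D PCA on the responses to the
    correct probes, fit a GMM with one component per correct probe, and
    bin per-point log-likelihoods into the four responses."""
    pca = PCA(n_components=2).fit(correct_resp)
    gmm = GaussianMixture(n_components=n_correct_probes, random_state=0)
    gmm.fit(pca.transform(correct_resp))

    def to_prob(resp):
        L = gmm.score_samples(pca.transform(resp))  # log-likelihood per point
        # "sure yes" -> 1.0, "yes" -> 0.66, "no" -> 0.33, "sure no" -> 0.0
        return np.select([L > 0, L > -5, L > -15], [1.0, 0.66, 0.33], 0.0)

    return to_prob(correct_resp), to_prob(incorrect_resp)

rng = np.random.default_rng(0)
correct = rng.normal(size=(20, 8))              # toy output-neuron responses
incorrect = rng.normal(loc=5.0, size=(20, 8))   # toy out-of-cluster responses
p_corr, p_inc = probe_probabilities(correct, incorrect, n_correct_probes=2)
```

The resulting four-valued probabilities would then feed the ROC/AUC computation described above.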
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Experiment 1&#x02014;output dynamics and clustering. <bold>(A&#x02013;C)</bold> refer to the results of Experiment 1 on synthesized sounds with a single mixture presented during training. <bold>(D&#x02013;F)</bold> refer to the results of Experiment 1 on synthesized sounds with four mixtures presented during training. The &#x0201C;correct probes&#x0201D; are the target and the distractor sounds composing the mixtures presented during training, while the &#x0201C;incorrect probes&#x0201D; are sounds not presented during training. The numbers in the legends indicate the sound IDs. <bold>(A)</bold> Voltage dynamics of the 8 output neurons during inference, when the target, the distractor and the two associated incorrect probes are tested. The neuron population is not able to respond with different dynamics to the four sounds, and the voltage of all the output neurons fluctuates randomly throughout the whole testing sequence. <bold>(B)</bold> The PCA projection of the datapoints belonging to the two targets (in blue) shows that the clusters are collapsed into a single cluster. <bold>(C)</bold> When GMM is applied, all the datapoints representing both the correct probes (in blue) and the incorrect probes (in orange and red) fall within the same regions, making it impossible to distinguish the different sounds based on the population dynamics. <bold>(D)</bold> Voltage dynamics of the 8 output neurons during inference, when the four targets and the associated distractors are tested. As expected, the neuron population has learnt the features of the different sounds and responds with different dynamics to the eight sounds. Each output neuron exhibits an enhanced response to one or a few sounds. <bold>(E)</bold> The PCA projection of the datapoints belonging to the four correct probes (in blue) shows that the clusters are compact and spatially distant from one another. 
<bold>(F)</bold> When GMM is applied, the model shows that the network is, most of the time, able to distinguish the target and distractors (in blue) from the incorrect probes (in yellow, orange and red). The correct probes never overlap. Three of the four distractors fall far from the targets&#x00027; region, while the fourth (in yellow) overlaps with one of the targets. These results are overall consistent with the human performance. In <bold>(C,F)</bold>, the contour lines represent the landscape of the log-likelihood that a point belongs to one of the clusters associated with the correct probes.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0004.tif"/>
</fig>
<p>The algorithm was implemented in Python and a sample code used to simulate Experiment 1 is available at the repository <ext-link ext-link-type="uri" xlink:href="https://github.com/GiorgiaD/dendritic-neuron-BSS">https://github.com/GiorgiaD/dendritic-neuron-BSS</ext-link>.</p>
</sec>
<sec>
<title>2.4. Experiment 1: Sound Segregation With Single and Multiple Mixtures of Synthesized Sounds</title>
<p>To begin with, we compared how the number of mixtures influences the learning performance between human subjects and the model. The number of mixtures presented during training was varied from 1, where no learning was expected, to 2 or more, where the model was expected to distinguish the target sounds from their respective distractors. The simulation protocol is shown in <xref ref-type="fig" rid="F5">Figure 5A</xref> (bottom). As reported in <xref ref-type="fig" rid="F5">Figure 5A</xref> (top), we found that, when only one mixture was shown, neither the target nor the mixing sound was learnt, and performance was close to chance. An immediate boost in the performance was observed when the number of mixtures was raised to two. The network managed to distinguish the learnt targets from the incorrect probes with an accuracy greater than 90%. As the number of mixtures increased up to six, the accuracy worsened slightly, remaining above 80%. A significant drop in the performance was observed for a greater number of mixtures. From a comparison with the results shown in <xref ref-type="fig" rid="F5">Figure 5B</xref>, which were replicated for human subjects (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>), it emerged that our model was able to partially reproduce human performance: the success rate was at chance level when training consisted of a single mixture only, while the target sounds could be distinguished with a certain accuracy when more than one mixture was learnt. We also verified that the model performance was robust to variations of the network architecture, both in terms of the number of output neurons <italic>N</italic> and the connection probability <italic>p</italic> (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 1</xref>). 
Furthermore, we observed that, while none of the output neurons exhibited an enhanced firing rate when presented with the target sound, the overall population response to the target was substantially different from the response to the masker sounds and to the incorrect probes.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Experiments 1 and 1 a.c.&#x02014;results and comparison with human performance. <bold>(A)</bold> Results and schematics for Experiment 1 on the dendritic network model. The number of mixtures is varied from 1 to 10. Performance is close to chance for a single training mixture. The performance is boosted as two mixtures are presented. As the number of mixtures is further increased, the clustering accuracy slowly decreases toward chance values. The protocol shown at the bottom of the panel illustrates that (i) in the training phase we feed the network only with the mixture(s), i.e., target &#x0002B; masker sound(s); (ii) in the inference phase we feed the network only with the unmixed sounds (target and distractors separately) and with the incorrect probes (also unmixed sounds). We remark that in the case of one mixture (condition 1) the target and the masker sounds play the same role, while in the case of multiple mixtures (conditions 2 and 3) the target plays a different role in the protocol, as it is present in more than one mixture while the masker sounds are presented in one mixture only in the training sequence. <bold>(B)</bold> Results and schematics for Experiment 1 on the human experiment. The numbers of mixtures presented are 1, 2, 3, 5, and 10. For a single mixture the performance is close to chance. As the number of mixtures increases, the classification accuracy improves steadily. Figure reproduced based on data acquired by McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>). <bold>(C)</bold> Results and schematics for Experiment 1 a.c. on the dendritic network model. The number of mixtures is varied from 2 to 5. Combining all the mixing sounds in mixtures slightly improves the mean performance for two mixing sounds, while it slightly worsens it for a larger number of mixtures. The height of the bars and the error bars show, respectively, the mean and standard deviation of the AUC over 10 independent runs.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0005.tif"/>
</fig>
<p>Our model and human subjects also exhibited interesting differences. When the mixture number was increased to two, performance improved greatly in our model but only modestly in human subjects. Unlike human subjects, our model showed a decreasing accuracy as the number of mixtures further increased. We consider that such discrepancies may arise from a capacity limitation of the network. Indeed, the network architecture is very simple and consists of only two layers, whose sizes are limited by the spectrogram dimensions for the input layer and by the number of output neurons for the last layer. Therefore, the amount of information that the network can learn and store is limited compared with the significantly more complex structure of the human auditory system. We also suspect that the two-dimensional PCA projection might limit the model performance when a large number of distractors is used. Indeed, the PCA space becomes very crowded, and although the datapoints are grouped in distinct clusters, the probability that such clusters lie close to each other is high. To verify this hypothesis, we tested a modification of the inference protocol of the algorithm. During testing, we presented the network only with the target sound and one incorrect probe, and performed BSS on the PCA space containing the two sounds. Under this configuration, the model performance is above chance level for two or more different mixtures, and the accuracy does not significantly decrease for a large number of mixtures (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 2</xref>).</p>
<p>We may use our model to predict the performance of human subjects in auditory perception tasks not yet tested experimentally. To this end, we propose an extension of the paradigm tested previously: for set-ups with between two and five mixtures, we investigated whether presenting all possible combinations of the mixing sounds among themselves, rather than only the distractors with the target, affects the performance. The experiment is labeled &#x0201C;Experiment 1 a.c.,&#x0201D; where a.c. stands for &#x0201C;all combinations,&#x0201D; and its training scheme is reported in <xref ref-type="fig" rid="F5">Figure 5C</xref>. Because all sounds are in principle learnable in the new paradigm, we expected an enhanced ability to distinguish the correct probes from the incorrect ones. Somewhat unexpectedly, however, our model indicated no drastic changes in the performance when the mixture sequence presented during training contained all possible combinations of the mixing sounds. Such a scheme resulted in a minor improvement in the accuracy only for the experiments with two mixing sounds. Indeed, in the &#x0201C;all combinations&#x0201D; protocol, the distractor was presented during training in more than one mixture, while in the original task setting only the target was combined with different sounds. We hypothesize that the &#x0201C;all combinations&#x0201D; protocol makes it easier for the network to distinguish the distractor sound. For four or five mixing sounds, instead, the performance slightly worsened. It is likely that this behavior is related to the already mentioned capacity constraints of the network. 
Indeed, the length of the training sequence grows as the binomial coefficient <inline-formula><mml:math id="M14"><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mtable><mml:mtr><mml:mtd><mml:mi>n</mml:mi></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>k</mml:mi></mml:mtd></mml:mtr></mml:mtable><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> where <italic>k</italic> = 2, therefore for four and five targets (i.e., for <italic>n</italic> = 4 or 5) the number of mixtures is increased to 6 and 10, respectively.</p>
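The growth of the training sequence in the &#x0201C;all combinations&#x0201D; protocol follows directly from the binomial coefficient and can be checked in a couple of lines:

```python
from math import comb

# In the "all combinations" protocol, every pair of the n mixing sounds
# is presented, so the training sequence contains C(n, 2) mixtures.
counts = {n: comb(n, 2) for n in (2, 3, 4, 5)}
print(counts)  # {2: 1, 3: 3, 4: 6, 5: 10}
```

For n = 4 and n = 5 this gives 6 and 10 mixtures, matching the counts quoted above.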
</sec>
<sec>
<title>2.5. Experiment 2: Sound Segregation With Alternating Multiple Mixtures of Synthesized Sounds</title>
<p>Next, we investigated the model&#x00027;s performance when the training sequence alternated mixtures of sounds with isolated sounds. An analogous protocol was tested in a psychophysical experiment (see experiment 3 in McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). <xref ref-type="fig" rid="F6">Figures 6A,B</xref> show the network accuracy and human performance, respectively, for the protocols A, B, and C in <xref ref-type="fig" rid="F6">Figure 6C</xref>. Only the target and the masker sounds were later tested, since recognizing the sounds presented individually during training would have been trivial (see conditions B, 1, and 2 in <xref ref-type="fig" rid="F6">Figure 6C</xref>). In the alternating task, the network was only partially able to reproduce the human results, displaying an interesting contrast to human behavior. In condition A, in which the sounds mixed with the main target (in red) changed during training, the listeners were able to learn the targets with an accuracy of about 80%, and so did our model. In contrast, our network behaved radically differently from humans under condition B, in which the training sequence consisted of the same mixture alternating with different sounds. As reported in <xref ref-type="fig" rid="F6">Figure 6B</xref>, the listeners were generally not able to identify the single sounds composing the mixture. Our model, instead, unexpectedly achieved a performance well above chance: the output dynamics could distinguish the distractors from the two targets with an accuracy surprisingly above 90%. The behavioral discrepancy under condition B could be explained by considering that in this training scheme the network is presented with three different sounds besides the mixture. Compared with Experiment 1 with a single mixture, in this protocol the network could learn the supplementary features of the isolated sounds and exploit them during inference to respond differently to the distractors. 
From the spectrograms shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, it is evident that some regions of overlap exist between the higher-intensity areas of different sounds. Therefore, the network, presented during training with isolated sounds in addition to the single mixture, could detect some similarities between the training sounds and the tested distractors and respond with more defined output dynamics than in Experiment 1. Finally, under condition C, both human subjects and our model performed above chance. While human performance was slightly above 60%, the network achieved more than 90% accuracy. This result should be interpreted considering that during inference the isolated sound (in blue) was also tested together with the associated distractor, which was a trivial task given the nature of our network and thus boosted its overall performance.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Experiments 2 and 3&#x02014;results and comparison with human performance. <bold>(A)</bold> Results for Experiments 2 (dark blue) and 3 (light blue) on the dendritic network model. In Experiment 2 the performance is above chance for all three conditions. In Experiment 3 the accuracy decreases as the number of isolated sounds alternating with the mixtures increases. <bold>(B)</bold> Results for Experiments 2 (dark blue) and 3 (light blue) in the human experiment. In Experiment 2 the performance is above chance in conditions A and C, while it is at chance for condition B. In Experiment 3 the accuracy decreases as the target presentation is more delayed. Figure reproduced based on data acquired by McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>). <bold>(C)</bold> Schematics for Experiments 2 and 3. The training is the same for both the dendritic network model and the human experiment. The schematic is omitted for delays 3 and 5. The testing refers to the dendritic network model; the testing for the human experiment (same as in <xref ref-type="fig" rid="F5">Figure 5B</xref>) is omitted. In <bold>(A,B)</bold>, the bar heights and the error bars show the mean and standard deviation, respectively, of the AUC over 10 independent runs.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0006.tif"/>
</fig>
</sec>
<sec>
<title>2.6. Experiment 3: Effect of Temporal Delay in Target Presentation With Synthesized Sounds</title>
<p>Temporal delay in the presentation of mixtures containing the target degraded performance similarly in the model and in human subjects. We presented the network with a training sequence of six mixtures containing the same target mixed each time with a different distractor (<xref ref-type="fig" rid="F6">Figure 6C</xref>, protocols 0, 1, 2; cf. experiment 4 in McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). The mixtures alternated with an increasing number of isolated sounds, hence increasing the interval between successive presentations of the target. The human ability to extract single sounds from mixtures was previously shown to worsen as the interval between target presentations increased, as replicated in <xref ref-type="fig" rid="F6">Figure 6B</xref>. The network presented a similar decreasing trend, as reported in <xref ref-type="fig" rid="F6">Figure 6A</xref>. An interesting difference, however, is that the performance of our model dropped drastically even with one isolated sound every other mixture, while the human performance was affected only when at least two isolated sounds separated the target-containing mixtures. The discrepant behavior indicates that the insertion of isolated sounds between the target-containing mixtures interferes with the learning of the target sound more strongly in the model than in human subjects. This stronger performance degradation may partly be due to the capacity constraint of our simple neural model, which uses a larger amount of memory resources as the number of isolated sounds increases. Such a constraint may be less tight in the human auditory system.</p>
<p>For Experiments 2 and 3 as well, we tested a modified inference protocol in which the network was presented with only the target sound and one incorrect probe. Under this configuration, the model performance in Experiment 2 improves compared to the original protocol, while no substantial changes are observed for Experiment 3 (<xref ref-type="supplementary-material" rid="SM1">Supplementary Figure 3</xref>).</p>
</sec>
<sec>
<title>2.7. Experiment 4: Sound Segregation With Single and Multiple Mixtures of Real-World Sounds</title>
<p>We applied the same protocol as in Experiment 1 to the dataset of natural sounds. Although such experiments had not previously been attempted on human subjects, it is intriguing to investigate whether the model can segregate natural target sounds by the same strategy. The spectrograms of two isolated sounds and of their mixture are shown in <xref ref-type="fig" rid="F7">Figures 7A&#x02013;C</xref>, together with the respective sound waves (<xref ref-type="fig" rid="F7">Figures 7D&#x02013;F</xref>). The qualitative performance was very similar to that obtained with the synthesized sounds. Specifically, the output dynamics learned from the repetition of a single mixture fluctuated randomly for both seen and randomly chosen unseen sounds (<xref ref-type="fig" rid="F8">Figure 8A</xref>), whereas the network responses to targets and unseen sounds were clearly distinct when multiple mixtures were presented during training (<xref ref-type="fig" rid="F8">Figure 8D</xref>). The output dynamics were not quantitatively evaluated because it was not possible to rigorously generate incorrect probes associated with the learnt targets and distractors. Therefore, we qualitatively assessed the performance of the model by observing the clustering of network responses to the learnt targets vs. unseen natural sounds (<xref ref-type="fig" rid="F8">Figures 8B&#x02013;F</xref>). We observed that, in the case of multiple mixtures, the clusters related to natural sounds (<xref ref-type="fig" rid="F8">Figures 8E,F</xref>) were more compact than those of synthetic sounds (<xref ref-type="fig" rid="F4">Figures 4E,F</xref>). Furthermore, these clusters were more widely spaced on the PCA projection plane: the intraclass correlation in the responses to the same target was greater, while the interclass similarity in the responses to different targets or distractors was lower. 
These results indicate that grouping cues, such as harmonic structure and temporal onset, improve the performance of the model.</p>
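The qualitative clustering read-out described above can be sketched in Python. The snippet below is an illustrative reconstruction with toy response data, not the paper's actual pipeline: the response statistics, trial counts, and all parameter values are our assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Toy stand-ins for the output-population responses: 50 inference trials
# for a learnt target and 50 for an unseen sound, 8 output neurons each.
rng = np.random.default_rng(0)
resp_target = rng.normal(loc=2.0, scale=1.0, size=(50, 8))
resp_unseen = rng.normal(loc=-2.0, scale=1.0, size=(50, 8))
responses = np.vstack([resp_target, resp_unseen])

# Project the responses onto the first two principal components
# (the PCA projection plane discussed in the text).
projected = PCA(n_components=2).fit_transform(responses)

# Fit a Gaussian mixture model and read off the cluster assignments;
# compact, well-separated clusters indicate distinguishable responses.
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(projected)
```

With well-separated responses, the two GMM components recover the target vs. unseen split; with collapsed responses, as in the single-mixture case, the assignments become arbitrary.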
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Real-world sounds&#x02014;targets and mixture. <bold>(A)</bold> Spectrogram of an 800 ms-long spoken sentence. <bold>(B)</bold> Spectrogram of an 800 ms-long recording of chime sounds. <bold>(C)</bold> Spectrogram of the mixture of the sounds in <bold>(A,B)</bold>. <bold>(D)</bold> Sound wave associated with the spectrogram in <bold>(A)</bold>. <bold>(E)</bold> Sound wave associated with the spectrogram in <bold>(B)</bold>. <bold>(F)</bold> Sound wave associated with the spectrogram in <bold>(C)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0007.tif"/>
</fig>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Experiment 4&#x02014;output dynamics and clustering. <bold>(A&#x02013;C)</bold> Results of Experiment 4 on real-world sounds with a single mixture presented during training. <bold>(D&#x02013;F)</bold> Results of Experiment 4 on real-world sounds with three mixtures presented during training. <bold>(A)</bold> Voltage dynamics of the 8 output neurons during inference, when the target, the distractor, and one unseen sound are tested. As expected, the neuron population is not able to respond with different dynamics to the three sounds, and the voltage of all the output neurons fluctuates randomly throughout the whole testing sequence. <bold>(B)</bold> The PCA projection of the datapoints belonging to the target and distractor (in blue) shows that the clusters are collapsed into a single cluster. <bold>(C)</bold> When GMM is applied, all the datapoints representing both the learnt sounds (in blue) and the unseen sound (in orange) fall within the same regions, making it impossible to distinguish the different sounds based on the population dynamics. <bold>(D)</bold> Voltage dynamics of the 8 output neurons during inference, when the target, the three distractors, and one unseen sound are tested. As expected, the neuron population has learnt the features of the different sounds and responds with different dynamics to the five sounds. Each output neuron has an enhanced response to one or a few sounds. <bold>(E)</bold> The PCA projection of the datapoints belonging to the four correct probes (in blue) shows that the clusters are more compact and more spatially separated from one another than those obtained with the synthesized sounds. <bold>(F)</bold> When GMM is applied, the model shows that the network clearly distinguished the learnt sounds (in blue) from the unseen sound (in orange). These results show that the grouping cues improve the model accuracy relative to the synthesized dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0008.tif"/>
</fig>
</sec>
<sec>
<title>2.8. Experiment 5: Image Segregation With Single and Multiple Mixtures of Real-World Images</title>
<p>Finally, we examined whether the source segregation through repetition scheme can also extend to vision-related tasks, as previously suggested (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). To this end, we employed the same method as developed for sound sources and performed the recovery of visual sources with the protocol of Experiment 1. The mixtures were obtained by overlapping black-and-white images sampled from our visual dataset (Section 4), as shown in <xref ref-type="fig" rid="F9">Figure 9</xref>. Similarly to Experiment 4, the performance of the model was assessed only qualitatively in the visual tasks. As in the acoustic tasks, the clustering of network responses showed that the model was able to retrieve the single images only when more than one mixture was presented during training. The network responses are shown in <xref ref-type="fig" rid="F10">Figure 10</xref>. We remark that the model is presented with the visual stimuli following the same computational steps as for sounds. Indeed, as previously described, the acoustic stimuli are first pre-processed into spectrograms and then encoded by the input layer. While it is not unexpected that similar computational steps lead to consistent results, we remark that the nature of the &#x0201C;audio images,&#x0201D; i.e., the spectrograms, is substantially different from that of the naturalistic images, leading to very different distributions of the encoding spike patterns. Therefore, successful signal discrimination in the visual task strengthens our results, demonstrating that our model is robust to different arrangements of signal intensity.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Real-world images&#x02014;targets and mixture. <bold>(A)</bold> Square 128 &#x000D7; 128 target image of a zebra. <bold>(B)</bold> Square 128 &#x000D7; 128 distractor image of a butterfly. <bold>(C)</bold> Mixture of the target and distractor images shown in <bold>(A,B)</bold>. Source: Shutterstock.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0009.tif"/>
</fig>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>Experiment 5&#x02014;output dynamics and clustering. <bold>(A&#x02013;C)</bold> Results of Experiment 5 on real-world images with a single mixture presented during training. <bold>(D&#x02013;F)</bold> Results of Experiment 5 on real-world images with three mixtures presented during training. <bold>(A)</bold> Voltage dynamics of the 5 output neurons during inference, when the two training images and one unseen image are tested. As expected, the neuron population is not able to respond with different dynamics to the three images, and the voltage of all the output neurons fluctuates randomly throughout the whole testing sequence. <bold>(B)</bold> The PCA projection of the datapoints belonging to the two seen images (in blue) shows that the clusters are collapsed into a single cluster. <bold>(C)</bold> When GMM is applied, all the datapoints representing both the targets (in blue) and the unseen image (in orange) fall within the same regions, making it impossible to distinguish the different images based on the population dynamics. <bold>(D)</bold> Voltage dynamics of the 5 output neurons during inference, when the four targets and one unseen image are tested. As expected, the neuron population has learnt the features of the different images and responds with different dynamics to the five images. Each output neuron has an enhanced response to one or a few inputs. <bold>(E)</bold> The PCA projection of the datapoints belonging to the four learnt images (in blue) shows that the clusters are compact and spatially separated from one another. <bold>(F)</bold> When GMM is applied, the model shows that the network clearly distinguished the target and distractors (in blue) from the unseen image (in orange). These results suggest that humans would be able to distinguish single visual targets previously seen in different mixtures.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-855753-g0010.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="discussion" id="s3">
<title>3. Discussion</title>
<p>The recovery of individual sound sources from mixtures of multiple sounds is a central challenge of hearing. Based on experiments on human listeners, sound segregation has been postulated to arise from prior knowledge of sound characteristics or detection of repeating spectro-temporal structure. The results of McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>) show that a sound source can be recovered from a sequence of mixtures if it occurs more than once and is mixed with more than one masker sound. This supports the hypothesis that the auditory system detects repeating spectro-temporal structure embedded in mixtures, and interprets this structure as a sound source. We investigated whether a biologically inspired computational model of the auditory system can account for the characteristic performance of human subjects. To this end, we implemented a one-layer neural network with dendritic neurons followed by a readout layer based on GMM to classify probe sounds as seen or unseen in the training mixtures. The results in McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>) show that source repetition can be detected by integrating information over time and that the auditory system can perform sound segregation when it is able to recover the target sound&#x00027;s latent structure. Motivated by these findings, we trained our dendritic model with a learning rule that was previously demonstrated to detect and analyze the temporal structure of a stream of signals. In particular, we relied on the learning rule described by Asabuki and Fukai (<xref ref-type="bibr" rid="B4">2020</xref>), which is based on the minimization of regularized information loss. Specifically, such a principle enables the self-supervised learning of recurring temporal features in information streams using a family of competitive networks of somatodendritic neurons. 
However, while the learning rule was designed to capture temporal information in an online fashion, in our framework we flatten the spectrogram before encoding it, making the spike pattern static during the stimulus presentation. The temporal fluctuations are therefore determined solely by the stochastic processes in the rate-encoding step.</p>
<p>We presented the network with temporally overlapping sounds following the same task protocols as described in McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>). First, we carried out the segregation task with the same dataset of synthesized sounds presented to human listeners in McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>). We found that the model was able to segregate sounds only when one of the masker sounds varied, not when both sounds of the mixture were repeated. Our findings closely resemble the experimental results obtained from human listeners across a variety of task settings. Earlier works have proposed biologically inspired networks to perform BSS (Pehlevan et al., <xref ref-type="bibr" rid="B49">2017</xref>; Isomura and Toyoizumi, <xref ref-type="bibr" rid="B27">2019</xref>; Bahroun et al., <xref ref-type="bibr" rid="B7">2021</xref>). However, to our knowledge, this is the first attempt to reproduce the experimental results of recovering sound sources through embedded repetition; for this reason, we could not compare our results with previous work. Additionally, we demonstrated that our network can be a powerful tool for predicting the dynamics of the brain&#x00027;s segregation capabilities under settings difficult to test on humans. In particular, the recovery of natural sounds is expected to be a trivial task for humans given their familiarity with the sounds, whereas our model is built from scratch and has no prior knowledge about natural sounds. We found that the hallmarks of natural sounds make the task easier for the network when the target is mixed with different sounds but, as for the synthetic dataset, the sounds cannot be detected if always presented in the same mixture. Furthermore, we extended the study to investigate BSS of visual stimuli and observed a similar qualitative performance as in the auditory settings. 
This is not surprising from a computational perspective, as the computational steps of the visual experiment are the same as those of the acoustic experiment: there, the sounds are first preprocessed into images (the spectrograms) and then presented to the network in visual form. From the biological point of view, the neural computational primitives used in the visual and the auditory cortex may be similar, as evidenced by anatomical similarity and by developmental experiments in which auditory cortex neurons acquire V1-like receptive fields when visual inputs are redirected there (Sharma et al., <xref ref-type="bibr" rid="B56">2000</xref>; Bahroun et al., <xref ref-type="bibr" rid="B7">2021</xref>). We point out, however, that such a similarity holds only at a high level, as there are substantial differences between visual and auditory processing. For instance, the mechanisms encoding the input signal into spikes rely on different principles: in the retina, the spike of a neuron indicates a change in light in the region of space it represents, while in the cochlea, the firing rate of a neuron represents the amplitude of the frequency it is associated with, like a mechanical FFT. Motivated by these reasons, we suggest extending the source-repetition experiments to vision to verify experimentally whether our computational results provide a correct prediction of the source separation dynamics of the visual system.</p>
<p>Although the dynamics of our model in many respects matches the theory of repetition-based BSS, the proposed scheme presents a few limitations. The major limitation concerns the discrepancy of the results in experiment 2B. In that setting, the model performance is well above chance, although the target sound always occurs in the same mixture. We speculate that, in this task setting, the output neurons learn the temporal structure of the distractor sounds presented outside the mixture and recognize some similarities in the latent structure of the probes. We note that the degree of similarity among distractors is the same as in the psychophysics experiment. This pushes the neurons to respond differently to the correct and incorrect probes, thereby allowing the output classifier to distinguish the sounds. In contrast, we speculate that human auditory perception relies also on the outcome of the later integration of features detected at early processing stages, which would prevent the misperception of sounds based on unimportant latent features. A second limitation of the selected encoding method lies in the difficulty of modeling the experiments relying on the asynchronous overlapping of signals and on reversed probe sounds presented by McDermott et al. (<xref ref-type="bibr" rid="B41">2011</xref>). Indeed, in our approach, because of the flattening of the spectrogram in the encoding phase, each input neuron responds to one specific time frame, and the output neurons are trained uniquely on this configuration. Hence, temporal shifts or reversal operations are not possible. Third, we observed that in Experiment 1, as the number of mixtures increased beyond a certain threshold, the model&#x00027;s accuracy degraded. We speculate that, in such settings, substituting PCA with a clustering algorithm not relying on dimensionality reduction, such as K-means, may help mitigate the issue. 
In addition, an interesting variation of our framework would be to replace the clustering step of the model with another layer of spiking neurons. Fourth, the flattening of the spectrogram in the spike-encoding stage is not biologically plausible and introduces high latency, as the entire input signal needs to be buffered before the encoding starts. This strategy has the advantage of yielding a fixed spike-train length for any sound length, though modifications of the encoding scheme that preserve the signal&#x00027;s temporal structure might be more suitable for applications tailored to real-world devices. Furthermore, an instantaneous identity-coding approach, either from the raw signal or <italic>via</italic> a spectrogram, would not be affected by the previously described issues related to the spectrogram normalization in the presence of outliers in signal intensity. Motivated by these points, in a follow-up work we intend to explore an extension of the presented framework combining time frame-dependent encoding and spike-based post-processing clustering, which would allow us to integrate the model in embedded neuromorphic applications for sound source separation with reduced response latency. In this context, to further lower the temporal latency, as well as to reduce the model&#x00027;s energy consumption on neuromorphic devices, the time-to-first-spike encoding method could be explored as an alternative to the current rate-coding approach.</p>
<p>Furthermore, as previously mentioned, the training scheme of Asabuki and Fukai (<xref ref-type="bibr" rid="B4">2020</xref>) has proven able to learn temporal structures in a variety of tasks. In particular, the model was shown to perform chunking as well as to achieve BSS from mixtures of mutually correlated signals. We underline that our computational model and experiments differ in fundamental ways from the BSS task described by Asabuki and Fukai (<xref ref-type="bibr" rid="B4">2020</xref>). First, the two experiments diverge in their primary scope. The BSS task aims at using the average firing rate of the single neurons responding to sound mixtures to decode the original sounds separately. In our work, instead, sound mixtures are included only in the training sequence and, during inference, only individual sounds are presented to the network. Our goal is to verify from the population activity whether the neurons have effectively learned the sounds and can distinguish them from unseen distractors. Second, in Asabuki and Fukai (<xref ref-type="bibr" rid="B4">2020</xref>) the stimulus was encoded into spike patterns using a single Poisson process with rate proportional to the amplitude of the sound waveform at each time step, disregarding the signal intensity at different frequencies. This method was not suitable for the source segregation through repetition task, where the sound mixtures retain important information on the frequency features of the original sounds at each time frame. Finally, we flatten the audio signal spectrogram before encoding it, unlike in the BSS task described by Asabuki and Fukai (<xref ref-type="bibr" rid="B4">2020</xref>).</p>
<p>In summary, we have shown that a network of dendritic neurons trained in an unsupervised fashion is able to learn the features of overlapping sounds and, once the training is completed, can perform blind source separation if the individual sounds have been presented in different mixtures. These results account for the experimental performance of human listeners tested on the same task setting. Our study has demonstrated that a biologically inspired simple model of the auditory system can capture the intrinsic neural mechanisms underlying the brain&#x00027;s capability of recovering individual sound sources based on repetition protocols. Furthermore, as the adopted learning scheme in our model is local and unsupervised, the network is self-organizing. Therefore, the proposed framework opens up new computational paradigms with properties specifically suited for embedded implementations of audio and speech processing tasks in neuromorphic hardware.</p>
</sec>
<sec sec-type="materials and methods" id="s4">
<title>4. Materials and Methods</title>
<sec>
<title>4.1. Datasets</title>
<p>A dataset of synthesized sounds was created in the form of spectrograms, which show how signal strength evolves over time at various frequencies, according to the method described previously (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). In short, the novel spectrograms were built from Gaussian distributions based on correlation functions analogous to those of real-world sounds. White noise was later applied to the resulting spectrograms. Five Gaussian distributions were employed to generate each of the ten different sounds in <xref ref-type="fig" rid="F5">Figure 5A</xref>. The corresponding spectrograms featured 41 frequency filters equally spaced on an ERB<sub>N</sub> (Equivalent Rectangular Bandwidth, with the subscript N denoting normal hearing) scale (Glasberg and Moore, <xref ref-type="bibr" rid="B22">1990</xref>) spanning 20&#x02013;4,000 Hz, and 33 time frames equally dividing the 700 ms sound length. For our simulations, we used the same MATLAB toolbox and parameters as the previous study (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). For further details on the generative model for sounds, please refer to the SI Materials and Methods therein.</p>
<p>In addition to the dataset of synthesized sounds, we built a database composed of 72 recordings of isolated natural sounds. The database contained 8 recordings of human speech from the EUSTACE (the Edinburgh University Speech Timing Archive and Corpus of English) speech corpus (White and King, <xref ref-type="bibr" rid="B62">2003</xref>), 23 recordings of animal vocalizations from the Animal Sound Archive (Frommolt et al., <xref ref-type="bibr" rid="B21">2006</xref>), 29 recordings of music instruments by Philharmonia Orchestra (Philarmonia Orchestra Instruments, <xref ref-type="bibr" rid="B50">2019</xref>), and 12 sounds produced by inanimate objects from the BBC Sound Effect corpus (BBC, <xref ref-type="bibr" rid="B8">1991</xref>). The sounds were cut into 800 ms extracts. Then the library librosa (McFee et al., <xref ref-type="bibr" rid="B42">2015</xref>) was employed to extract spectrograms with 128 frequency filters spaced following the Mel scale (Stevens et al., <xref ref-type="bibr" rid="B58">1937</xref>) and 10 ms time frames with 50% overlap.</p>
<p>For image source separation, we built a database consisting of 32 black-and-white pictures of various types, including both single objects and landscapes. The images were then cropped to a square and resized to 128 &#x000D7; 128 pixels.</p>
</sec>
<sec>
<title>4.2. Neuron Model</title>
<p>In this study we used the same two-compartment neuron model as that developed previously (Asabuki and Fukai, <xref ref-type="bibr" rid="B4">2020</xref>). The mathematical details are found therein. Here, we only briefly outline the mathematical framework of the neuron model. Our two-compartment model learns temporal features of synaptic input given to the dendritic compartment by minimizing a regularized information loss arising in signal transmission from the dendrite to the soma. In other words, the two-compartment neuron extracts the characteristic features of temporal input by compressing the high dimensional data carried by a temporal sequence of presynaptic inputs to the dendrite onto a low dimensional manifold of neural dynamics. The model performs this temporal feature analysis by modifying the weights of dendritic synapses to minimize the time-averaged mismatch between the somatic and dendritic activities over a certain recent interval. In a stationary state, the somatic membrane potential of the two-compartment model could be described as an attenuated version of the dendritic membrane potential with an attenuation factor (Urbanczik and Senn, <xref ref-type="bibr" rid="B61">2014</xref>). Though we deal with time-dependent stimuli in our model, we compare the attenuated dendritic membrane potential with the somatic membrane potential at each time point. This comparison, however, is not drawn directly on the level of the membrane potentials but on the level of the two non-stationary Poissonian spike distributions with time-varying rates, which would be generated if both soma and dendrite were able to emit spikes independently. In addition, the dynamic range of somatic responses needs to be appropriately rescaled (or regularized) for meaningful comparison. An efficient learning algorithm for this comparison can be derived by minimizing the Kullback&#x02013;Leibler (KL) divergence between the probability distributions of somatic and dendritic activities. 
Note that the resultant learning rule enables unsupervised learning because the somatic response is fed back to the dendrite to train dendritic synapses. Thus, our model proposes the view that backpropagating action potentials from the soma may provide a supervising signal for training dendritic synapses (Larkum et al., <xref ref-type="bibr" rid="B35">1999</xref>; Larkum, <xref ref-type="bibr" rid="B34">2013</xref>).</p>
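To make the learning principle concrete, the following sketch computes the KL divergence between two Poisson distributions (the quantity minimized between somatic and dendritic activities) together with an illustrative gradient-style weight update. This is our simplified reading, not the exact update rule of Asabuki and Fukai (2020); the softplus transfer function, attenuation factor, and learning rate are assumptions.

```python
import numpy as np

def poisson_kl(rate_soma, rate_dend):
    """KL divergence between Poisson distributions with the given rates:
    KL(s || d) = r_s * log(r_s / r_d) - r_s + r_d (zero iff rates match)."""
    return rate_soma * np.log(rate_soma / rate_dend) - rate_soma + rate_dend

def update_weight(w, x, v_soma, v_dend, g=0.7, lr=1e-3):
    """Illustrative plasticity step: nudge the dendritic synapse so that
    the attenuated dendritic rate tracks the somatic rate."""
    f = lambda v: np.log1p(np.exp(v))        # softplus rate function (assumed)
    error = f(v_soma) - f(g * v_dend)        # somatic-dendritic rate mismatch
    return w + lr * error * x                # x: presynaptic activity
```

Minimizing `poisson_kl` over the dendritic weights drives the mismatch term to zero, which is how the somatic feedback acts as a self-supervising signal in this reading.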
</sec>
<sec>
<title>4.3. Network Architecture</title>
<p>The network architecture, shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, consisted of two layers of neurons, either fully connected or with only 30% of the total connections. The input layer contained as many Poisson neurons as the number of pixels in the input spectrogram (acoustic stimulus) or input image (visual stimulus). The postsynaptic neurons were modeled according to the two-compartment neuron model proposed previously (Asabuki and Fukai, <xref ref-type="bibr" rid="B4">2020</xref>). Their number varied from two to a few tens, depending on the complexity of the task. Unless specified otherwise, we used 8 output neurons for the acoustic tasks and 5 for the visual tasks.</p>
<p>In the first layer, the input was encoded into spikes through a rate coding-based method (Almomani et al., <xref ref-type="bibr" rid="B2">2019</xref>). The strength of the signal at each pixel drove the firing rate of the associated input neuron, i.e., the spike trains were drawn from Poisson point processes with probability proportional to the intensity of the pixel. For each input stimulus, the spike pattern was generated over a sequence of 400 time steps, each corresponding to a &#x0201C;fire&#x0201D; or &#x0201C;non-fire&#x0201D; event.</p>
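The rate-coding step can be sketched as follows. This is a minimal reconstruction under our assumptions (function and parameter names are ours); the text specifies only that each pixel's intensity sets the Poisson firing probability of one input neuron over 400 binary time steps.

```python
import numpy as np

def rate_encode(spectrogram, n_steps=400, max_rate=1.0, seed=0):
    """Encode a (freq x time) spectrogram into a binary spike raster:
    one input neuron per pixel, with a firing probability per time step
    proportional to the pixel's intensity."""
    rng = np.random.default_rng(seed)
    intensities = spectrogram.ravel().astype(float)   # flatten: 1 neuron/pixel
    p = max_rate * intensities / intensities.max()    # rates in [0, max_rate]
    # Bernoulli draw at each of the n_steps time steps ("fire"/"non-fire").
    spikes = rng.random((n_steps, p.size)) < p
    return spikes.astype(np.uint8)

# A toy 41 x 33 spectrogram (the synthetic dataset's shape) gives a
# 400 x 1353 spike raster.
spec = np.abs(np.random.default_rng(1).normal(size=(41, 33)))
raster = rate_encode(spec)
```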
<p>We designed the output layer and the learning process similarly to the previous network used for blind signal separation (BSS) within mixtures of multiple mutually correlated signals, as well as for other temporal feature analyses (Asabuki and Fukai, <xref ref-type="bibr" rid="B4">2020</xref>). As mentioned previously, the learning rule was modeled as a self-supervising process, which is conceptually similar to Hebbian learning with backpropagating action potentials. The soma generated a supervising signal to learn and detect the recurring spatiotemporal patterns encoded in the dendritic activity. Within the output layer, single neurons learned to respond differently to each input pattern. Competition among neurons was introduced to ensure that different neurons responded to different inputs. Compared with the network used for BSS, which contained only two output neurons, we rescaled the strength of the mutual inhibition among dendritic neurons by a factor proportional to the inverse of the square root of the number of output neurons. This correction prevented each neuron from being too strongly inhibited when the size of the output layer increased (i.e., exceeded three or four). Furthermore, we adopted the same inhibitory spike timing-dependent plasticity (iSTDP) as employed in the previous model. This rule modified the inhibitory connections between two dendritic neurons when they responded coincidently to a certain input. The iSTDP allowed the formation of chunk-specific cell assemblies when the number of output neurons was greater than the number of input patterns.</p>
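The inhibition rescaling mentioned above can be sketched as follows; the base strength `w0` is an assumed placeholder, and the matrix form is our illustration of the 1/sqrt(n_out) scaling.

```python
import numpy as np

def inhibitory_weights(n_out, w0=1.0):
    """Mutual-inhibition matrix scaled by 1/sqrt(n_out), with no
    self-inhibition on the diagonal (w0 is an assumed base strength)."""
    return -(w0 / np.sqrt(n_out)) * (np.ones((n_out, n_out)) - np.eye(n_out))

# Total inhibition received by one neuron grows as (n_out - 1)/sqrt(n_out),
# i.e., roughly sqrt(n_out), instead of linearly in n_out without rescaling.
total_2 = -inhibitory_weights(2).sum(axis=1)[0]
total_8 = -inhibitory_weights(8).sum(axis=1)[0]
```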
<p>For all parameters but the noise intensity &#x003BE;<sub><italic>i</italic></sub> during learning, we used the same values as in the original network model (Asabuki and Fukai, <xref ref-type="bibr" rid="B4">2020</xref>). For larger values of the noise intensity, the neural responses were subject to stronger fluctuations and neurons tended to group into a single cell assembly. From the analysis of the learning curves shown in <xref ref-type="fig" rid="F3">Figure 3</xref>, we decided to train the network from randomly initialized weights and to expose it, during training, to the mixture sequence 3,000 times for the synthesized sounds and 1,500 times for the real-world sounds. The learning rate was kept constant throughout the whole process. During testing, the sequence of target sounds and respective distractors was presented 50 times, and the resulting neural dynamics was averaged over 20 trials. The performance results shown in section 2 were computed as averages over 10 repetitions of the same simulation set-up. In each repetition, different target sounds and distractors were randomly sampled from the dataset to ensure that the performance did not depend on specific sounds.</p>
</sec>
<sec>
<title>4.4. Experimental Settings and Performance Measure</title>
<p>The synapses were kept fixed during inference in our network, implying that the responses to probes tested later were not affected by the presentation of previously tested probes. This allowed us to test the trained network on a sequence of probes, rather than on a single probe as in the studies of the human brain, where plasticity cannot be frozen during inference (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). In <xref ref-type="fig" rid="F5">Figures 5A</xref>, <xref ref-type="fig" rid="F6">6C</xref>, the first half of the sequence contained the target and the distractors, and the second half the respective incorrect probes, which were built with the same method as in the human experiments (McDermott et al., <xref ref-type="bibr" rid="B41">2011</xref>). Each incorrect probe was a sound randomly sampled from the same Gaussian distribution that generated the associated target. After the sampling, a randomly selected time slice spanning 1/8 of the sound duration was set equal to the corresponding slice of the target.</p>
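<p>The construction of an incorrect probe can be sketched as follows. This is an illustrative Python fragment, not the authors&#x00027; code; the parameters <monospace>mean</monospace> and <monospace>cov</monospace> stand for the Gaussian distribution that generated the associated target, and time is assumed to run along the last axis of the array.</p>

```python
import numpy as np

def make_incorrect_probe(target, mean, cov, rng=None):
    """Sample an incorrect probe for a target (e.g., a spectrogram).

    The probe is drawn from the same Gaussian distribution that generated
    the target; a randomly placed time slice spanning 1/8 of the sound
    duration is then overwritten with the corresponding slice of the target.
    """
    rng = np.random.default_rng() if rng is None else rng
    probe = rng.multivariate_normal(mean, cov).reshape(target.shape)
    n_bins = target.shape[-1]        # time bins along the last axis
    slice_len = max(n_bins // 8, 1)  # 1/8 of the sound duration
    start = rng.integers(0, n_bins - slice_len + 1)
    probe[..., start:start + slice_len] = target[..., start:start + slice_len]
    return probe
```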
<p>The possibility of presenting more than one probe allowed us to test the performance of the network for all the sounds present in the mixtures. To ensure a stable neural response against the variability of the encoding, we repeated the sequence 50 times. The response of the network consisted of the ensemble activity of the output neurons. As previously explained, 400 time steps were devoted to the presentation of each stimulus. The response to each probe therefore consisted of 400 data points describing the dynamical activity of the output layer, each point being a collection of N values, where N is the number of output neurons. An example of one testing epoch output is shown in <xref ref-type="fig" rid="F4">Figures 4A,C</xref>. We neglected the first 50 data points since, during the initial transient time, the membrane potential was still decaying or rising after the previous input presentation. For visualization purposes, we applied principal component analysis (PCA) to reduce the dimensionality of the data from N to 2. In our settings, the two principal components explained approximately 40% of the variance of the neural response. The PCA transformation was based solely on the data points obtained with the presentation of the target and the distractors, as shown in <xref ref-type="fig" rid="F4">Figures 4B,E</xref>. The same transformation was later used to project the points related to the incorrect probes. Only the target and distractor patterns were presented during the learning process, and the responses to unseen patterns were afterwards projected onto the space defined by the training.</p>
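<p>The dimensionality-reduction step can be sketched with scikit-learn as follows; this is an illustrative fragment (function and variable names are our own), with each response a (400, N) array of output-layer activity as described above.</p>

```python
import numpy as np
from sklearn.decomposition import PCA

def project_responses(train_responses, probe_responses, n_discard=50):
    """Fit a 2-D PCA on target/distractor responses and project probes.

    Each response is a (400, N) array of output-layer activity; the
    first n_discard points are dropped to remove the initial transient.
    The PCA is defined by the training (target/distractor) data only,
    and the same transformation is applied to the unseen probes.
    """
    train = np.concatenate([r[n_discard:] for r in train_responses])
    pca = PCA(n_components=2).fit(train)
    train_2d = pca.transform(train)
    probes_2d = [pca.transform(r[n_discard:]) for r in probe_responses]
    return pca, train_2d, probes_2d
```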
<p>The two-dimensional projections of the target-related data points were clustered in an unsupervised manner with a Gaussian mixture model (GMM). We set the number of Gaussians equal to the number of targets such that the covariance matrices had full rank. With the defined GMM at hand, we evaluated all the PCA data points, related to both correct and incorrect probes. The model indicated which cluster each data point belonged to and the likelihood (<italic>L</italic>) that the cluster had generated it. <xref ref-type="fig" rid="F4">Figures 4C,F</xref> show the data points projected on the PCA plane together with the GMM clustering and likelihood curves.</p>
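<p>The clustering and scoring step can be sketched with scikit-learn&#x00027;s Gaussian mixture implementation; an illustrative fragment, not the authors&#x00027; code.</p>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_and_score(target_points, all_points, n_targets, seed=0):
    """Fit one Gaussian per target on the target-related 2-D points,
    then assign every point (correct and incorrect probes alike) to a
    cluster and compute the likelihood score L that generated it."""
    gmm = GaussianMixture(n_components=n_targets, random_state=seed)
    gmm.fit(target_points)
    labels = gmm.predict(all_points)        # cluster membership per point
    loglik = gmm.score_samples(all_points)  # likelihood score L per point
    return labels, loglik
```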
<p>We used the likelihood as a measure of performance. The four intervals of the likelihood range, corresponding to the responses &#x0201C;sure yes,&#x0201D; &#x0201C;yes,&#x0201D; &#x0201C;no,&#x0201D; and &#x0201C;sure no,&#x0201D; were (i) <italic>L</italic> &#x0003E; 0 (sure yes), (ii) &#x02212;5 &#x0003C; <italic>L</italic> &#x0003C; 0 (yes), (iii) &#x02212;15 &#x0003C; <italic>L</italic> &#x0003C; &#x02212;5 (no), and (iv) <italic>L</italic> &#x0003C; &#x02212;15 (sure no). In building the receiver operating characteristic (ROC) curve, the data points falling in interval (i) were assigned the probability value 1.0, those in (ii) 0.66, those in (iii) 0.33, and those in (iv) 0.0.</p>
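<p>The mapping from likelihood intervals to the probability values used for the ROC curve can be sketched as:</p>

```python
import numpy as np

def likelihood_to_probability(L):
    """Map likelihood scores to the probability values used for the ROC:
    L > 0 ("sure yes") -> 1.0, -5 < L < 0 ("yes") -> 0.66,
    -15 < L < -5 ("no") -> 0.33, L < -15 ("sure no") -> 0.0."""
    L = np.asarray(L, dtype=float)
    # np.select takes the first matching condition, so the intervals nest
    return np.select([L > 0, L > -5, L > -15], [1.0, 0.66, 0.33], default=0.0)
```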
<p>The described evaluation metric was applied only to the experiments carried out on the dataset composed of synthesized sounds. For the experiments based on natural sounds and images, the results of clustering were shown only qualitatively for the target-related data points. Indeed, due to the real-world nature of the signals, it was not possible to simply use Gaussian functions to build physically consistent incorrect probes. On the real-world sound dataset, we followed the same protocol as in Experiment 1 (Experiment 4). On the image dataset, we performed an experiment with a protocol analogous to that of Experiment 1. Here, the mixtures were obtained by overlapping two images, both with transparency 0.5, similarly to the spectrogram overlapping described for the acoustic task. The input images were normalized to the range [0,1] and the intensity of each pixel was encoded through the firing rate of one input neuron. We followed the same procedure and network settings described for the audio stimuli segregation to assess the ability of the network to separate visual stimuli presented in mixtures.</p>
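<p>The image-mixture construction can be sketched as follows; this is an illustrative fragment, and the min&#x02013;max normalization is our assumption about how the images were scaled to [0,1].</p>

```python
import numpy as np

def mix_images(img_a, img_b):
    """Overlap two images, each with transparency 0.5, after normalizing
    both to the range [0, 1] (analogous to the spectrogram overlap)."""
    def normalize(x):
        # min-max scaling to [0, 1]; an assumption, not the authors' exact recipe
        x = np.asarray(x, dtype=float)
        lo, hi = x.min(), x.max()
        return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
    return 0.5 * normalize(img_a) + 0.5 * normalize(img_b)
```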
</sec>
</sec>
<sec sec-type="data-availability" id="s5">
<title>Data Availability Statement</title>
<p>The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: <ext-link ext-link-type="uri" xlink:href="https://github.com/GiorgiaD/dendritic-neuron-BSS">https://github.com/GiorgiaD/dendritic-neuron-BSS</ext-link>.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>TF, GD, and TA conceived the idea. GD designed and performed the simulations, with input from TA. GD and TF wrote the manuscript. TA and GD wrote the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>. All authors analyzed the results. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s7">
<title>Funding</title>
<p>This work was partly supported by JSPS KAKENHI no. 19H04994 to TF.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s8">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ack><p>We are grateful to all the colleagues in the Neural Coding and Brain Computing Unit for fruitful interaction.</p>
</ack>
<sec sec-type="supplementary-material" id="s9">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fnins.2022.855753/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fnins.2022.855753/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahveninen</surname> <given-names>J.</given-names></name> <name><surname>H&#x000E4;m&#x000E4;l&#x000E4;inen</surname> <given-names>M.</given-names></name> <name><surname>J&#x000E4;&#x000E4;skel&#x000E4;inen</surname> <given-names>I. P.</given-names></name> <name><surname>Ahlfors</surname> <given-names>S. P.</given-names></name> <name><surname>Huang</surname> <given-names>S.</given-names></name> <name><surname>Lin</surname> <given-names>F.-H.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>Attention-driven auditory cortex short-term plasticity helps segregate relevant sounds from noise</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>108</volume>, <fpage>4182</fpage>&#x02013;<lpage>4187</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1016134108</pub-id><pub-id pub-id-type="pmid">21368107</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Almomani</surname> <given-names>D.</given-names></name> <name><surname>Alauthman</surname> <given-names>M.</given-names></name> <name><surname>Alweshah</surname> <given-names>M.</given-names></name> <name><surname>Dorgham</surname> <given-names>O.</given-names></name> <name><surname>Albalas</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>A comparative study on spiking neural network encoding schema: implemented with cloud computing</article-title>. <source>Cluster Comput.</source> <volume>22</volume>, <fpage>419</fpage>&#x02013;<lpage>433</lpage>. <pub-id pub-id-type="doi">10.1007/s10586-018-02891-0</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Amari</surname> <given-names>S.</given-names></name> <name><surname>Cichocki</surname> <given-names>A.</given-names></name> <name><surname>Yang</surname> <given-names>H.</given-names></name></person-group> (<year>1995</year>). <article-title>A new learning algorithm for blind signal separation,</article-title> in <source>NIPS&#x00027;95: Proceedings of the 8th International Conference on Neural Information Processing Systems</source> (<publisher-loc>Cambridge, MA</publisher-loc>), <fpage>757</fpage>&#x02013;<lpage>763</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Asabuki</surname> <given-names>T.</given-names></name> <name><surname>Fukai</surname> <given-names>T.</given-names></name></person-group> (<year>2020</year>). <article-title>Somatodendritic consistency check for temporal feature segmentation</article-title>. <source>Nat. Commun.</source> <volume>11</volume>, <fpage>1554</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-020-15367-w</pub-id><pub-id pub-id-type="pmid">32214100</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Asari</surname> <given-names>H.</given-names></name> <name><surname>Pearlmutter</surname> <given-names>B. A.</given-names></name> <name><surname>Zador</surname> <given-names>A. M.</given-names></name></person-group> (<year>2006</year>). <article-title>Sparse representations for the cocktail party problem</article-title>. <source>J. Neurosci.</source> <volume>26</volume>, <fpage>7477</fpage>&#x02013;<lpage>7490</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.1563-06.2006</pub-id><pub-id pub-id-type="pmid">16837596</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Atilgan</surname> <given-names>H.</given-names></name> <name><surname>Town</surname> <given-names>S. M.</given-names></name> <name><surname>Wood</surname> <given-names>K. C.</given-names></name> <name><surname>Jones</surname> <given-names>G. P.</given-names></name> <name><surname>Maddox</surname> <given-names>R. K.</given-names></name> <name><surname>Lee</surname> <given-names>A. K.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Integration of visual information in auditory cortex promotes auditory scene analysis through multisensory binding</article-title>. <source>Neuron</source> <volume>97</volume>, <fpage>640.e4</fpage>&#x02013;<lpage>655.e4</lpage>. <pub-id pub-id-type="doi">10.1101/098798</pub-id><pub-id pub-id-type="pmid">29395914</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bahroun</surname> <given-names>Y.</given-names></name> <name><surname>Chklovskii</surname> <given-names>D. B.</given-names></name> <name><surname>Sengupta</surname> <given-names>A. M.</given-names></name></person-group> (<year>2021</year>). <article-title>A normative and biologically plausible algorithm for independent component analysis</article-title>. <source>arXiv [Preprint]</source>. arXiv: 2111.08858. <pub-id pub-id-type="doi">10.48550/arXiv.2111.08858</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><collab>BBC</collab></person-group>. (<year>1991</year>). <source>BBC Sound Effects Library [compact disc; recorded 1977&#x02013;1986]</source>. <publisher-loc>Princeton, NJ</publisher-loc>: <publisher-name>Films for the Humanities and Sciences</publisher-name>.</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bee</surname> <given-names>M.</given-names></name> <name><surname>Micheyl</surname> <given-names>C.</given-names></name></person-group> (<year>2008</year>). <article-title>The cocktail party problem: what is it? How can it be solved? and why should animal behaviorists study it?</article-title> <source>J. Comp. Psychol.</source> <volume>122</volume>, <fpage>235</fpage>&#x02013;<lpage>251</lpage>. <pub-id pub-id-type="doi">10.1037/0735-7036.122.3.235</pub-id><pub-id pub-id-type="pmid">18729652</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bell</surname> <given-names>A.</given-names></name> <name><surname>Sejnowski</surname> <given-names>T.</given-names></name></person-group> (<year>1995</year>). <article-title>An information-maximization approach to blind separation and blind deconvolution</article-title>. <source>Neural Comput.</source> <volume>7</volume>, <fpage>1129</fpage>&#x02013;<lpage>1159</lpage>.<pub-id pub-id-type="pmid">7584893</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bronkhorst</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>The cocktail-party problem revisited: early processing and selection of multi-talker speech</article-title>. <source>Attent. Percept. Psychophys.</source> <volume>77</volume>, <fpage>1465</fpage>&#x02013;<lpage>1487</lpage>. <pub-id pub-id-type="doi">10.3758/s13414-015-0882-9</pub-id><pub-id pub-id-type="pmid">25828463</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brown</surname> <given-names>G.</given-names></name> <name><surname>Yamada</surname> <given-names>S.</given-names></name> <name><surname>Sejnowski</surname> <given-names>T.</given-names></name></person-group> (<year>2001</year>). <article-title>Independent component analysis at neural cocktail party</article-title>. <source>Trends Neurosci.</source> <volume>24</volume>, <fpage>54</fpage>&#x02013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.1016/S0166-2236(00)01683-0</pub-id><pub-id pub-id-type="pmid">11163888</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cherry</surname> <given-names>E. C.</given-names></name></person-group> (<year>1953</year>). <article-title>Some experiments on the recognition of speech, with one and with two ears</article-title>. <source>J. Acoust. Soc. Am.</source> <volume>25</volume>, <fpage>975</fpage>&#x02013;<lpage>979</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cichocki</surname> <given-names>A.</given-names></name> <name><surname>Zdunek</surname> <given-names>R.</given-names></name> <name><surname>Amari</surname> <given-names>S.</given-names></name></person-group> (<year>2006</year>). <article-title>New algorithms for non-negative matrix factorization in applications to blind source separation,</article-title> in <source>2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings</source> (<publisher-loc>Toulouse</publisher-loc>).</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Comon</surname> <given-names>P.</given-names></name></person-group> (<year>1994</year>). <article-title>Independent component analysis, a new concept?</article-title> <source>Signal Process.</source> <volume>36</volume>, <fpage>287</fpage>&#x02013;<lpage>314</lpage>.</citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>N.</given-names></name> <name><surname>Simon</surname> <given-names>J. Z.</given-names></name></person-group> (<year>2012</year>). <article-title>Neural coding of continuous speech in auditory cortex during monaural and dichotic listening</article-title>. <source>J. Neurophysiol.</source> <volume>107</volume>, <fpage>78</fpage>&#x02013;<lpage>89</lpage>. <pub-id pub-id-type="doi">10.1152/jn.00297.2011</pub-id><pub-id pub-id-type="pmid">21975452</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>J.</given-names></name> <name><surname>Colburn</surname> <given-names>H. S.</given-names></name> <name><surname>Sen</surname> <given-names>K.</given-names></name></person-group> (<year>2016</year>). <article-title>Cortical transformation of spatial processing for solving the cocktail party problem: a computational model</article-title>. <source>eNeuro</source> <volume>3</volume>, <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1523/ENEURO.0086-15.2015</pub-id><pub-id pub-id-type="pmid">26866056</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Elhilali</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>Bayesian inference in auditory scenes,</article-title> in <source>Conference Proceedings : Annual International Conference of the IEEE Engineering in Medicine and Biology Society</source> (<publisher-loc>Osaka</publisher-loc>), <fpage>2792</fpage>&#x02013;<lpage>2795</lpage>.<pub-id pub-id-type="pmid">24110307</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elhilali</surname> <given-names>M.</given-names></name> <name><surname>Shamma</surname> <given-names>S.</given-names></name></person-group> (<year>2009</year>). <article-title>A cocktail party with a cortical twist: how cortical mechanisms contribute to sound segregation</article-title>. <source>J. Acoust. Soc. Am.</source> <volume>124</volume>, <fpage>3751</fpage>&#x02013;<lpage>3771</lpage>. <pub-id pub-id-type="doi">10.1121/1.3001672</pub-id><pub-id pub-id-type="pmid">19206802</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>French</surname> <given-names>R. M.</given-names></name></person-group> (<year>1999</year>). <article-title>Catastrophic forgetting in connectionist networks</article-title>. <source>Trends Cogn. Sci.</source> <volume>3</volume>, <fpage>128</fpage>&#x02013;<lpage>135</lpage>. <pub-id pub-id-type="doi">10.1016/S1364-6613(99)01294-2</pub-id><pub-id pub-id-type="pmid">10322466</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Frommolt</surname> <given-names>K. -H.</given-names></name> <name><surname>Bardeli</surname> <given-names>R.</given-names></name> <name><surname>Kurth</surname> <given-names>F.</given-names></name> <name><surname>Clausen</surname> <given-names>M.</given-names></name></person-group> (<year>2006</year>). <source>The Animal Sound Archive at the Humboldt-University of Berlin: Current Activities in Conservation and Improving Access for Bioacoustic Research</source>. <publisher-loc>Ljubljana</publisher-loc>: <publisher-name>Slovenska akademija znanosti in umetnosti</publisher-name>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Glasberg</surname> <given-names>B. R.</given-names></name> <name><surname>Moore</surname> <given-names>B. C.</given-names></name></person-group> (<year>1990</year>). <article-title>Derivation of auditory filter shapes from notched-noise data</article-title>. <source>Hear. Res.</source> <volume>47</volume>, <fpage>103</fpage>&#x02013;<lpage>138</lpage>.<pub-id pub-id-type="pmid">2228789</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Golumbic</surname> <given-names>E. Z.</given-names></name> <name><surname>Cogan</surname> <given-names>G. B.</given-names></name> <name><surname>Schroeder</surname> <given-names>C. E.</given-names></name> <name><surname>Poeppel</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). <article-title>Visual input enhances selective speech envelope tracking in auditory cortex at a cocktail party</article-title>. <source>J. Neurosci.</source> <volume>33</volume>, <fpage>1417</fpage>&#x02013;<lpage>1426</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.3675-12.2013</pub-id><pub-id pub-id-type="pmid">23345218</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hawley</surname> <given-names>M. L.</given-names></name> <name><surname>Litovsky</surname> <given-names>R. Y.</given-names></name> <name><surname>Culling</surname> <given-names>J. F.</given-names></name></person-group> (<year>2004</year>). <article-title>The benefit of binaural hearing in a cocktail party: effect of location and type of interferer</article-title>. <source>J. Acoust. Soc. Am.</source> <volume>115</volume>, <fpage>833</fpage>&#x02013;<lpage>843</lpage>. <pub-id pub-id-type="doi">10.1121/1.1639908</pub-id><pub-id pub-id-type="pmid">15000195</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Haykin</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name></person-group> (<year>2005</year>). <article-title>The cocktail party problem</article-title>. <source>Neural Comput.</source> <volume>17</volume>, <fpage>1875</fpage>&#x02013;<lpage>1902</lpage>. <pub-id pub-id-type="doi">10.1162/0899766054322964</pub-id><pub-id pub-id-type="pmid">15992485</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hyv&#x000E4;rinen</surname> <given-names>A.</given-names></name> <name><surname>Oja</surname> <given-names>E.</given-names></name></person-group> (<year>1997</year>). <article-title>A fast fixed-point algorithm for independent component analysis</article-title>. <source>Neural Comput.</source> <volume>9</volume>, <fpage>1483</fpage>&#x02013;<lpage>1492</lpage>.<pub-id pub-id-type="pmid">10798706</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Isomura</surname> <given-names>T.</given-names></name> <name><surname>Toyoizumi</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>Multi-context blind source separation by error-gated Hebbian rule</article-title>. <source>Sci. Rep.</source> <volume>9</volume>, <fpage>7127</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-019-43423-z</pub-id><pub-id pub-id-type="pmid">31073206</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jacobsen</surname> <given-names>T.</given-names></name> <name><surname>Schr&#x000F6;ger</surname> <given-names>E.</given-names></name> <name><surname>Winkler</surname> <given-names>I.</given-names></name> <name><surname>Horv&#x000E1;th</surname> <given-names>J.</given-names></name></person-group> (<year>2005</year>). <article-title>Familiarity affects the processing of task-irrelevant auditory deviance</article-title>. <source>J. Cogn. Neurosci.</source> <volume>17</volume>, <fpage>1704</fpage>&#x02013;<lpage>1713</lpage>. <pub-id pub-id-type="doi">10.1162/089892905774589262</pub-id><pub-id pub-id-type="pmid">16269107</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kameoka</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Inoue</surname> <given-names>S.</given-names></name> <name><surname>Makino</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Semi-blind source separation with multichannel variational autoencoder</article-title>. <source>arXiv preprint arXiv:1808.00892</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1808.00892</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karamatli</surname> <given-names>E.</given-names></name> <name><surname>Cemgil</surname> <given-names>A. T.</given-names></name> <name><surname>Kirbiz</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Weak label supervision for monaural source separation using non-negative denoising variational autoencoders,</article-title> in <source>2019 27th Signal Processing and Communications Applications Conference (SIU)</source> (Sivas).</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kerlin</surname> <given-names>J.</given-names></name> <name><surname>Shahin</surname> <given-names>A.</given-names></name> <name><surname>Miller</surname> <given-names>L.</given-names></name></person-group> (<year>2010</year>). <article-title>Attentional gain control of ongoing cortical speech representations in a cocktail party</article-title>. <source>J. Neurosci.</source> <volume>30</volume>, <fpage>620</fpage>&#x02013;<lpage>628</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.3631-09.2010</pub-id><pub-id pub-id-type="pmid">20071526</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krause-Solberg</surname> <given-names>S.</given-names></name> <name><surname>Iske</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>Non-negative dimensionality reduction for audio signal separation by NNMF and ICA,</article-title> in <source>2015 International Conference on Sampling Theory and Applications, SampTA 2015</source> (<publisher-loc>Washington, DC</publisher-loc>), <fpage>377</fpage>&#x02013;<lpage>381</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krishnan</surname> <given-names>L.</given-names></name> <name><surname>Elhilali</surname> <given-names>M.</given-names></name> <name><surname>Shamma</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>Segregating complex sound sources through temporal coherence</article-title>. <source>PLoS Comput. Biol.</source> <volume>10</volume>, <fpage>e1003985</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1003985</pub-id><pub-id pub-id-type="pmid">25521593</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Larkum</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex</article-title>. <source>Trends Neurosci.</source> <volume>36</volume>, <fpage>141</fpage>&#x02013;<lpage>151</lpage>. <pub-id pub-id-type="doi">10.1016/j.tins.2012.11.006</pub-id><pub-id pub-id-type="pmid">23273272</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Larkum</surname> <given-names>M.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <name><surname>Sakmann</surname> <given-names>B.</given-names></name></person-group> (<year>1999</year>). <article-title>A new cellular mechanism for coupling inputs arriving at different cortical layers</article-title>. <source>Nature</source> <volume>398</volume>, <fpage>338</fpage>&#x02013;<lpage>341</lpage>.<pub-id pub-id-type="pmid">10192334</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lewald</surname> <given-names>J.</given-names></name> <name><surname>Getzmann</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Electrophysiological correlates of cocktail-party listening</article-title>. <source>Behav. Brain Res.</source> <volume>292</volume>, <fpage>157</fpage>&#x02013;<lpage>166</lpage>. <pub-id pub-id-type="doi">10.1016/j.bbr.2015.06.025</pub-id><pub-id pub-id-type="pmid">26092714</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>F.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Cichocki</surname> <given-names>A.</given-names></name> <name><surname>Sejnowski</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <article-title>The effects of audiovisual inputs on solving the cocktail party problem in the human brain: an fMRI study</article-title>. <source>Cereb. Cortex</source> <volume>28</volume>, <fpage>3623</fpage>&#x02013;<lpage>3637</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bhx235</pub-id><pub-id pub-id-type="pmid">29029039</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Q.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name> <name><surname>Hao</surname> <given-names>Y.</given-names></name> <name><surname>Xu</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>B.</given-names></name></person-group> (<year>2021</year>). <article-title>LiMuSE: Lightweight multi-modal speaker extraction</article-title>. <source>arXiv [Preprint]</source>. arXiv: 2111.04063.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>L&#x000F3;pez-Serrano</surname> <given-names>P.</given-names></name> <name><surname>Dittmar</surname> <given-names>C.</given-names></name> <name><surname>&#x000D6;zer</surname> <given-names>Y.</given-names></name> <name><surname>M&#x000FC;ller</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>NMF toolbox: music processing applications of nonnegative matrix factorization</article-title>.</citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McDermott</surname> <given-names>J. H.</given-names></name></person-group> (<year>2009</year>). <article-title>The cocktail party problem</article-title>. <source>Curr. Biol.</source> <volume>19</volume>, <fpage>R1024</fpage>&#x02013;<lpage>R1027</lpage>. <pub-id pub-id-type="doi">10.1016/j.cub.2009.09.005</pub-id><pub-id pub-id-type="pmid">19948136</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McDermott</surname> <given-names>J. H.</given-names></name> <name><surname>Wrobleski</surname> <given-names>D.</given-names></name> <name><surname>Oxenham</surname> <given-names>A. J.</given-names></name></person-group> (<year>2011</year>). <article-title>Recovering sound sources from embedded repetition</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>108</volume>, <fpage>1188</fpage>&#x02013;<lpage>1193</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1004765108</pub-id><pub-id pub-id-type="pmid">21199948</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McFee</surname> <given-names>B.</given-names></name> <name><surname>Raffel</surname> <given-names>C.</given-names></name> <name><surname>Liang</surname> <given-names>D.</given-names></name> <name><surname>Ellis</surname> <given-names>D.</given-names></name> <name><surname>McVicar</surname> <given-names>M.</given-names></name> <name><surname>Battenberg</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>librosa: Audio and music signal analysis in Python,</article-title> in <source>Proc. of the 14th Python in Science Conf. (SCIPY 2015)</source> (<publisher-loc>Austin</publisher-loc>), <fpage>18</fpage>&#x02013;<lpage>24</lpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mesgarani</surname> <given-names>N.</given-names></name> <name><surname>Chang</surname> <given-names>E.</given-names></name></person-group> (<year>2012</year>). <article-title>Selective cortical representation of attended speaker in multi-talker speech perception</article-title>. <source>Nature</source> <volume>485</volume>, <fpage>233</fpage>&#x02013;<lpage>236</lpage>. <pub-id pub-id-type="doi">10.1038/nature11020</pub-id><pub-id pub-id-type="pmid">22522927</pub-id></citation></ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Middlebrooks</surname> <given-names>J. C.</given-names></name> <name><surname>Waters</surname> <given-names>M. F.</given-names></name></person-group> (<year>2020</year>). <article-title>Spatial mechanisms for segregation of competing sounds, and a breakdown in spatial hearing</article-title>. <source>Front. Neurosci.</source> <volume>14</volume>, <fpage>571095</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2020.571095</pub-id><pub-id pub-id-type="pmid">33041763</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mika</surname> <given-names>D.</given-names></name> <name><surname>Budzik</surname> <given-names>G.</given-names></name> <name><surname>J&#x000F3;zwik</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>ICA-based single channel source separation with time-frequency decomposition,</article-title> in <source>2020 IEEE 7th International Workshop on Metrology for AeroSpace (MetroAeroSpace)</source> (<publisher-loc>Pisa</publisher-loc>), <fpage>238</fpage>&#x02013;<lpage>243</lpage>.</citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Narayan</surname> <given-names>R.</given-names></name> <name><surname>Best</surname> <given-names>V.</given-names></name> <name><surname>Ozmeral</surname> <given-names>E.</given-names></name> <name><surname>McClaine</surname> <given-names>E.</given-names></name> <name><surname>Dent</surname> <given-names>M.</given-names></name> <name><surname>Shinn-Cunningham</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2008</year>). <article-title>Cortical interference effects in the cocktail party problem</article-title>. <source>Nat. Neurosci.</source> <volume>10</volume>, <fpage>1601</fpage>&#x02013;<lpage>1607</lpage>. <pub-id pub-id-type="doi">10.1038/nn2009</pub-id><pub-id pub-id-type="pmid">17994016</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>O&#x00027;Sullivan</surname> <given-names>J.</given-names></name> <name><surname>Power</surname> <given-names>A.</given-names></name> <name><surname>Mesgarani</surname> <given-names>N.</given-names></name> <name><surname>Rajaram</surname> <given-names>S.</given-names></name> <name><surname>Foxe</surname> <given-names>J.</given-names></name> <name><surname>Shinn-Cunningham</surname> <given-names>B.</given-names></name> <name><surname>Slaney</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>Attentional selection in a cocktail party environment can be decoded from single-trial EEG</article-title>. <source>Cereb. Cortex</source> <volume>25</volume>, <fpage>1697</fpage>&#x02013;<lpage>1706</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bht355</pub-id><pub-id pub-id-type="pmid">24429136</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oxenham</surname> <given-names>A. J.</given-names></name></person-group> (<year>2018</year>). <article-title>How we hear: the perception and neural coding of sound</article-title>. <source>Annu. Rev. Psychol.</source> <volume>69</volume>, <fpage>27</fpage>&#x02013;<lpage>50</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-psych-122216-011635</pub-id><pub-id pub-id-type="pmid">29035691</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pehlevan</surname> <given-names>C.</given-names></name> <name><surname>Mohan</surname> <given-names>S.</given-names></name> <name><surname>Chklovskii</surname> <given-names>D. B.</given-names></name></person-group> (<year>2017</year>). <article-title>Blind nonnegative source separation using biological neural networks</article-title>. <source>Neural Comput.</source> <volume>29</volume>, <fpage>2925</fpage>&#x02013;<lpage>2954</lpage>. <pub-id pub-id-type="doi">10.1162/neco_a_01007</pub-id><pub-id pub-id-type="pmid">28777718</pub-id></citation></ref>
<ref id="B50">
<citation citation-type="web"><person-group person-group-type="author"><collab>Philharmonia Orchestra Instruments.</collab></person-group> (<year>2019</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://philharmonia.co.uk/resources/instruments/">https://philharmonia.co.uk/resources/instruments/</ext-link></citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Popham</surname> <given-names>S.</given-names></name> <name><surname>Boebinger</surname> <given-names>D.</given-names></name> <name><surname>Ellis</surname> <given-names>D.</given-names></name> <name><surname>Kawahara</surname> <given-names>H.</given-names></name> <name><surname>McDermott</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Inharmonic speech reveals the role of harmonicity in the cocktail party problem</article-title>. <source>Nat. Commun.</source> <volume>9</volume>, <fpage>2122</fpage>. <pub-id pub-id-type="doi">10.1038/s41467-018-04551-8</pub-id><pub-id pub-id-type="pmid">29844313</pub-id></citation></ref>
<ref id="B52">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sagi</surname> <given-names>B.</given-names></name> <name><surname>Nemat-Nasser</surname> <given-names>S. C.</given-names></name> <name><surname>Kerr</surname> <given-names>R.</given-names></name> <name><surname>Hayek</surname> <given-names>R.</given-names></name> <name><surname>Downing</surname> <given-names>C.</given-names></name> <name><surname>Hecht-Nielsen</surname> <given-names>R.</given-names></name></person-group> (<year>2001</year>). <article-title>A biologically motivated solution to the cocktail party problem</article-title>. <source>Neural Comput.</source> <volume>13</volume>, <fpage>1575</fpage>&#x02013;<lpage>1602</lpage>. <pub-id pub-id-type="doi">10.1162/089976601750265018</pub-id><pub-id pub-id-type="pmid">11440598</pub-id></citation></ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Santosh</surname> <given-names>K. S.</given-names></name> <name><surname>Bharathi</surname> <given-names>S. H.</given-names></name></person-group> (<year>2017</year>). <article-title>Non-negative matrix factorization algorithms for blind source separation in speech recognition,</article-title> in <source>2017 2nd IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT)</source> (<publisher-loc>Bangalore</publisher-loc>), <fpage>2242</fpage>&#x02013;<lpage>2246</lpage>.</citation>
</ref>
<ref id="B54">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sawada</surname> <given-names>H.</given-names></name> <name><surname>Ono</surname> <given-names>N.</given-names></name> <name><surname>Kameoka</surname> <given-names>H.</given-names></name> <name><surname>Kitamura</surname> <given-names>D.</given-names></name> <name><surname>Saruwatari</surname> <given-names>H.</given-names></name></person-group> (<year>2019</year>). <article-title>A review of blind source separation methods: two converging routes to ILRMA originating from ICA and NMF</article-title>. <source>APSIPA Trans. Signal Inform. Process.</source> <volume>8</volume>, <fpage>1</fpage>&#x02013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1017/ATSIP.2019.5</pub-id></citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmidt</surname> <given-names>A. K. D.</given-names></name> <name><surname>R&#x000F6;mer</surname> <given-names>H.</given-names></name></person-group> (<year>2011</year>). <article-title>Solutions to the cocktail party problem in insects: selective filters, spatial release from masking and gain control in tropical crickets</article-title>. <source>PLoS ONE</source> <volume>6</volume>, <fpage>e28593</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0028593</pub-id><pub-id pub-id-type="pmid">22163041</pub-id></citation></ref>
<ref id="B56">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>J.</given-names></name> <name><surname>Angelucci</surname> <given-names>A.</given-names></name> <name><surname>Sur</surname> <given-names>M.</given-names></name></person-group> (<year>2000</year>). <article-title>Induction of visual orientation modules in auditory cortex</article-title>. <source>Nature</source> <volume>404</volume>, <fpage>841</fpage>&#x02013;<lpage>847</lpage>. <pub-id pub-id-type="doi">10.1038/35009043</pub-id><pub-id pub-id-type="pmid">10786784</pub-id></citation></ref>
<ref id="B57">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Smaragdis</surname> <given-names>P.</given-names></name> <name><surname>Brown</surname> <given-names>J.</given-names></name></person-group> (<year>2003</year>). <article-title>Non-negative matrix factorization for polyphonic music transcription,</article-title> in <source>2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics</source> (<publisher-loc>New Paltz, NY</publisher-loc>), <fpage>177</fpage>&#x02013;<lpage>180</lpage>.</citation>
</ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stevens</surname> <given-names>S. S.</given-names></name> <name><surname>Volkmann</surname> <given-names>J.</given-names></name> <name><surname>Newman</surname> <given-names>E. B.</given-names></name></person-group> (<year>1937</year>). <article-title>A scale for the measurement of the psychological magnitude pitch</article-title>. <source>J. Acoust. Soc. Am.</source> <volume>8</volume>, <fpage>185</fpage>&#x02013;<lpage>190</lpage>.</citation></ref>
<ref id="B59">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Teki</surname> <given-names>S.</given-names></name> <name><surname>Chait</surname> <given-names>M.</given-names></name> <name><surname>Kumar</surname> <given-names>S.</given-names></name> <name><surname>Shamma</surname> <given-names>S.</given-names></name> <name><surname>Griffiths</surname> <given-names>T. D.</given-names></name></person-group> (<year>2013</year>). <article-title>Segregation of complex acoustic scenes based on temporal coherence</article-title>. <source>eLife</source> <volume>2</volume>, <fpage>e00699</fpage>. <pub-id pub-id-type="doi">10.7554/eLife.00699.009</pub-id><pub-id pub-id-type="pmid">23898398</pub-id></citation></ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thakur</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>R.</given-names></name> <name><surname>Afshar</surname> <given-names>S.</given-names></name> <name><surname>Hamilton</surname> <given-names>T.</given-names></name> <name><surname>Tapson</surname> <given-names>J.</given-names></name> <name><surname>Shamma</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Sound stream segregation: a neuromorphic approach to solve the cocktail party problem in real-time</article-title>. <source>Front. Neurosci.</source> <volume>9</volume>, <fpage>309</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2015.00309</pub-id><pub-id pub-id-type="pmid">26388721</pub-id></citation></ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Urbanczik</surname> <given-names>R.</given-names></name> <name><surname>Senn</surname> <given-names>W.</given-names></name></person-group> (<year>2014</year>). <article-title>Learning by the dendritic prediction of somatic spiking</article-title>. <source>Neuron</source> <volume>81</volume>, <fpage>521</fpage>&#x02013;<lpage>528</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2013.11.030</pub-id><pub-id pub-id-type="pmid">24507189</pub-id></citation></ref>
<ref id="B62">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>White</surname> <given-names>L.</given-names></name> <name><surname>King</surname> <given-names>S.</given-names></name></person-group> (<year>2003</year>). <source>The Eustace Speech Corpus</source>. <publisher-name>Centre for Speech Technology Research, University of Edinburgh</publisher-name>.</citation>
</ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wickens</surname> <given-names>T. D.</given-names></name></person-group> (<year>2002</year>). <source>Elementary Signal Detection Theory.</source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>.</citation>
</ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Woods</surname> <given-names>K. J. P.</given-names></name> <name><surname>McDermott</surname> <given-names>J. H.</given-names></name></person-group> (<year>2018</year>). <article-title>Schema learning for the cocktail party problem</article-title>. <source>Proc. Natl. Acad. Sci. U.S.A.</source> <volume>115</volume>, <fpage>E3313</fpage>&#x02013;<lpage>E3322</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1801614115</pub-id><pub-id pub-id-type="pmid">29563229</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xiang</surname> <given-names>J.</given-names></name> <name><surname>Simon</surname> <given-names>J.</given-names></name> <name><surname>Elhilali</surname> <given-names>M.</given-names></name></person-group> (<year>2010</year>). <article-title>Competing streams at the cocktail party: exploring the mechanisms of attention and temporal integration</article-title>. <source>J. Neurosci.</source> <volume>30</volume>, <fpage>12084</fpage>&#x02013;<lpage>12093</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.0827-10.2010</pub-id><pub-id pub-id-type="pmid">20826671</pub-id></citation></ref>
<ref id="B66">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <article-title>Solving cocktail party problem&#x02013;from single modality to multi-modality,</article-title> in <source>Proc. 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020)</source> (<publisher-name>Virtual workshop</publisher-name>).</citation>
</ref>
</ref-list>
</back>
</article>