<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">624558</article-id>
<article-id pub-id-type="doi">10.3389/fcomp.2021.624558</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Recognition of Alzheimer&#x2019;s Dementia From the Transcriptions of Spontaneous Speech Using fastText and CNN Models</article-title>
<alt-title alt-title-type="left-running-head">Meghanani et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">fastText/CNN for AD Recognition</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Meghanani</surname>
<given-names>Amit</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1010444/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Anoop</surname>
<given-names>C. S.</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1031769/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ramakrishnan</surname>
<given-names>Angarai Ganesan</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/878020/overview"/>
</contrib>
</contrib-group>
<aff>MILE Laboratory, Department of Electrical Engineering, Indian Institute of Science, <addr-line>Bengaluru</addr-line>, <country>India</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/141969/overview">Saturnino Luz</ext-link>, University of Edinburgh, United&#x20;Kingdom</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/505655/overview">Diego R. Amancio</ext-link>, University of S&#xe3;o Paulo, Brazil</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/343229/overview">Anna Pribilova</ext-link>, Slovak Academy of Sciences (SAS), Slovakia</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: C. S. Anoop, <email>anoopcs@iisc.ac.in</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Human-Media Interaction, a section of the journal Frontiers in Computer Science</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>03</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>3</volume>
<elocation-id>624558</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>10</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>25</day>
<month>01</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Meghanani, Anoop and Ramakrishnan.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Meghanani, Anoop and Ramakrishnan</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>Alzheimer&#x2019;s dementia (AD) is a type of neurodegenerative disease that is associated with a decline in memory. However, speech and language impairments are also common in Alzheimer&#x2019;s dementia patients. This work extends our previous work, in which we used spontaneous speech for Alzheimer&#x2019;s dementia recognition, employing log-Mel spectrograms and Mel-frequency cepstral coefficients (MFCC) as inputs to deep neural networks (DNN). In this work, we explore the transcriptions of spontaneous speech for dementia recognition and compare the results with several baseline results. We explore two models for dementia recognition: 1) fastText and 2) convolutional neural network (CNN) with a single convolutional layer, to capture the n-gram-based linguistic information from the input sentence. The fastText model uses a bag of bigrams and trigrams along with the input text to capture the local word orderings. In the CNN-based model, we try to capture different n-grams (we use <italic>n</italic>&#x20;&#x3d; 2, 3, 4, 5) present in the text by adapting the kernel sizes to <italic>n</italic>. In both fastText and CNN architectures, the word embeddings are initialized using pretrained GloVe vectors. We use bagging of 21 models in each of these architectures to arrive at the final model, whose performance is assessed on the test data. The best accuracies achieved with CNN and fastText models on the text data are 79.16 and 83.33%, respectively. The best root mean square errors (RMSE) on the prediction of mini-mental state examination (MMSE) score are 4.38 and 4.28 for CNN and fastText, respectively. The results suggest that n-gram-based features are worth pursuing for the task of AD detection. The fastText models achieve results that are competitive with several baseline methods. Moreover, fastText models are shallow and are faster to train and evaluate, by several orders of magnitude, than deep models.</p>
</abstract>
<kwd-group>
<kwd>fastText</kwd>
<kwd>convolutional neural network</kwd>
<kwd>Alzheimer&#x2019;s</kwd>
<kwd>dementia</kwd>
<kwd>mini-mental state examination</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Dementia is a syndrome characterized by a decline in cognition that is significant enough to interfere with one&#x2019;s independent, daily functioning. Alzheimer&#x2019;s disease contributes to around 60&#x2013;70% of dementia cases. Toward the final stages of Alzheimer&#x2019;s dementia (AD), patients lose control of their physical functions and depend on others for care. As there are no curative treatments for dementia, early detection is critical to delay the onset or slow down the progression of the disease. The mini-mental state examination (MMSE) is a widely used test to screen for dementia and to estimate the severity and progression of cognitive impairment.</p>
<p>AD affects the temporal characteristics of spontaneous speech. Changes in the spoken language are evident even in mild AD patients. Subtle language impairments such as difficulties in word finding and comprehension, usage of incorrect words, ambiguous referents, loss of verbal fluency, speaking too much at inappropriate times, talking too loudly, repeating ideas, and digressing from the topic are common in the early stages of AD (<xref ref-type="bibr" rid="B33">Savundranayagam et&#x20;al., 2005</xref>) and they turn extreme in the moderate and severe stages. <xref ref-type="bibr" rid="B37">Szatl&#xf3;czki et&#x20;al. (2015)</xref> show that AD can be detected with the help of a linguistic analysis more sensitively than with other cognitive examinations. <xref ref-type="bibr" rid="B22">Mueller et&#x20;al. (2018b)</xref> analyzed the connected language samples obtained from simple picture description tasks and found that the speech fluency and the semantic content features declined faster in participants with early mild cognitive impairment. The language profile of AD patients is characterized by &#x201c;empty speech,&#x201d; devoid of content words (<xref ref-type="bibr" rid="B23">Nicholas et&#x20;al., 1985</xref>). They tend to use pronouns without proper noun references and indefinite terms like &#x201c;this,&#x201d; &#x201c;that,&#x201d; and &#x201c;thing&#x201d; more often (<xref ref-type="bibr" rid="B21">Mueller et&#x20;al., 2018a</xref>). These results motivate us to believe that modeling the transcriptions of the narrative speech in the cookie-theft picture description task using n-gram language models can help in the detection of AD and prediction of MMSE&#x20;score.</p>
<p>In this work, we address the AD detection and MMSE score prediction problems using two natural language processing (NLP)&#x2013;based models: 1) fastText and 2) convolutional neural network (CNN). These models have the advantage that they can be easily structured to capture the linguistic cues in the form of n-grams from the transcriptions of the picture description task, provided with the Alzheimer&#x2019;s Dementia Recognition through Spontaneous Speech (ADReSS) dataset (<xref ref-type="bibr" rid="B18">Luz et&#x20;al., 2020</xref>). CNNs, though they originated in computer vision, have become popular for NLP tasks and have achieved strong results in sentence classification (<xref ref-type="bibr" rid="B15">Kim, 2014</xref>), semantic parsing (<xref ref-type="bibr" rid="B38">tau Yih et&#x20;al., 2014</xref>), search query retrieval (<xref ref-type="bibr" rid="B35">Shen et&#x20;al., 2014</xref>), and other traditional NLP tasks (<xref ref-type="bibr" rid="B3">Collobert et&#x20;al., 2011</xref>). Our convolutional neural network model draws inspiration from the work on sentence classification using CNNs (<xref ref-type="bibr" rid="B15">Kim, 2014</xref>). fastText (<xref ref-type="bibr" rid="B14">Joulin et&#x20;al., 2017</xref>) is a simple and efficient model for text classification (e.g., tag prediction and sentiment analysis). The fundamental idea in the fastText classifier is to calculate the n-grams of an input sentence and append them to the end of the sentence. Our choice of the fastText model is also motivated by its ability to often match or outperform deep learning classifiers in accuracy while requiring far less training and evaluation time (<xref ref-type="bibr" rid="B14">Joulin et&#x20;al., 2017</xref>).</p>
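<p>The n-gram augmentation idea can be illustrated with a minimal sketch. It assumes whitespace-tokenized input and joins the words of each n-gram with an underscore; the function name <monospace>append_ngrams</monospace>, the joining convention, and the example sentence are hypothetical choices made for illustration and are not taken from the fastText library, which handles n-grams internally.</p>
<preformat>
```python
def append_ngrams(tokens, max_n=3):
    """Return the tokens followed by every contiguous n-gram for n = 2..max_n.

    Illustrative only: the underscore joining is an assumption of this sketch.
    """
    augmented = list(tokens)
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            augmented.append("_".join(tokens[start:start + n]))
    return augmented


# Hypothetical four-word sentence, augmented with its bigrams and trigrams.
example = append_ngrams(["the", "boy", "takes", "cookies"])
```
</preformat>
<p>For this four-word input, the augmented sequence contains the four original words followed by three bigrams and two trigrams, so local word order is preserved even in a bag-of-features representation.</p>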
<p>The rest of the paper is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> discusses the ADReSS dataset in detail. <xref ref-type="sec" rid="s3">Section 3</xref> discusses the baseline results in AD detection. <xref ref-type="sec" rid="s4">Section 4</xref> discusses our proposed NLP-based models followed by the listing of results in <xref ref-type="sec" rid="s5">Section 5</xref>. Our results and conclusions are discussed in <xref ref-type="sec" rid="s6">Section&#x20;6</xref>.</p>
</sec>
<sec id="s2">
<title>2 ADReSS Dataset</title>
<p>The ADReSS dataset (<xref ref-type="bibr" rid="B18">Luz et&#x20;al., 2020</xref>) is designed to provide the Alzheimer&#x2019;s research community with a standard platform for AD detection and MMSE score prediction. The dataset is acoustically preprocessed and balanced in terms of age and gender. It consists of audio recordings and transcriptions [in CHAT format (<xref ref-type="bibr" rid="B19">MacWhinney, 2009</xref>)] of the cookie-theft picture description task, elicited from subjects in the age group of 50&#x2013;80 years. The training set consists of data from 108 subjects, 54 each from the AD and non-AD classes. The test set has data from 48 subjects, again balanced with respect to the AD and non-AD classes. More information on the ADReSS dataset can be found in the ADReSS challenge baseline paper (<xref ref-type="bibr" rid="B18">Luz et&#x20;al., 2020</xref>).</p>
</sec>
<sec id="s3">
<title>3 Review of Baseline Methods</title>
<p>This section provides a brief overview of the various approaches for AD detection and MMSE score prediction on the ADReSS dataset. These approaches can be broadly classified into three types based on the features used: 1) acoustic features, 2) linguistic features, and 3) a fusion of acoustic and linguistic features. The performance of the different approaches on the AD detection and MMSE score prediction tasks is compared using the accuracy and root mean square error (RMSE) measures computed on the ADReSS test set.<disp-formula id="e1">
<mml:math id="me1">
<mml:mrow>
<mml:mtext>Accuracy</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>
<disp-formula id="e2">
<mml:math id="me2">
<mml:mrow>
<mml:mtext>RMSE</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mstyle displaystyle="true">
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:mfrac>
</mml:mrow>
</mml:msqrt>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>where <italic>N</italic> is the total number of subjects involved in the study, <inline-formula id="inf1">
<mml:math id="minf1">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> the number of true positives, and <inline-formula id="inf2">
<mml:math id="minf2">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> the number of true negatives. <inline-formula id="inf3">
<mml:math id="minf3">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf4">
<mml:math id="minf4">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are the estimated and target MMSE scores for the <inline-formula id="inf5">
<mml:math id="minf5">
<mml:mrow>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mtext>th</mml:mtext>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> test sample. The results of different approaches on the ADReSS dataset are summarized in <xref ref-type="table" rid="T1">Table&#x20;1</xref>.</p>
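<p>As a worked illustration of Eqs 1 and 2, the two measures can be computed as in the following sketch. The function names and all labels and scores below are hypothetical values chosen for illustration; they are not drawn from the ADReSS data.</p>
<preformat>
```python
import math


def accuracy(true_labels, predicted_labels):
    """Eq. 1: (TP + TN) / N, i.e. the fraction of correctly classified subjects."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


def rmse(targets, estimates):
    """Eq. 2: square root of the mean squared difference between estimates and targets."""
    squared_errors = [(e - t) ** 2 for t, e in zip(targets, estimates)]
    return math.sqrt(sum(squared_errors) / len(targets))


# Hypothetical balanced test set of 48 subjects with 40 correct decisions.
true_labels = [1] * 24 + [0] * 24
predicted = [1] * 20 + [0] * 4 + [0] * 20 + [1] * 4
acc = accuracy(true_labels, predicted)

# Hypothetical MMSE targets and estimates for two subjects.
err = rmse([30, 20], [27, 24])
```
</preformat>
<p>With 48 test subjects, every attainable accuracy is a multiple of 1/48; for instance, 40 correct decisions correspond to 83.33% and 39 to 81.25%.</p>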
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Baseline methods on ADReSS test&#x20;set.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Model</th>
<th align="center">Accuracy (%)</th>
<th align="center">RMSE</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<xref ref-type="bibr" rid="B34">Searle et&#x20;al. (2020)</xref>, DistilBERT</td>
<td align="center">81.25</td>
<td align="center">4.58</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B34">Searle et&#x20;al. (2020)</xref>, SVM &#x2b; CRF</td>
<td align="center">81.25</td>
<td align="center">5.22</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B26">Pompili et&#x20;al. (2020)</xref>, x-vectors SRE</td>
<td align="center">54.17</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B26">Pompili et&#x20;al. (2020)</xref>, sentence embedding</td>
<td align="center">72.92</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B26">Pompili et&#x20;al. (2020)</xref>, fusion of systems</td>
<td align="center">81.25</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B18">Luz et&#x20;al. (2020)</xref>, linguistic</td>
<td align="center">75.00</td>
<td align="center">5.20</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B32">Sarawgi et&#x20;al. (2020b)</xref>, ensemble</td>
<td align="center">83.33</td>
<td align="center">4.60</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B16">Koo et&#x20;al. (2020)</xref>, VGGish</td>
<td align="center">72.92</td>
<td align="center">5.07</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B16">Koo et&#x20;al. (2020)</xref>, Transformer-XL</td>
<td align="center">81.25</td>
<td align="center">4.01</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B16">Koo et&#x20;al. (2020)</xref>, VGGish &#x2b; GloVe</td>
<td align="center">77.08</td>
<td align="center">4.33</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B16">Koo et&#x20;al. (2020)</xref>, VGGish &#x2b; Transformer-XL</td>
<td align="center">75.00</td>
<td align="center">3.74</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B16">Koo et&#x20;al. (2020)</xref>, ensembled output</td>
<td align="center">81.25</td>
<td align="center">3.77</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B1">Campbell et&#x20;al. (2020)</xref>, fusion II</td>
<td align="center">75.00</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B1">Campbell et&#x20;al. (2020)</xref>, fusion I</td>
<td align="center">72.92</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B1">Campbell et&#x20;al. (2020)</xref>, RNN model</td>
<td align="center">75.00</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B1">Campbell et&#x20;al. (2020)</xref>, fluency</td>
<td align="center">60.42</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B1">Campbell et&#x20;al. (2020)</xref>, x-vector</td>
<td align="center">54.17</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B31">Sarawgi et&#x20;al. (2020a)</xref>, UA ensemble</td>
<td align="center">&#x2014;</td>
<td align="center">4.35</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B31">Sarawgi et&#x20;al. (2020a)</xref>, UA ensemble (weighted)</td>
<td align="center">&#x2014;</td>
<td align="center">3.93</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B24">Pappagari et&#x20;al. (2020)</xref>, acoustic and transcript</td>
<td align="center">75.00</td>
<td align="center">5.37</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B28">Rohanian et&#x20;al. (2020)</xref>, LSTM (Lexical &#x2b; Dis)</td>
<td align="center">72.92</td>
<td align="center">4.88</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B28">Rohanian et&#x20;al. (2020)</xref>, LSTM with gating (Acoustic &#x2b; Lexical)</td>
<td align="center">77.08</td>
<td align="center">4.57</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B28">Rohanian et&#x20;al. (2020)</xref>, LSTM with gating (Acoustic &#x2b; Lexical &#x2b; Dis)</td>
<td align="center">79.17</td>
<td align="center">4.54</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B41">Yuan et&#x20;al. (2020)</xref>, ERNIE3p</td>
<td align="center">89.58</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B36">Syed et&#x20;al. (2020)</xref>
</td>
<td align="center">85.42</td>
<td align="center">4.30</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B9">Edwards et&#x20;al. (2020)</xref>, phonemes and audio</td>
<td align="center">79.17</td>
<td align="center">&#x2014;</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B20">Meghanani et&#x20;al. (2021)</xref>, CNN-LSTM with MFCC</td>
<td align="center">64.58</td>
<td align="center">6.24</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B20">Meghanani et&#x20;al. (2021)</xref>, pBLSTM-CNN with log-Mel</td>
<td align="center">52.08</td>
<td align="center">5.90</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B20">Meghanani et&#x20;al. (2021)</xref>, ResNet-LSTM with log-Mel</td>
<td align="center">62.50</td>
<td align="center">5.98</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s3-1">
<title>3.1 Acoustic Feature-Based Methods</title>
<p>
<xref ref-type="bibr" rid="B18">Luz et&#x20;al. (2020)</xref> explore several acoustic feature sets, such as the extended Geneva minimalistic acoustic parameter set (eGeMAPS) (<xref ref-type="bibr" rid="B10">Eyben et&#x20;al., 2016</xref>), emobase, ComParE-2013 (<xref ref-type="bibr" rid="B11">Eyben et&#x20;al., 2013</xref>), and the multiresolution cochleagram (MRCG) (<xref ref-type="bibr" rid="B2">Chen et&#x20;al., 2014</xref>), as inputs to traditional machine learning algorithms such as linear discriminant analysis, decision trees, nearest neighbor, random forests, and support vector machines. In our previous work (<xref ref-type="bibr" rid="B20">Meghanani et&#x20;al., 2021</xref>), we used CNN/ResNet &#x2b; long short-term memory (LSTM) networks and pyramidal bidirectional LSTM &#x2b; CNN networks trained on log-Mel spectrogram and Mel-frequency cepstral coefficient (MFCC) features extracted from the spontaneous speech. <xref ref-type="bibr" rid="B26">Pompili et&#x20;al. (2020)</xref> exploit pretrained models to produce i-vector- and x-vector-based acoustic feature embeddings. They evaluate x-vector, i-vector, and statistical speech-based functional features. Rhythmic features are proposed in <xref ref-type="bibr" rid="B1">Campbell et&#x20;al. (2020)</xref>, as lower speaking fluency is a common pattern in patients with AD. <xref ref-type="bibr" rid="B16">Koo et&#x20;al. (2020)</xref> use VGGish (<xref ref-type="bibr" rid="B13">Hershey et&#x20;al., 2017</xref>) trained with Audio Set (<xref ref-type="bibr" rid="B12">Gemmeke et&#x20;al., 2017</xref>) for audio classification. They propose a modified version of the convolutional recurrent neural network (CRNN), in which an attention layer is the first layer of the network and fully connected layers follow the recurrent&#x20;layer.</p>
</sec>
<sec id="s3-2">
<title>3.2 Linguistic Feature-Based Methods</title>
<p>Recently, there have been multiple attempts at the AD detection problem based on text-based features and models. <xref ref-type="bibr" rid="B34">Searle et&#x20;al. (2020)</xref> use traditional machine learning techniques like support vector machines (SVMs), gradient boosting decision trees (GBDT), and conditional random fields (CRFs). They also try deep learning transformer-based models, specifically, bidirectional encoder representations from transformers (BERT) (<xref ref-type="bibr" rid="B8">Devlin et&#x20;al., 2019</xref>), RoBERTa (<xref ref-type="bibr" rid="B17">Liu et&#x20;al., 2019</xref>), and DistilBERT/DistilRoBERTa (<xref ref-type="bibr" rid="B29">Sanh et&#x20;al., 2019</xref>). <xref ref-type="bibr" rid="B26">Pompili et&#x20;al. (2020)</xref> encode each word of the clean transcriptions into a 768-dimensional contextual embedding vector using a frozen, pretrained 12-layer English BERT model. Three different neural models are trained on top of the contextual word embeddings: 1) global maximum pooling, 2) bidirectional long short-term memory (BLSTM)&#x2013;based recurrent neural networks (RNN) provided with an attention module, and 3) the second model augmented with part-of-speech (POS) embeddings. <xref ref-type="bibr" rid="B1">Campbell et&#x20;al. (2020)</xref> use the manual transcripts to extract linguistic information (interventions, vocabulary richness, frequency of verbs, nouns, POS-tagging, etc.) for creating the input features of the classifier. They also use a sequential deep learning-based classifier, which directly classifies the sequence of Global Vectors (GloVe)&#x2013;based word embeddings. <xref ref-type="bibr" rid="B16">Koo et&#x20;al. (2020)</xref> use transformer-based language models (<xref ref-type="bibr" rid="B40">Vaswani et&#x20;al., 2017</xref>), generative pretraining (GPT) (<xref ref-type="bibr" rid="B27">Radford et&#x20;al., 2018</xref>), RoBERTa (<xref ref-type="bibr" rid="B17">Liu et&#x20;al., 2019</xref>), and Transformer-XL (<xref ref-type="bibr" rid="B6">Dai et&#x20;al., 2020</xref>) to get textual features and perform classification and regression tasks using a modified convolutional recurrent neural network-based structure.</p>
<p>Graph-based representations of word features (<xref ref-type="bibr" rid="B39">Tom&#xe1;s and Radev, 2012</xref>; <xref ref-type="bibr" rid="B4">Cong and Liu, 2014</xref>), which have shown promise in classifying texts (<xref ref-type="bibr" rid="B7">De Arruda et&#x20;al., 2016</xref>), have also been employed for the detection of mild cognitive impairments. <xref ref-type="bibr" rid="B30">Santos et&#x20;al. (2017)</xref> model transcripts as complex networks and enrich them with word embeddings to better represent short texts produced in neuropsychological assessments. They use metrics of topological properties of complex networks in a machine learning classification approach to distinguish between healthy subjects and patients with mild cognitive impairments. Such graph-based techniques have also been used in word sense disambiguation (WSD) tasks, which identify the meaning that a word with multiple senses takes in a given context. <xref ref-type="bibr" rid="B5">Corr&#xea;a et&#x20;al. (2018)</xref> suggest that a bipartite network model with local features employed to characterize the context can be useful in improving the semantic characterization of written texts without the use of deep linguistic information.</p>
</sec>
<sec id="s3-3">
<title>3.3 Bimodal Methods</title>
<p>Methods with bimodal input features (both acoustic and linguistic) are also used for AD recognition in various studies (<xref ref-type="bibr" rid="B31">Sarawgi et&#x20;al., 2020a</xref>; <xref ref-type="bibr" rid="B32">Sarawgi et&#x20;al., 2020b</xref>; <xref ref-type="bibr" rid="B1">Campbell et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B16">Koo et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B26">Pompili et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B28">Rohanian et&#x20;al., 2020</xref>). However, in this work, we restrict ourselves to the NLP-based approaches.</p>
</sec>
</sec>
<sec id="s4">
<title>4 Proposed NLP-Based Methods</title>
<sec id="s4-1">
<title>4.1 Data Preparation</title>
<p>In this work, we explore the linguistic features for AD detection and hence only the textual transcripts in the ADReSS dataset are used. The transcripts contain the conversational content between the participant and the investigator. This includes pauses in speech, laughter, and discourse markers such as &#x201c;um&#x201d; and &#x201c;uh.&#x201d; Each transcript is considered as a single data point with its corresponding AD label and MMSE score. We create two transcription-level datasets after preprocessing the transcripts as in <xref ref-type="bibr" rid="B34">Searle et&#x20;al. (2020)</xref>: 1) PAR, containing the utterances of the participant alone, and 2) PAR &#x2b; INV, containing utterances from both the participant and the investigator. In addition to the preprocessing performed in <xref ref-type="bibr" rid="B34">Searle et&#x20;al. (2020)</xref>, we retain the PAR and INV tags in the data, which indicate whether an utterance is spoken by the participant or the investigator.</p>
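<p>The construction of the two transcript-level datasets can be sketched as follows. This is a simplified illustration and not the exact preprocessing pipeline: it assumes that main speaker tiers in CHAT files begin with <monospace>*PAR:</monospace> or <monospace>*INV:</monospace>, skips header and dependent tiers, and ignores continuation lines and CHAT annotation codes. The function name <monospace>build_datasets</monospace> and the sample lines are hypothetical.</p>
<preformat>
```python
def build_datasets(chat_lines):
    """Build the PAR and PAR + INV texts for one transcript.

    Simplifying assumption: every utterance fits on one line that starts
    with "*PAR:" or "*INV:"; all other lines are skipped.
    """
    par_only, par_inv = [], []
    for line in chat_lines:
        for tag in ("PAR", "INV"):
            prefix = "*" + tag + ":"
            if line.startswith(prefix):
                # Keep the speaker tag as a token in front of the utterance.
                utterance = tag + " " + line[len(prefix):].strip()
                par_inv.append(utterance)
                if tag == "PAR":
                    par_only.append(utterance)
    return " ".join(par_only), " ".join(par_inv)


# Hypothetical transcript fragment, not taken from the dataset.
sample = [
    "@Begin",
    "*INV:\twhat do you see ?",
    "*PAR:\tum the boy is on the stool .",
    "%mor:\t(dependent tier, skipped)",
]
par_text, par_inv_text = build_datasets(sample)
```
</preformat>
<p>Each returned string is one data point, to be paired with the AD label and MMSE score of the corresponding subject.</p>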
</sec>
<sec id="s4-2">
<title>4.2 Convolutional Neural Network Model</title>
<p>Language impairments such as difficulties in lexical retrieval, loss of verbal fluency, and breakdown in the comprehension of higher-order written and spoken language are common in AD patients. Hence, linguistic information such as the n-grams present in the input sentence may provide good cues for AD detection. Any <inline-formula id="inf6">
<mml:math id="minf6">
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> CNN filter, where <italic>n</italic> is the number of sequential words looked over by the filter and <italic>d</italic> is the dimension of word embedding, can be viewed as a feature detector looking for a specific n-gram in the input that can capture the language impairments associated with&#x20;AD.</p>
<p>The CNN model, which follows <xref ref-type="bibr" rid="B15">Kim (2014)</xref>, is described below. Let <inline-formula id="inf7">
<mml:math id="minf7">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>R</mml:mi>
<mml:mi>d</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> be a <italic>d</italic>-dimensional word vector corresponding to the <italic>i</italic>th word in the sentence. A sentence of length <italic>L</italic> is represented as <inline-formula id="inf8">
<mml:math id="minf8">
<mml:mrow>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>L</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. Let <inline-formula id="inf9">
<mml:math id="minf9">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> represent the concatenation of the word vectors <inline-formula id="inf10">
<mml:math id="minf10">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. A convolution operation involves a filter <inline-formula id="inf11">
<mml:math id="minf11">
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, which is applied to a window of <italic>n</italic> words to produce a new feature as shown in <xref ref-type="disp-formula" rid="e3">Eq. 3</xref>, where <inline-formula id="inf12">
<mml:math id="minf12">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is generated from a window of words <inline-formula id="inf13">
<mml:math id="minf13">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> by<disp-formula id="e3">
<mml:math id="me3">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mo>&#x22c5;</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:math>
<label>(3)</label>
</disp-formula>
</p>
<p>In <xref ref-type="disp-formula" rid="e3">Eq. 3</xref>, <italic>f</italic> is a nonlinear function and <italic>b</italic> is the bias term. A feature map <inline-formula id="inf14">
<mml:math id="minf14">
<mml:mi mathvariant="script">E</mml:mi>
</mml:math>
</inline-formula> is obtained by applying the filter to each possible window of words in the sentence, i.e., to <inline-formula id="inf15">
<mml:math id="minf15">
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>:</mml:mo>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>.<disp-formula id="e4">
<mml:math id="me4">
<mml:mrow>
<mml:mi mathvariant="script">E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
<p>Max-over-time pooling (<xref ref-type="bibr" rid="B3">Collobert et&#x20;al., 2011</xref>) is applied to the feature map to obtain <inline-formula id="inf16">
<mml:math id="minf16">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mrow>
<mml:mi>max</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>max</mml:mi>
<mml:mi mathvariant="script">E</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> as the feature corresponding to that filter. This feature corresponds to the n-gram that is &#x201c;most relevant&#x201d; to the AD recognition task. The filter weights, which in turn determine the &#x201c;most relevant&#x201d; feature, are learnt using backpropagation. The CNNs use just one convolution layer. The pooling scheme automatically handles variable-length sentences. We use pretrained 100-dimensional GloVe word vectors (<xref ref-type="bibr" rid="B25">Pennington et&#x20;al., 2014</xref>) as word embeddings. Multiple kernels of sizes <inline-formula id="inf17">
<mml:math id="minf17">
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf18">
<mml:math id="minf18">
<mml:mrow>
<mml:mn>3</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf19">
<mml:math id="minf19">
<mml:mrow>
<mml:mn>4</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf20">
<mml:math id="minf20">
<mml:mrow>
<mml:mn>5</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> are employed to capture the bigrams, trigrams, 4-grams, and 5-grams within the text. We use 100 filters for each of the heights 2, 3, 4, and 5. We evaluate configurations with filter sizes [2,3,4], [3,4,5], and [2,3,4,5], which are referred to as CNN-bi&#x2b;tri&#x2b;4 gram, CNN-tri&#x2b;4&#x2b;5 gram, and CNN-bi&#x2b;tri&#x2b;4&#x2b;5 gram in our tables. The outputs of the filters are concatenated to form a single vector. Dropout with probability <inline-formula id="inf21">
<mml:math id="minf21">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> is applied to the concatenated filter output, and the result is passed through a linear layer for the final prediction task. The linear layer weighs the evidence from each of these n-grams and makes the final decision. <xref ref-type="fig" rid="F1">Figure&#x20;1</xref> shows the basic CNN operation over an example sentence.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Demonstration of CNN over text for an example sentence.</p>
</caption>
<graphic xlink:href="fcomp-03-624558-g001.tif"/>
</fig>
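<p>The filtering and pooling steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the toy dimensions, the random inputs, and the choice of ReLU for the nonlinearity <italic>f</italic> are all assumptions made for the example.</p>
<preformat>
```python
import random

def relu(x):
    # Assumed choice for the nonlinearity f; the text only says "nonlinear".
    return max(0.0, x)

def feature_map(z, w, b, n):
    """Slide one filter over every window of n consecutive words.

    z: list of L word vectors (each a list of d floats)
    w: flat filter of length n*d, b: scalar bias
    Returns the list [s_1, ..., s_{L-n+1}].
    """
    L = len(z)
    out = []
    for i in range(L - n + 1):
        # Concatenate the n word vectors of the current window.
        window = [x for vec in z[i:i + n] for x in vec]
        s_i = relu(sum(wk * xk for wk, xk in zip(w, window)) + b)
        out.append(s_i)
    return out

def max_over_time(fmap):
    """Max-over-time pooling: one scalar per filter, whatever the length L."""
    return max(fmap)

random.seed(0)
d, n = 4, 3  # toy sizes for illustration only
sentence = [[random.gauss(0, 1) for _ in range(d)] for _ in range(7)]
w = [random.gauss(0, 1) for _ in range(n * d)]
fmap = feature_map(sentence, w, b=0.1, n=n)
s_max = max_over_time(fmap)
print(len(fmap), s_max)
```
</preformat>
<p>In this sketch, a sentence of 7 words and a filter spanning 3 words yield a feature map of 5 values, and pooling reduces it to a single scalar. A longer sentence gives a longer feature map but still one pooled value per filter, which is how variable-length input is accommodated.</p>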
<sec id="s4-2-1">
<title>4.2.1 Training Details</title>
<p>For the classification task, training is performed for 100 epochs with a batch size of 16. The Adam optimizer is used with a learning rate of 0.001. The model with the lowest validation loss is saved and used for prediction. Since AD classification is a two-class problem, binary cross-entropy with logits is used as the loss function. For the MMSE score prediction task, the output layer is a fully connected layer with a linear activation function. In the regression task, the network is trained for 1,500 epochs with the objective of minimizing the mean squared&#x20;error.</p>
<p>We use bootstrap aggregation of models, known as bagging (<xref ref-type="bibr" rid="B42">Breiman, 1996</xref>), to predict the final labels/MMSE scores for test samples. Bagging is an ensemble technique that improves the stability and accuracy of machine learning models by combining the predictions of multiple models, each fit on a bootstrap resample of the training data. It also reduces variance and helps to avoid overfitting. We fit 21 models and combine their outputs by majority voting for the final classification. In the regression task, the outputs of these bootstrap models are averaged to arrive at the final MMSE&#x20;score.</p>
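<p>The combination step of bagging can be sketched as follows. This is an illustrative sketch only: the helper names, the training-set size of 108, and the per-model outputs are hypothetical values chosen for the example, and the model fitting itself is omitted.</p>
<preformat>
```python
import random
from collections import Counter
from statistics import mean

def bootstrap_indices(num_samples, rng):
    """Draw one bootstrap replicate: sample indices with replacement."""
    return [rng.randrange(num_samples) for _ in range(num_samples)]

def majority_vote(labels):
    """Combine class predictions; an odd number of voters avoids two-class ties."""
    return Counter(labels).most_common(1)[0][0]

def average_score(scores):
    """Combine regression outputs (e.g., MMSE predictions) by averaging."""
    return mean(scores)

rng = random.Random(0)
# One index list per model; each model would be fit on its own replicate.
replicates = [bootstrap_indices(108, rng) for _ in range(21)]

# Hypothetical outputs of 21 bagged models for a single test transcript.
votes = [1] * 12 + [0] * 9
mmse_predictions = [20.0] * 10 + [22.0] * 11

print(majority_vote(votes))                       # prints 1
print(round(average_score(mmse_predictions), 2))  # prints 21.05
```
</preformat>
<p>Using an odd number of models, such as 21, guarantees that a two-class majority vote cannot end in a tie.</p>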
</sec>
</sec>
<sec id="s4-3">
<title>4.3 fastText</title>
<p>fastText-based classifiers compute the n-grams of an input sentence explicitly and append them to the end of the sentence. In this work, we use bigrams and trigrams. We also experimented with 4-grams, but the results did not improve over those obtained with trigrams. This bag of bigrams and trigrams acts as a set of additional features that capture some information about the local word&#x20;order.</p>
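<p>The n-gram appending step can be illustrated with a short sketch. The function name and the example sentence are hypothetical, and joining the words of an n-gram with a space is an assumption about how the combined tokens are represented.</p>
<preformat>
```python
def append_ngrams(tokens, orders=(2, 3)):
    """Return the tokens followed by their n-grams, each joined into one token."""
    extended = list(tokens)
    for n in orders:
        for i in range(len(tokens) - n + 1):
            extended.append(" ".join(tokens[i:i + n]))
    return extended

tokens = "the boy takes a cookie".split()  # hypothetical example sentence
features = append_ngrams(tokens)
print(len(features))  # 5 words + 4 bigrams + 3 trigrams = 12
print(features[5:])
```
</preformat>
<p>Each appended n-gram is then treated as one more token of the input, so it receives its own embedding alongside the single words.</p>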
<p>
<xref ref-type="fig" rid="F2">Figure&#x20;2</xref> shows the architecture of fastText model. The fastText model has two layers, an embedding layer and a linear layer. The embedding layer calculates the word embedding (100-dimensional) for each word. The average of all these word embeddings is calculated and fed through the linear layer for final prediction as described in <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>. fastText models are faster for training and evaluation by many orders of magnitude, compared to the &#x201c;deep&#x201d; models. As mentioned in the work (<xref ref-type="bibr" rid="B14">Joulin et&#x20;al., 2017</xref>), fastText can be trained on more than one billion words in less than 10&#xa0;min using a standard multicore CPU and classify half a million sentences among 312&#xa0;K classes in less than a minute.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>fastText model (<xref ref-type="bibr" rid="B14">Joulin et&#x20;al., 2017</xref>) with appended n-gram features <inline-formula id="inf22">
<mml:math id="minf22">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>K</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> as&#x20;input.</p>
</caption>
<graphic xlink:href="fcomp-03-624558-g002.tif"/>
</fig>
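<p>The forward pass of such a two-layer model can be sketched as below. This is a simplified illustration, not the implementation used in this work: the vocabulary size, the random weights, the token ids, and the sigmoid applied to the single output are assumptions made for the example.</p>
<preformat>
```python
import math
import random

def fasttext_logit(token_ids, embedding, weights, bias):
    """Average the token embeddings, then apply one linear layer (one output)."""
    dim = len(weights)
    avg = [0.0] * dim
    for t in token_ids:
        for k in range(dim):
            avg[k] += embedding[t][k] / len(token_ids)
    return sum(wk * ak for wk, ak in zip(weights, avg)) + bias

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
vocab_size, dim = 50, 100  # hypothetical vocabulary; 100-dimensional embeddings
embedding = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
weights = [random.gauss(0, 0.1) for _ in range(dim)]
token_ids = [3, 17, 17, 42, 8]  # hypothetical ids of words and appended n-grams
logit = fasttext_logit(token_ids, embedding, weights, bias=0.0)
prob_ad = sigmoid(logit)
print(prob_ad)
```
</preformat>
<p>Because the embeddings are averaged, reordering the tokens leaves the output unchanged. This is why the appended n-grams are needed to retain some information about local word order.</p>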
<sec id="s4-3-1">
<title>4.3.1 Training Details</title>
<p>All training details are the same as in <xref ref-type="sec" rid="s4-2-1">Section 4.2.1</xref>, except that dropout is not used in this model. Here too, we use 21 bootstrap models and combine their outputs as described in <xref ref-type="sec" rid="s4-2-1">Section&#x20;4.2.1</xref>.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5 Results</title>
<p>We performed 5-fold cross-validation to estimate the generalization error. One of the folds has 20 validation samples and the remaining four have 22 validation samples each. The cross-validation results for the CNN and fastText models trained on the PAR and PAR &#x2b; INV sets are listed in <xref ref-type="table" rid="T2">Table&#x20;2</xref>. The best-performing classification model during cross-validation was fastText with bigrams on the PAR &#x2b; INV set, which yields an average cross-validation accuracy of 86.09%. Among the CNN models, tri&#x2b;4&#x2b;5 grams give the best accuracy on both the PAR (77.54%) and PAR &#x2b; INV (81.27%) sets. In terms of accuracy, both the CNN and fastText models seem to benefit from the inclusion of the investigator&#x2019;s utterances. For the prediction of the MMSE score, the CNN with bi&#x2b;tri&#x2b;4&#x2b;5 grams on the PAR &#x2b; INV set (RMSE of 4.38) was the best. The fastText models gain a clear advantage in RMSE from the addition of the investigator&#x2019;s utterances (5.43 to 4.66 with bigrams, and 5.40 to 4.81 with bigrams and trigrams). However, no such large difference in RMSE is observed between the CNN models trained on the PAR and PAR &#x2b; INV sets. The cross-validation results support our belief that the n-grams from the transcriptions of the picture description task could be useful in the detection of&#x20;AD.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Average 5-fold cross-validation accuracy (AD classification) and RMSE (MMSE score prediction).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Dataset</th>
<th align="center">Model</th>
<th align="center">Accuracy (%)</th>
<th align="center">RMSE</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">PAR</td>
<td align="left">CNN, bi&#x2b;tri&#x2b;4 gram</td>
<td align="center">73.91</td>
<td align="center">4.55</td>
</tr>
<tr>
<td align="left">PAR</td>
<td align="left">CNN, tri&#x2b;4&#x2b;5 gram</td>
<td align="center">77.54</td>
<td align="center">4.41</td>
</tr>
<tr>
<td align="left">PAR</td>
<td align="left">CNN, bi&#x2b;tri&#x2b;4&#x2b;5 gram</td>
<td align="center">76.54</td>
<td align="center">4.65</td>
</tr>
<tr>
<td align="left">PAR</td>
<td align="left">fastText, bigram</td>
<td align="center">80.54</td>
<td align="center">5.43</td>
</tr>
<tr>
<td align="left">PAR</td>
<td align="left">fastText, bi &#x2b; trigram</td>
<td align="center">82.36</td>
<td align="center">5.40</td>
</tr>
<tr>
<td align="left">PAR &#x2b; INV</td>
<td align="left">CNN, bi&#x2b;tri&#x2b;4 gram</td>
<td align="center">80.18</td>
<td align="center">4.63</td>
</tr>
<tr>
<td align="left">PAR &#x2b; INV</td>
<td align="left">CNN, tri&#x2b;4&#x2b;5 gram</td>
<td align="center">81.27</td>
<td align="center">4.53</td>
</tr>
<tr>
<td align="left">PAR &#x2b; INV</td>
<td align="left">CNN, bi&#x2b;tri&#x2b;4&#x2b;5 gram</td>
<td align="center">80.36</td>
<td align="center">4.38</td>
</tr>
<tr>
<td align="left">PAR &#x2b; INV</td>
<td align="left">fastText, bigram</td>
<td align="center">86.09</td>
<td align="center">4.66</td>
</tr>
<tr>
<td align="left">PAR &#x2b; INV</td>
<td align="left">fastText, bi &#x2b; trigram</td>
<td align="center">85.90</td>
<td align="center">4.81</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>
<xref ref-type="table" rid="T3">Table&#x20;3</xref> lists the classification accuracy and RMSE in the prediction of MMSE score on the test set of the ADReSS corpus. The table also lists the precision, recall, and <inline-formula id="inf23">
<mml:math id="minf23">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mtext>score</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> for each class. They are computed as precision <inline-formula id="inf24">
<mml:math id="minf24">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, recall <inline-formula id="inf25">
<mml:math id="minf25">
<mml:mrow>
<mml:mi>&#x3c1;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf26">
<mml:math id="minf26">
<mml:mrow>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mtext>score</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>&#x3c0;</mml:mi>
<mml:mi>&#x3c1;</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3c1;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf27">
<mml:math id="minf27">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf28">
<mml:math id="minf28">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf29">
<mml:math id="minf29">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf30">
<mml:math id="minf30">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> are the numbers of true positives, false positives, true negatives, and false negatives, respectively. The listed results are obtained after bagging with 21 bootstrap models. The best classification accuracy is 83.33%, which is achieved using the fastText model with appended bigrams and trigrams. With the fastText model, the accuracies are identical on the PAR and PAR &#x2b; INV sets. The maximum accuracy obtained with the CNN models is 79.16%, which is achieved on the PAR &#x2b; INV set using bi&#x2b;tri&#x2b;4 grams or tri&#x2b;4&#x2b;5 grams. In the detection task, the CNN models seem to benefit from the addition of the investigator&#x2019;s utterances. Also, the accuracies seem to degrade when bigrams, trigrams, 4-grams, and 5-grams are considered together. This behavior is consistent across the PAR and PAR &#x2b; INV sets. The best RMSE in the prediction of the MMSE score is 4.28, which is obtained on the PAR &#x2b; INV set using the fastText model employing only bigrams. In the regression task using fastText, bigrams alone achieve a slightly better RMSE than bigrams and trigrams together. In this task, the fastText models also seem to benefit from the use of the investigator&#x2019;s utterances. In contrast, the CNN models do not seem to gain any specific advantage from the inclusion of the investigator&#x2019;s utterances: their regression performance remains almost the same across the use of bi&#x2b;tri&#x2b;4, tri&#x2b;4&#x2b;5, and bi&#x2b;tri&#x2b;4&#x2b;5&#x20;grams.</p>
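<p>As a worked illustration of these definitions, the following sketch computes the three metrics from confusion-matrix counts. The function name is hypothetical, and the counts are made up for the example rather than taken from our confusion matrices.</p>
<preformat>
```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from confusion-matrix counts.

    Assumes tp + fp, tp + fn, and precision + recall are all nonzero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one class, chosen only for illustration.
p, r, f1 = precision_recall_f1(tp=18, fp=3, fn=6)
print(round(p, 2), round(r, 2), round(f1, 2))  # prints 0.86 0.75 0.8
```
</preformat>
<p>Note that the number of true negatives does not enter any of the three per-class metrics; it affects only the overall accuracy.</p>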
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Results on the ADReSS test set. Bold values indicate the best results obtained by our models.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Dataset</th>
<th align="center">Model</th>
<th align="center">Class</th>
<th align="center">Precision</th>
<th align="center">Recall</th>
<th align="center">F1 score</th>
<th align="center">Accuracy (%)</th>
<th align="center">RMSE</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="2" align="left">PAR</td>
<td rowspan="2" align="left">CNN, bi&#x2b;tri&#x2b;4 gram</td>
<td align="left">Non-AD</td>
<td align="center">0.74</td>
<td align="center">0.71</td>
<td align="center">0.72</td>
<td rowspan="2" align="center">72.91</td>
<td rowspan="2" align="center">4.38</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.72</td>
<td align="center">0.75</td>
<td align="center">0.73</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR</td>
<td rowspan="2" align="left">CNN, tri&#x2b;4&#x2b;5 gram</td>
<td align="left">Non-AD</td>
<td align="center">0.76</td>
<td align="center">0.67</td>
<td align="center">0.71</td>
<td rowspan="2" align="center">72.91</td>
<td rowspan="2" align="center">4.46</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.70</td>
<td align="center">0.79</td>
<td align="center">0.75</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR</td>
<td rowspan="2" align="left">CNN, bi&#x2b;tri&#x2b;4&#x2b;5 gram</td>
<td align="left">Non-AD</td>
<td align="center">0.71</td>
<td align="center">0.71</td>
<td align="center">0.71</td>
<td rowspan="2" align="center">70.83</td>
<td rowspan="2" align="center">4.42</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.71</td>
<td align="center">0.71</td>
<td align="center">0.71</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR</td>
<td rowspan="2" align="left">fastText, bigram</td>
<td align="left">Non-AD</td>
<td align="center">0.78</td>
<td align="center">0.88</td>
<td align="center">0.82</td>
<td rowspan="2" align="center">81.25</td>
<td rowspan="2" align="center">4.51</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.86</td>
<td align="center">0.75</td>
<td align="center">0.80</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR</td>
<td rowspan="2" align="left">fastText, bi &#x2b; trigram</td>
<td align="left">Non-AD</td>
<td align="center">0.81</td>
<td align="center">0.88</td>
<td align="center">0.84</td>
<td rowspan="2" align="center">
<bold>83.33</bold>
</td>
<td rowspan="2" align="center">4.87</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.86</td>
<td align="center">0.79</td>
<td align="center">0.83</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR &#x2b; INV</td>
<td rowspan="2" align="left">CNN, bi&#x2b;tri&#x2b;4 gram</td>
<td align="left">Non-AD</td>
<td align="center">0.77</td>
<td align="center">0.83</td>
<td align="center">0.80</td>
<td rowspan="2" align="center">79.16</td>
<td rowspan="2" align="center">4.48</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.82</td>
<td align="center">0.75</td>
<td align="center">0.78</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR &#x2b; INV</td>
<td rowspan="2" align="left">CNN, tri&#x2b;4&#x2b;5 gram</td>
<td align="left">Non-AD</td>
<td align="center">0.77</td>
<td align="center">0.83</td>
<td align="center">0.80</td>
<td rowspan="2" align="center">79.16</td>
<td rowspan="2" align="center">4.47</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.82</td>
<td align="center">0.75</td>
<td align="center">0.78</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR &#x2b; INV</td>
<td rowspan="2" align="left">CNN, bi&#x2b;tri&#x2b;4&#x2b;5 gram</td>
<td align="left">Non-AD</td>
<td align="center">0.74</td>
<td align="center">0.71</td>
<td align="center">0.72</td>
<td rowspan="2" align="center">72.91</td>
<td rowspan="2" align="center">4.44</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.72</td>
<td align="center">0.75</td>
<td align="center">0.73</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR &#x2b; INV</td>
<td rowspan="2" align="left">fastText, bigram</td>
<td align="left">Non-AD</td>
<td align="center">0.78</td>
<td align="center">0.88</td>
<td align="center">0.82</td>
<td rowspan="2" align="center">81.25</td>
<td rowspan="2" align="center">
<bold>4.28</bold>
</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.86</td>
<td align="center">0.75</td>
<td align="center">0.80</td>
</tr>
<tr>
<td rowspan="2" align="left">PAR &#x2b; INV</td>
<td rowspan="2" align="left">fastText, bi &#x2b; trigram</td>
<td align="left">Non-AD</td>
<td align="center">0.79</td>
<td align="center">0.92</td>
<td align="center">0.85</td>
<td rowspan="2" align="center">
<bold>83.33</bold>
</td>
<td rowspan="2" align="center">4.47</td>
</tr>
<tr>
<td align="left">AD</td>
<td align="center">0.90</td>
<td align="center">0.75</td>
<td align="center">0.82</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s6">
<title>6 Discussion and Conclusion</title>
<p>In this work, we explore two models, a CNN with a single convolution layer and fastText, to address the problems of AD classification and MMSE score prediction from the transcriptions of the picture description task. The choice of these models was based on our initial belief that modeling the transcriptions of the narrative speech in the picture description task using n-grams could give some indication of AD status. The chosen models are also shallow: they have far fewer parameters than typical deep learning architectures and hence can be trained and evaluated quickly. Yet, the performance of these models is competitive with the baseline results reported with complex models (refer to <xref ref-type="table" rid="T1">Table&#x20;1</xref>). The results suggest that n-gram-based features are worth pursuing for the task of AD detection.</p>
<p>Among the considered models, the fastText model with bigrams and trigrams appended to the input achieves the best classification accuracy (83.33%). In the regression task, the best result (RMSE of 4.28) is achieved using the fastText model with only bigrams appended to the input. The fastText models have a clear edge over the CNNs in the classification task. Empirical evidence suggests that the fastText models benefit from the inclusion of the investigator&#x2019;s utterances in the regression task, though these utterances do not make much difference in the classification task. The CNN models, on the other hand, perform better on the PAR &#x2b; INV set in the classification task, while in the regression task their performance is similar across the PAR and PAR &#x2b; INV sets. In fastText, bigrams have an edge over bigrams plus trigrams for the prediction of the MMSE score. However, the performance of the CNN models remains almost the same across the use of bi&#x2b;tri&#x2b;4, tri&#x2b;4&#x2b;5, and bi&#x2b;tri&#x2b;4&#x2b;5 grams in the regression&#x20;task.</p>
</sec>
</body>
<back>
<sec id="s7">
<title>Data Availability Statement</title>
<p>The data analyzed in this study are subject to the following licenses/restrictions: In order to gain access to the ADReSS data, you will need to become a member of DementiaBank (free of charge) by contacting Brian MacWhinney at <email>macw@cmu.edu</email>. You should include your contact information and affiliation, as well as a general statement on how you plan to use the data, with specific mention of the ADReSS challenge. If you are a student, please ask your supervisor to join as a member as well. This membership will give you full access to the DementiaBank database, where the ADReSS dataset will be available and clearly identified. For further information, visit DementiaBank. Requests to access these datasets should be directed to Brian MacWhinney, <email>macw@cmu.edu</email>.</p>
</sec>
<sec id="s8">
<title>Author Contributions</title>
<p>AM, AS, and AR contributed to the conception and design of the study. AM and AS wrote the first draft of the manuscript. AR reviewed the first draft and suggested improvements. AM and AS wrote sections of the manuscript. All authors contributed to manuscript revision and read and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Breiman</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>1996</year>). <article-title>Bagging predictors</article-title>. <source>Mach. Learn.</source> <volume>24</volume>, <fpage>123</fpage>&#x2013;<lpage>140</lpage>. <pub-id pub-id-type="doi">10.1007/BF00058655</pub-id>
</citation>
</ref>
<ref id="B1">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Campbell</surname>
<given-names>E. L.</given-names>
</name>
<name>
<surname>Doc&#xed;o-Fern&#xe1;ndez</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Raboso</surname>
<given-names>J.&#x20;J.</given-names>
</name>
<name>
<surname>Garc&#xed;a-Mateo</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Alzheimer&#x2019;s dementia detection from audio and text modalities</article-title>. <comment>arXiv preprint <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2008.04617">arXiv:2008.04617</ext-link>
</comment> </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>A feature study for classification-based speech separation at low signal-to-noise ratios</article-title>. <source>IEEE/ACM Trans. Audio Speech Lang. Process.</source> <volume>22</volume>, <fpage>1993</fpage>&#x2013;<lpage>2002</lpage>. <pub-id pub-id-type="doi">10.1109/TASLP.2014.2359159</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Collobert</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Weston</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bottou</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Karlen</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kavukcuoglu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Kuksa</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Natural language processing (almost) from scratch</article-title>. <source>J.&#x20;Machine Learn. Res.</source> <volume>12</volume>, <fpage>2493</fpage>&#x2013;<lpage>2537</lpage>. <pub-id pub-id-type="doi">10.5555/1953048.2078186</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cong</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Approaching human language with complex networks</article-title>. <source>Phys. Life Rev.</source> <volume>11</volume>, <fpage>598</fpage>&#x2013;<lpage>618</lpage>. <pub-id pub-id-type="doi">10.1016/j.plrev.2014.04.004</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Corr&#xea;a</surname>
<given-names>E. A.</given-names>
</name>
<name>
<surname>Lopes</surname>
<given-names>A. A.</given-names>
</name>
<name>
<surname>Amancio</surname>
<given-names>D. R.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Word sense disambiguation</article-title>. <source>Inf. Sci.</source> <volume>442</volume>, <fpage>103</fpage>&#x2013;<lpage>113</lpage>. <pub-id pub-id-type="doi">10.1016/j.ins.2018.02.047</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dai</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Carbonell</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Le</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Salakhutdinov</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Transformer-XL: attentive language models beyond a fixed-length context</article-title>,&#x201d; in <conf-name>Proceedings of the 57th annual meeting of the association for computational linguistics</conf-name>, <conf-loc>Florence, Italy</conf-loc>, <conf-date>July 2019</conf-date>, <fpage>2978</fpage>&#x2013;<lpage>2988</lpage>. <pub-id pub-id-type="doi">10.18653/v1/P19-1285</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>De Arruda</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Costa</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Amancio</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Using complex networks for text classification: discriminating informative and imaginative documents</article-title>. <source>EPL</source> <volume>113</volume>, <fpage>28007</fpage>. <pub-id pub-id-type="doi">10.1209/0295-5075/113/28007</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Devlin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>M.-W.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Toutanova</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>,&#x201d; in <conf-name>Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: human language technologies</conf-name>, <conf-loc>Minneapolis, MN</conf-loc>, <conf-date>June 2&#x2013;7, 2019</conf-date>, <volume>Vol. 1</volume>, <fpage>4171</fpage>&#x2013;<lpage>4186</lpage>. <pub-id pub-id-type="doi">10.18653/v1/N19-1423</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edwards</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Dognin</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Bollepalli</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Multiscale system for Alzheimer&#x2019;s dementia recognition through spontaneous speech</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2197</fpage>&#x2013;<lpage>2201</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-2781</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eyben</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Scherer</surname>
<given-names>K. R.</given-names>
</name>
<name>
<surname>Schuller</surname>
<given-names>B. W.</given-names>
</name>
<name>
<surname>Sundberg</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Andr&#xe9;</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Busso</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing</article-title>. <source>IEEE Trans. Affective Comput.</source> <volume>7</volume>, <fpage>190</fpage>&#x2013;<lpage>202</lpage>. <pub-id pub-id-type="doi">10.1109/taffc.2015.2457417</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eyben</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Weninger</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Gross</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Schuller</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2013</year>). &#x201c;<article-title>Recent developments in openSMILE, the Munich open-source multimedia feature extractor</article-title>,&#x201d; in <conf-name>Proceedings of the 2013 ACM multimedia conference</conf-name>, <conf-loc>Barcelona, Spain</conf-loc>, <conf-date>October 2013</conf-date>, <fpage>835</fpage>&#x2013;<lpage>838</lpage>. <pub-id pub-id-type="doi">10.1145/2502081.2502224</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gemmeke</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ellis</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Freedman</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Jansen</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lawrence</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Moore</surname>
<given-names>R.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>Audio set: an ontology and human-labeled dataset for audio events</article-title>,&#x201d; in <conf-name>2017 IEEE international conference on acoustics, speech and signal processing (ICASSP)</conf-name>, <conf-loc>New Orleans, LA</conf-loc>, <conf-date>March 5&#x2013;9, 2017</conf-date>, <fpage>776</fpage>&#x2013;<lpage>780</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP.2017.7952261</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hershey</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chaudhuri</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ellis</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Gemmeke</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jansen</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Moore</surname>
<given-names>R. C.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>CNN architectures for large-scale audio classification</article-title>,&#x201d; in <conf-name>2017 IEEE international conference on acoustics, speech and signal processing (ICASSP)</conf-name>, <conf-loc>New Orleans, LA</conf-loc>, <conf-date>March 5&#x2013;9, 2017</conf-date>, <fpage>131</fpage>&#x2013;<lpage>135</lpage>. </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Joulin</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Grave</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Bojanowski</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Mikolov</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Bag of tricks for efficient text classification</article-title>,&#x201d; in <conf-name>Proceedings of the 15th conference of the european chapter of the association for computational linguistics</conf-name>, <conf-loc>Valencia, Spain</conf-loc>, <conf-date>April 3&#x2013;7, 2017</conf-date>, <volume>Vol. 2</volume>, <fpage>427</fpage>&#x2013;<lpage>431</lpage>. </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Convolutional neural networks for sentence classification</article-title>,&#x201d; in <conf-name>Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</conf-name>, <conf-loc>Doha, Qatar</conf-loc>, <conf-date>October 25&#x2013;29, 2014</conf-date>, <fpage>1746</fpage>&#x2013;<lpage>1751</lpage>. <pub-id pub-id-type="doi">10.3115/v1/D14-1181</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koo</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.&#x20;H.</given-names>
</name>
<name>
<surname>Pyo</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jo</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Exploiting multi-modal features from pre-trained networks for Alzheimer&#x2019;s dementia recognition</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2217</fpage>&#x2013;<lpage>2221</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-3153</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ott</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Goyal</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Joshi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>RoBERTa: a robustly optimized BERT pretraining approach</article-title>. <comment>ArXiv <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1907.11692">abs/1907.11692</ext-link>
</comment> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Haider</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>de la Fuente</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Fromm</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>MacWhinney</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Alzheimer&#x2019;s dementia recognition through spontaneous speech: the ADReSS challenge</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2172</fpage>&#x2013;<lpage>2176</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-2571</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>MacWhinney</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>The CHILDES project part 1</article-title>,&#x201d; in <source>The CHAT transcription format</source>. <pub-id pub-id-type="doi">10.1184/R1/6618440.v1</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meghanani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Anoop</surname>
<given-names>C. S.</given-names>
</name>
<name>
<surname>Ramakrishnan</surname>
<given-names>A. G.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>An exploration of log-mel spectrogram and MFCC features for Alzheimer&#x2019;s dementia recognition from spontaneous speech</article-title>,&#x201d; in <conf-name>The 8th IEEE spoken language technology workshop (SLT)</conf-name>, <conf-loc>Shenzhen, China</conf-loc>, <conf-date>January 19&#x2013;22, 2021</conf-date>. </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mueller</surname>
<given-names>K. D.</given-names>
</name>
<name>
<surname>Hermann</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Mecollari</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Turkstra</surname>
<given-names>L. S.</given-names>
</name>
</person-group> (<year>2018a</year>). <article-title>Connected speech and language in mild cognitive impairment and Alzheimer&#x2019;s disease: a review of picture description tasks</article-title>. <source>J.&#x20;Clin. Exp. Neuropsychol.</source> <volume>40</volume>, <fpage>917</fpage>&#x2013;<lpage>939</lpage>. <pub-id pub-id-type="doi">10.1080/13803395.2018.1446513</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mueller</surname>
<given-names>K. D.</given-names>
</name>
<name>
<surname>Koscik</surname>
<given-names>R. L.</given-names>
</name>
<name>
<surname>Hermann</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>S. C.</given-names>
</name>
<name>
<surname>Turkstra</surname>
<given-names>L. S.</given-names>
</name>
</person-group> (<year>2018b</year>). <article-title>Declines in connected language are associated with very early mild cognitive impairment: results from the Wisconsin registry for Alzheimer&#x2019;s prevention</article-title>. <source>Front. Aging Neurosci.</source> <volume>9</volume>, <fpage>437</fpage>. <pub-id pub-id-type="doi">10.3389/fnagi.2017.00437</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nicholas</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Obler</surname>
<given-names>L. K.</given-names>
</name>
<name>
<surname>Albert</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Helm-Estabrooks</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>1985</year>). <article-title>Empty speech in Alzheimer&#x2019;s disease and fluent aphasia</article-title>. <source>J.&#x20;Speech Hear. Res.</source> <volume>28</volume>, <fpage>405</fpage>&#x2013;<lpage>410</lpage>. <pub-id pub-id-type="doi">10.1044/jshr.2803.405</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pappagari</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Cho</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Moro-Vel&#xe1;zquez</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Dehak</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Using state of the art speaker recognition and natural language processing technologies to detect Alzheimer&#x2019;s disease and assess its severity</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2177</fpage>&#x2013;<lpage>2181</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-2587</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pennington</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Socher</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Manning</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>GloVe: global vectors for word representation</article-title>,&#x201d; in <conf-name>Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</conf-name>, <conf-loc>Doha, Qatar</conf-loc>, <conf-date>October 25&#x2013;29, 2014</conf-date>, <fpage>1532</fpage>&#x2013;<lpage>1543</lpage>. <pub-id pub-id-type="doi">10.3115/v1/d14-1162</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pompili</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Rolland</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Abad</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>The INESC-ID multi-modal system for the ADReSS 2020 challenge</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2202</fpage>&#x2013;<lpage>2206</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-2833</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Radford</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Narasimhan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Salimans</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Sutskever</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Improving language understanding by generative pre-training</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf">https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf</ext-link>
</comment>. <pub-id pub-id-type="doi">10.1017/9781108552202</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rohanian</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hough</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Purver</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Multi-Modal fusion with gating using audio, lexical and disfluency features for Alzheimer&#x2019;s dementia recognition from spontaneous speech</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2187</fpage>&#x2013;<lpage>2191</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-2721</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Sanh</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Debut</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Chaumond</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wolf</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>. <comment>ArXiv <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1910.01108">abs/1910.01108</ext-link>
</comment> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Santos</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Corr&#xea;a J&#xfa;nior</surname>
<given-names>E. A.</given-names>
</name>
<name>
<surname>Oliveira</surname>
<given-names>O.</given-names>
<suffix>Jr.</suffix>
</name>
<name>
<surname>Amancio</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Mansur</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Alu&#xed;sio</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Enriching complex networks with word embeddings for detecting mild cognitive impairment from speech transcripts</article-title>,&#x201d; in <conf-name>Proceedings of the 55th annual meeting of the association for computational linguistics</conf-name>, <conf-loc>Vancouver, BC</conf-loc>, <conf-date>July 30&#x2013;August 4, 2017</conf-date>, <volume>Vol. 1</volume>, <fpage>1284</fpage>&#x2013;<lpage>1296</lpage>. <pub-id pub-id-type="doi">10.18653/v1/P17-1118</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Sarawgi</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Zulfikar</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Khincha</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Maes</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020a</year>). <article-title>Uncertainty-aware multi-modal ensembling for severity prediction of Alzheimer&#x2019;s dementia</article-title>. <comment>ArXiv <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2010.01440">abs/2010.01440</ext-link>
</comment>. <pub-id pub-id-type="doi">10.21437/interspeech.2020-3137</pub-id> </citation>
</ref>
<ref id="B32">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Sarawgi</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Zulfikar</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Soliman</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Maes</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020b</year>). <article-title>Multimodal inductive transfer learning for detection of Alzheimer&#x2019;s dementia and its severity</article-title>. <comment>ArXiv <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2009.00700">abs/2009.00700</ext-link>
</comment>. <pub-id pub-id-type="doi">10.21437/interspeech.2020-3137</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Savundranayagam</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hummert</surname>
<given-names>M. L.</given-names>
</name>
<name>
<surname>Montgomery</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Investigating the effects of communication problems on caregiver burden</article-title>. <source>J.&#x20;Gerontol. B Psychol. Sci. Soc. Sci.</source> <volume>60</volume> (<issue>1</issue>), <fpage>S48</fpage>&#x2013;<lpage>S55</lpage>. <pub-id pub-id-type="doi">10.1093/geronb/60.1.s48</pub-id> </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Searle</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ibrahim</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Dobson</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Comparing natural language processing techniques for Alzheimer&#x2019;s dementia prediction in spontaneous speech</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2192</fpage>&#x2013;<lpage>2196</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-2729</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Mesnil</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Learning semantic representations using convolutional neural networks for web search</article-title>,&#x201d; in <conf-name>WWW 2014</conf-name>, <conf-loc>Seoul, South Korea</conf-loc>, <conf-date>April 7&#x2013;11, 2014</conf-date>, <fpage>373</fpage>&#x2013;<lpage>374</lpage>. <pub-id pub-id-type="doi">10.1145/2567948.2577348</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Syed</surname>
<given-names>M. S. S.</given-names>
</name>
<name>
<surname>Syed</surname>
<given-names>Z. S.</given-names>
</name>
<name>
<surname>Lech</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Pirogova</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Automated screening for Alzheimer&#x2019;s dementia through spontaneous speech</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2222</fpage>&#x2013;<lpage>2226</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-3158</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Szatl&#xf3;czki</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Hoffmann</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Vincze</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>K&#xe1;lm&#xe1;n</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>P&#xe1;k&#xe1;ski</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Speaking in Alzheimer&#x2019;s disease, is that an early sign? Importance of changes in language abilities in Alzheimer&#x2019;s disease</article-title>. <source>Front. Aging Neurosci.</source> <volume>7</volume>, <fpage>195</fpage>. <pub-id pub-id-type="doi">10.3389/fnagi.2015.00195</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yih</surname>
<given-names>W.-T.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Meek</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Semantic parsing for single-relation question answering</article-title>,&#x201d; in <conf-name>Proceedings of the 52nd annual meeting of the association for computational linguistics</conf-name>, <conf-loc>Baltimore, MA</conf-loc>, <conf-date>June 2014</conf-date>, <volume>Vol. 2</volume>, <fpage>643</fpage>&#x2013;<lpage>648</lpage>. <pub-id pub-id-type="doi">10.3115/v1/P14-2105</pub-id> </citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tom&#xe1;s</surname>
<given-names>D. R. M.</given-names>
</name>
<name>
<surname>Radev</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Graph-based natural language processing and information retrieval</article-title>. <source>Machine Translation</source> <volume>26</volume>, <fpage>277</fpage>&#x2013;<lpage>280</lpage>. <pub-id pub-id-type="doi">10.1007/s10590-011-9122-9</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vaswani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shazeer</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Parmar</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Uszkoreit</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Gomez</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>Attention is all you need</article-title>,&#x201d; in <conf-name>Proceedings of the 31st international conference on neural information processing systems</conf-name>, <conf-loc>Long Beach, CA</conf-loc>, <conf-date>December 2017</conf-date>, <fpage>5999</fpage>&#x2013;<lpage>6009</lpage>. </citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yuan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bian</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Church</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer&#x2019;s disease</article-title>,&#x201d; in <conf-name>Proceedings of interspeech 2020</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>2162</fpage>&#x2013;<lpage>2166</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2020-2516</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>
