<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurosci.</journal-id>
<journal-title>Frontiers in Neuroscience</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurosci.</abbrev-journal-title>
<issn pub-type="epub">1662-453X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnins.2022.1107284</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Integrating audio and visual modalities for multimodal personality trait recognition <italic>via</italic> hybrid deep learning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhao</surname> <given-names>Xiaoming</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1779094/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liao</surname> <given-names>Yuehui</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Tang</surname> <given-names>Zhiwei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Xu</surname> <given-names>Yicheng</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Tao</surname> <given-names>Xin</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1459747/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Dandan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1804608/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Wang</surname> <given-names>Guoyu</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Lu</surname> <given-names>Hongsheng</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1740437/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Taizhou Central Hospital (Taizhou University Hospital), Taizhou University</institution>, <addr-line>Taizhou, Zhejiang</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Computer Science, Hangzhou Dianzi University</institution>, <addr-line>Hangzhou</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>School of Information Technology Engineering, Taizhou Vocational and Technical College</institution>, <addr-line>Taizhou, Zhejiang</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Xiaopeng Hong, Harbin Institute of Technology, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Lang He, Xi&#x2019;an University of Posts and Telecommunications, China; Yong Li, Nanjing University of Science and Technology, China</p></fn>
<corresp id="c001">&#x002A;Correspondence: Hongsheng Lu, <email>luhs@tzc.edu.cn</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Perception Science, a section of the journal Frontiers in Neuroscience</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>06</day>
<month>01</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>16</volume>
<elocation-id>1107284</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>11</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>13</day>
<month>12</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2023 Zhao, Liao, Tang, Xu, Tao, Wang, Wang and Lu.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Zhao, Liao, Tang, Xu, Tao, Wang, Wang and Lu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Recently, personality trait recognition, which aims to analyze people&#x2019;s psychological characteristics from their first impression behavior data, has been an interesting and active topic in psychology, affective neuroscience and artificial intelligence. To effectively take advantage of spatio-temporal cues in audio-visual modalities, this paper proposes a new method of multimodal personality trait recognition integrating audio-visual modalities based on a hybrid deep learning framework, which comprises convolutional neural networks (CNN), a bi-directional long short-term memory network (Bi-LSTM), and the Transformer network. In particular, a pre-trained deep audio CNN model is used to learn high-level segment-level audio features. A pre-trained deep face CNN model is leveraged to separately learn high-level frame-level global scene features and local face features from each frame in dynamic video sequences. Then, these extracted deep audio-visual features are fed into a Bi-LSTM and a Transformer network to individually capture long-term temporal dependency, thereby producing the final global audio and visual features for downstream tasks. Finally, a linear regression method is employed to conduct the single audio-based and visual-based personality trait recognition tasks, followed by a decision-level fusion strategy used for producing the final Big-Five personality scores and interview scores. Experimental results on the public ChaLearn First Impression-V2 personality dataset show the effectiveness of our method, which outperforms the other compared methods.</p>
</abstract>
<kwd-group>
<kwd>multimodal personality trait recognition</kwd>
<kwd>hybrid deep learning</kwd>
<kwd>convolutional neural networks</kwd>
<kwd>bi-directional long short-term memory network</kwd>
<kwd>Transformer</kwd>
<kwd>spatiotemporal</kwd>
</kwd-group>
<contract-sponsor id="cn001">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor>
<counts>
<fig-count count="2"/>
<table-count count="5"/>
<equation-count count="13"/>
<ref-count count="47"/>
<page-count count="11"/>
<word-count count="7482"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>1. Introduction</title>
<p>In personality psychology, researchers believe that human personality is innate, and have developed various theoretical methods to understand and measure a person&#x2019;s personality. <xref ref-type="bibr" rid="B4">Costa and McCrae (1998)</xref> proposed a personality trait theory, in which personality characteristics were regarded as the main factors affecting the characteristics of individual behaviors, the critical factor in forming personality traits, and the basic unit for measuring personality traits. <xref ref-type="bibr" rid="B36">Vinciarelli and Mohammadi (2014)</xref> defined personality as follows: &#x201C;personality is a psychological construct that can explain the diversity of human behaviors on the basis of a few, stable and measurable individual characteristics.&#x201D; At present, researchers have used psychological scales to establish various personality trait models, including Big-Five (<xref ref-type="bibr" rid="B25">McCrae and John, 1992</xref>), Cattell sixteen personality factor (16PF) (<xref ref-type="bibr" rid="B20">Karson and O&#x2019;Dell, 1976</xref>), Myers-Briggs type indicators (MBTI) (<xref ref-type="bibr" rid="B9">Furnham, 1996</xref>), Minnesota multiphasic personality inventory (MMPI) (<xref ref-type="bibr" rid="B2">Bathurst et al., 1997</xref>), and so on. Among them, the Big-Five model has become the most widely used measurement model for automatic personality trait recognition. In particular, the Big-Five model, also known as the OCEAN model, aims to measure a person&#x2019;s personality through five bipolar scales: openness (O), conscientiousness (C), extroversion (E), agreeableness (A), and neuroticism (N). 
In affective neuroscience, the neural mechanisms of emotion expression are investigated by combining neuroscience with the psychological study of personality, emotion, and mood (<xref ref-type="bibr" rid="B27">Montag and Davis, 2018</xref>; <xref ref-type="bibr" rid="B37">Wang and Zhao, 2022</xref>; <xref ref-type="bibr" rid="B44">Zhang et al., 2022</xref>).</p>
<p>In recent years, researchers have employed computational techniques such as machine learning and deep learning methods (<xref ref-type="bibr" rid="B10">Gao et al., 2020</xref>; <xref ref-type="bibr" rid="B24">Liang et al., 2021</xref>; <xref ref-type="bibr" rid="B38">Wang and Deng, 2021</xref>; <xref ref-type="bibr" rid="B41">Yan et al., 2021</xref>; <xref ref-type="bibr" rid="B42">Ye et al., 2021</xref>) to model and measure human personality from first impression behavior data, which is called personality computing (<xref ref-type="bibr" rid="B19">Junior et al., 2019</xref>). One of the most important research subjects in personality computing is automatic personality trait recognition, which aims to process people&#x2019;s first impression behavior data by computer and then analyze their psychological characteristics (<xref ref-type="bibr" rid="B46">Zhao et al., 2022</xref>). Personality trait recognition has significant applications in human emotional behavior analysis, human-computer interaction, and interview recommendation. For example, <xref ref-type="bibr" rid="B45">Zhao et al. (2019)</xref> explored the influence of personality on emotional behavior by means of a hypergraph learning framework. When an enterprise recruits, its human resources department can leverage personality trait recognition techniques to analyze the personality characteristics of job seekers by collecting their first-impression behavior data, and then select employees who better meet the needs of the enterprise. To advance the development of personality trait recognition, the 2016 European Conference on Computer Vision (ECCV) released a publicly available personality dataset, i.e., ChaLearn-2016, and organized an academic competition of personality trait recognition (<xref ref-type="bibr" rid="B29">Ponce-L&#x00F3;pez et al., 2016</xref>). 
Since 2016, personality trait recognition has become a hot research topic in psychology, affective neuroscience, and artificial intelligence.</p>
<p>In a basic personality trait recognition system, two important steps are involved: feature extraction and personality trait classification or prediction (<xref ref-type="bibr" rid="B46">Zhao et al., 2022</xref>). Feature extraction aims to derive appropriate feature parameters related to the expression of personality traits from the acquired first impression behavioral data. Personality trait classification or prediction aims to employ machine learning methods to classify or predict personality traits from these features. Conventional classifiers or regressors, such as support vector machines (SVM) and linear regressors, can be adopted for this step. This paper focuses on feature extraction in a personality trait recognition system.</p>
<p>According to the types of extracted features characterizing personality traits, personality trait recognition techniques can be divided into hand-crafted based methods and deep learning based methods. Based on the extracted hand-crafted or deep learning features, previous works (<xref ref-type="bibr" rid="B46">Zhao et al., 2022</xref>) focus on performing personality trait recognition from a single modality, such as audio-based personality trait recognition (<xref ref-type="bibr" rid="B26">Mohammadi and Vinciarelli, 2012</xref>), visual-based personality trait recognition (<xref ref-type="bibr" rid="B15">G&#x00FC;rp&#x0131;nar et al., 2016</xref>), etc. Although these single-modality works have achieved good performance, they still have two limitations. First, people&#x2019;s first impression behavior data in real-world scenarios are often multimodal rather than single-modal for characterizing personality traits. For instance, both verbal and non-verbal information, such as the audio and visual modalities, are highly correlated with personality traits. It is thus necessary to adopt multiple input modalities for personality trait recognition. Second, although deep learning methods have become popular for personality trait recognition, each of them has its advantages and disadvantages. Therefore, integrating the advantages of different deep learning methods may further improve the performance of personality trait recognition, which is investigated in this work.</p>
<p>To address the two issues mentioned above, this paper proposes a multimodal personality trait recognition method integrating audio and visual modalities based on a hybrid deep learning framework. As depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>, the proposed method combines three different deep models, including convolutional neural networks (CNN) (<xref ref-type="bibr" rid="B23">LeCun et al., 1998</xref>; <xref ref-type="bibr" rid="B21">Krizhevsky et al., 2012</xref>), bi-directional long short-term memory network (Bi-LSTM) (<xref ref-type="bibr" rid="B33">Schuster and Paliwal, 1997</xref>), and the recently emerged Transformer (<xref ref-type="bibr" rid="B35">Vaswani et al., 2017</xref>), to learn high-level audio-visual feature representations, followed by a decision-level fusion strategy for final personality trait recognition. In particular, for audio feature extraction, the pre-trained deep audio CNN model called VGGish (<xref ref-type="bibr" rid="B17">Hershey et al., 2017</xref>) is used to learn high-level segment-level audio features. For visual feature extraction, the pre-trained deep face CNN model called VGG-Face (<xref ref-type="bibr" rid="B28">Parkhi et al., 2015</xref>) is leveraged to separately learn high-level frame-level global scene image features and local facial image features from each frame in dynamic video sequences. Then, these extracted deep audio-visual features are fed into a Bi-LSTM and a Transformer network (<xref ref-type="bibr" rid="B35">Vaswani et al., 2017</xref>) to individually capture long-term temporal dependency, thereby producing the final global audio and visual features for downstream tasks. Finally, a linear regression method is employed to conduct the single audio-based and visual-based personality trait recognition tasks, yielding six independent personality trait prediction scores. 
A decision-level fusion strategy is adopted to merge these personality trait prediction scores and output the final Big-Five personality scores and interview scores. Extensive experiments are conducted on the public ChaLearn First Impressions-V2 dataset (<xref ref-type="bibr" rid="B7">Escalante et al., 2017</xref>), and the results demonstrate the effectiveness of the proposed method on personality trait recognition tasks.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>The flowchart of the proposed multimodal personality trait recognition method integrating audio and visual modalities based on a hybrid deep learning framework.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-1107284-g001.tif"/>
</fig>
<p>The main contributions of this paper are summarized as follows:</p>
<list list-type="simple">
<list-item><label>(1)</label><p>This paper proposes a multimodal personality trait recognition method integrating audio and visual modalities based on a hybrid deep learning framework, in which CNN, Bi-LSTM, and Transformer are combined to capture high-level audio-visual spatio-temporal feature representations for personality trait recognition.</p></list-item>
<list-item><label>(2)</label><p>Extensive experiments are performed on the public ChaLearn First Impressions-V2 dataset, and experimental results show that the proposed method outperforms the other compared methods on personality trait recognition tasks.</p></list-item>
</list>
</sec>
<sec id="S2">
<title>2. Related work</title>
<p>The majority of prior works on personality trait recognition concentrate on a single modality, such as audio or visual cues, as described below.</p>
<sec id="S2.SS1">
<title>2.1. Audio-based personality trait recognition</title>
<p>In early works, the conventionally extracted hand-crafted audio features are low-level descriptor (LLD) features including intensity, pitch, formants, Mel-frequency cepstral coefficients (MFCCs), and so on. <xref ref-type="bibr" rid="B26">Mohammadi and Vinciarelli (2012)</xref> derived LLD features such as intensity, pitch, and formants, and then employed logistic regression to predict the Big-Five personality traits in audio clips. <xref ref-type="bibr" rid="B1">An et al. (2016)</xref> extracted the typical Interspeech-2013 ComParE feature set (<xref ref-type="bibr" rid="B32">Schuller et al., 2013</xref>) and fed it into an SVM classifier to conduct Big-Five personality trait recognition.</p>
<p>In recent years, researchers have tried to leverage deep learning (<xref ref-type="bibr" rid="B22">LeCun et al., 2015</xref>) models with a multilayer network structure to learn high-level audio feature representations for improving the performance of personality trait recognition. Among them, the representative deep learning methods are CNN (<xref ref-type="bibr" rid="B23">LeCun et al., 1998</xref>; <xref ref-type="bibr" rid="B21">Krizhevsky et al., 2012</xref>), recurrent neural networks (RNN) (<xref ref-type="bibr" rid="B6">Elman, 1990</xref>) and their variant, long short-term memory (LSTM) (<xref ref-type="bibr" rid="B18">Hochreiter and Schmidhuber, 1997</xref>), etc. <xref ref-type="bibr" rid="B16">Hayat et al. (2019)</xref> proposed an audio personality feature extraction method based on CNN. They fine-tuned the pre-trained CNN model called AudioSet on the first-impression behavior dataset and extracted high-level audio features for Big-Five personality prediction, demonstrating the advantages of CNN-based learned features compared with hand-crafted features. <xref ref-type="bibr" rid="B47">Zhu et al. (2018)</xref> presented a method of automatic perception of speakers&#x2019; personality from speech in Mandarin. They developed a new skip-frame LSTM system to learn personality information from frame-level descriptors such as MFCCs instead of hand-crafted prosodic features.</p>
</sec>
<sec id="S2.SS2">
<title>2.2. Visual-based personality trait recognition</title>
<p>In terms of the input type of visual data, visual-based personality trait recognition can be divided into two groups: static images-based and dynamic video sequences-based personality trait recognition.</p>
<p>For static images-based personality trait recognition, the extracted visual features mainly come from facial features, since facial morphology provides explicit cues for personality trait recognition. In early works, the commonly used hand-crafted facial features are color histograms, local binary patterns (LBP), global descriptors, aesthetic features, etc. <xref ref-type="bibr" rid="B14">Guntuku et al. (2015)</xref> extracted low-level hand-crafted features of facial images, including color histograms, LBP, global descriptors, and aesthetic features, and then employed the lasso regressor to predict the Big-Five personality traits of users in self-portrait images. Recently, deep learning methods have been applied to static images-based personality trait recognition. <xref ref-type="bibr" rid="B40">Xu et al. (2021)</xref> explored the relationship between self-reported personality characteristics and static facial images. They investigated the performance of several deep learning models pre-trained on the ImageNet data, such as MobileNetv2, ResNeSt50, and the designed personality prediction neural network based on soft thresholding (S-NNPP), by fine-tuning them on a self-constructed dataset composed of facial images and personality characteristics.</p>
<p>For dynamic video sequences-based personality trait recognition, dynamic video sequences contain temporal information related to facial activity statistics, thereby providing useful and complementary cues for personality trait recognition (<xref ref-type="bibr" rid="B19">Junior et al., 2019</xref>). In early works, the hand-crafted video features related to facial activity statistics were usually adopted for personality trait recognition. <xref ref-type="bibr" rid="B34">Teijeiro-Mosquera et al. (2014)</xref> exploited the relationships between facial expressions in dynamic video sequences and personality impressions of the Big-Five traits. To characterize facial activity statistics, they extracted four kinds of behavioral cues for personality trait recognition, including statistic-based cues, Threshold (THR) cues, Hidden Markov Models (HMM) cues, and Winner Takes All (WTA) cues. Likewise, several recently developed deep learning methods have been employed for dynamic video sequences-based personality trait recognition. <xref ref-type="bibr" rid="B15">G&#x00FC;rp&#x0131;nar et al. (2016)</xref> extracted deep facial and scene feature representations in dynamic video sequences by fine-tuning a pre-trained VGG-19 model, and then input them into a kernel extreme learning machine to perform the prediction of Big-Five personality traits. <xref ref-type="bibr" rid="B3">Beyan et al. (2021)</xref> presented a classification method of perceived personality traits on the basis of novel deep visual activity (VA)-based features derived only from key-dynamic images in dynamic video sequences. They adopted a dynamic image construction, which aimed to learn long-term VA with CNN + LSTM, and detect spatiotemporal saliency to decide key-dynamic images.</p>
</sec>
</sec>
<sec id="S3">
<title>3. The proposed method</title>
<p>To alleviate the limitations of single-modality personality trait recognition, this paper proposes a multimodal personality trait recognition method integrating audio and visual modalities based on a hybrid deep learning framework. <xref ref-type="fig" rid="F1">Figure 1</xref> depicts the flowchart of the proposed method. As depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>, the proposed method adopts two modalities as its input: one is the audio signals, and the other is the visual signals, including the global scene images and facial images. The hybrid deep learning framework comprises three different deep learning models, namely CNN, Bi-LSTM, and Transformer, which are used for high-level feature learning tasks. The proposed method consists of three key steps: video data preprocessing, audio-visual feature extraction, and decision-level fusion, as described below.</p>
<sec id="S3.SS1">
<title>3.1. Video data preprocessing</title>
<p>For audio signals in the video data, we use the pre-trained VGGish model (<xref ref-type="bibr" rid="B17">Hershey et al., 2017</xref>) to extract high-level audio segment-level features. Note that VGGish requires input speech segments with a length of 0.96 s. To this end, the original audio signals in the video data are divided into a number of adjacent segments, each lasting 0.96 s.</p>
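<p>The segmentation step described above can be sketched as follows. This is a minimal illustration rather than the authors&#x2019; implementation: the function name <monospace>split_into_segments</monospace>, the 16 kHz sampling rate used in the example, and the choice to drop any trailing remainder shorter than 0.96 s are all assumptions made here.</p>
<preformat><![CDATA[
```python
def split_into_segments(samples, sample_rate, segment_seconds=0.96):
    """Split a mono signal into adjacent, non-overlapping fixed-length segments.

    Assumption: a trailing remainder shorter than one segment is dropped.
    """
    segment_len = int(round(sample_rate * segment_seconds))
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    n_segments = len(samples) // segment_len
    return [
        samples[k * segment_len:(k + 1) * segment_len]
        for k in range(n_segments)
    ]


# Example: a 15 s clip sampled at an assumed 16 kHz.
clip = [0.0] * (15 * 16000)
segments = split_into_segments(clip, sample_rate=16000)
```
]]></preformat>
<p>Under these assumptions each segment holds 15,360 samples, and every segment would then be passed to the audio network separately.</p>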
<p>For visual signals in the video data, two preprocessing tasks are implemented. For global scene images in a video, 100 scene images are selected at equal intervals from each original video sample. Then, each global scene image is resampled from the original resolution of 1280&#x00D7;720 pixels to 224&#x00D7;224 pixels as input to the VGG-Face model (<xref ref-type="bibr" rid="B28">Parkhi et al., 2015</xref>). For local face images in a video, we employ the popular Multi-Task Convolutional Neural Network (MTCNN) (<xref ref-type="bibr" rid="B43">Zhang et al., 2016</xref>) to conduct face detection tasks. The face image detected in each frame is resampled to 224&#x00D7;224 pixels. Since some videos are affected by environmental factors such as illumination, MTCNN may detect face images with low accuracy. As a tradeoff, 30 frames of detected face images are selected at equal intervals from the original video. For a video with fewer than 30 frames of detected face images, the first and last face images are repeated until the number of face frames reaches 30.</p>
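<p>A possible sketch of the frame selection and padding logic is given below. The text does not specify exactly how the equal intervals are computed or in which order the first and last face images are repeated, so both helper functions are hypothetical: indices are taken at a constant fractional step, and padding alternates between prepending the first frame and appending the last one.</p>
<preformat><![CDATA[
```python
def select_equally_spaced(n_frames, n_select):
    """Return up to n_select frame indices spread at equal intervals."""
    if n_frames <= 0 or n_select <= 0:
        return []
    if n_frames <= n_select:
        return list(range(n_frames))
    step = n_frames / n_select  # fractional step, always greater than 1 here
    return [int(k * step) for k in range(n_select)]


def pad_with_first_and_last(frames, target=30):
    """Repeat the first and last frames until the list has `target` items.

    Assumption: copies are added alternately at the front and at the back.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    padded = list(frames)
    first, last = frames[0], frames[-1]
    prepend = True
    while len(padded) < target:
        if prepend:
            padded.insert(0, first)
        else:
            padded.append(last)
        prepend = not prepend
    return padded


scene_indices = select_equally_spaced(n_frames=459, n_select=100)
face_frames = pad_with_first_and_last(["f0", "f1", "f2"], target=30)
```
]]></preformat>
<p>Sequences that already contain at least the target number of frames pass through the padding helper unchanged.</p>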
</sec>
<sec id="S3.SS2">
<title>3.2. Audio-visual feature extraction</title>
<p>Audio-visual feature extraction aims to learn the local and global feature representations from original audio and visual signals in a video for personality trait recognition, as described below.</p>
<sec id="S3.SS2.SSS1">
<title>3.2.1. Audio-visual local feature extraction</title>
<p>For each 0.96 s audio segment, we leverage the VGGish model (<xref ref-type="bibr" rid="B17">Hershey et al., 2017</xref>) pre-trained on the AudioSet dataset (<xref ref-type="bibr" rid="B11">Gemmeke et al., 2017</xref>) to capture high-level segment-level deep audio features. The VGGish model consists of 6 convolutional layers, 4 pooling layers, and 3 fully connected layers. The kernel sizes of the convolutional layers and pooling layers are 3&#x00D7;3 and 2&#x00D7;2, respectively. Since the last fully connected layer in the VGGish network has 128 neurons, the audio features learned by the VGGish model are 128-dimensional.</p>
<p>For each scene and face image in a video, we employ the VGG-Face model (<xref ref-type="bibr" rid="B28">Parkhi et al., 2015</xref>) pre-trained on the ImageNet dataset (<xref ref-type="bibr" rid="B5">Deng et al., 2009</xref>) to learn high-level frame-level deep visual feature representations for downstream scene and face global feature learning tasks, respectively. The VGG-Face model includes 13 convolutional layers, 5 pooling layers, and 2 fully connected layers. Since the last fully connected layer in the VGG-Face network has 4096 neurons, the visual frame-level features obtained by the VGG-Face network are 4096-dimensional.</p>
<p>Given the <italic>i</italic>-th input video clip <italic>a</italic><sub><italic>i</italic></sub> (<italic>i</italic> = 1, 2, &#x22EF;, <italic>N</italic>) and its corresponding Big-Five personality score <italic>y</italic><sub><italic>i</italic></sub>, we fine-tune the pre-trained VGGish network (<xref ref-type="bibr" rid="B17">Hershey et al., 2017</xref>) to obtain deep segment-level audio feature representations, as described below:</p>
<disp-formula id="S3.E1">
<label>(1)</label>
<mml:math display="block" id="M1"><mml:mrow><mml:munder><mml:mo movablelimits="false">min</mml:mo><mml:mrow><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>G</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>G</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mi>L</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>sigmoid</mml:mtext><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>G</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mi>&#x03B7;</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>G</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>;</mml:mo><mml:msup><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>V</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>G</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where &#x03B7;<italic><sup>VG</sup></italic>(<italic>a</italic><sub><italic>i</italic></sub>;&#x03B8;<italic><sup>VG</sup></italic>) represents the output of the last fully connected layer in the VGGish network, and &#x03B8;<italic><sup>VG</sup></italic> and <italic>W<sup>VG</sup></italic> denote the network parameters of the VGGish network and the weights of the sigmoid layer, respectively. The cross-entropy loss function <italic>L</italic> is defined as:</p>
<disp-formula id="S3.E2">
<label>(2)</label>
<mml:math display="block" id="M2"><mml:mrow><mml:mrow><mml:mi>L</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>V</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>G</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mi>p</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>y</italic><sub><italic>j</italic></sub> is the <italic>j</italic>-th ground-truth Big-Five personality score, and <inline-formula><mml:math id="INEQ10"><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mi>p</mml:mi></mml:msubsup></mml:math></inline-formula> is the corresponding predicted Big-Five personality score.</p>
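<p>As a small numerical illustration of the loss in Eq. (2), the sketch below evaluates the sigmoid output layer and the cross-entropy sum in plain Python. The function names, the clipping constant <monospace>eps</monospace> (added only to avoid taking the logarithm of zero), and the example values are assumptions introduced here and are not taken from the original experiments; scores are assumed to lie in the interval [0, 1].</p>
<preformat><![CDATA[
```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (2): L = -sum_j y_j * log(y_j^p).

    eps is a numerical safeguard (an assumption, not part of Eq. 2).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


# Hypothetical activations of the output layer and ground-truth scores.
activations = [0.2, -0.4, 1.1, 0.0, 0.7]
targets = [0.55, 0.40, 0.70, 0.50, 0.60]
predictions = [sigmoid(z) for z in activations]
loss = cross_entropy(targets, predictions)
```
]]></preformat>
<p>Because each predicted score lies strictly between 0 and 1, every term of the sum is non-negative, so the loss decreases toward zero as the predicted scores of the weighted terms approach 1.</p>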
<p>For deep visual scene and face feature extraction on each frame of video, we fine-tune the pre-trained VGG-Face network (<xref ref-type="bibr" rid="B28">Parkhi et al., 2015</xref>) to learn high-level visual feature representations. The process of fine-tuning the pre-trained VGG-Face network is similar to the above-mentioned Eqs 1, 2.</p>
</sec>
<sec id="S3.SS2.SSS2">
<title>3.2.2. Audio-visual global feature extraction</title>
<p>After completing the local audio and visual feature extraction tasks, it is necessary to individually learn the global audio features, visual scene features, and visual face features from the entire videos so as to conduct personality trait prediction tasks. To this end, we adopt the Bi-LSTM (<xref ref-type="bibr" rid="B33">Schuster and Paliwal, 1997</xref>) and the recently emerged Transformer (<xref ref-type="bibr" rid="B35">Vaswani et al., 2017</xref>) to independently model long-term dependencies of temporal dynamics in video sequences, as described below.</p>
<p>Given an input sequence <italic>e</italic><sub><italic>t</italic></sub>, the learning process of the Bi-LSTM network is:</p>
<disp-formula id="S3.E3">
<label>(3)</label>
<mml:math display="block" id="M3"><mml:mrow><mml:mi>E</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mi>Bi</mml:mi><mml:mo>-</mml:mo><mml:mrow><mml:mi>LSTM</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:mi>B</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:mi>L</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>T</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>M</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>E</italic> &#x2208; &#x211D;<sup>1&#x00D7;<italic>d</italic></sup> denotes the learned temporal features, and <italic>W</italic><sub><italic>Bi-LSTM</italic></sub> denotes the weight parameters of the Bi-LSTM.</p>
<p>The original Transformer (<xref ref-type="bibr" rid="B35">Vaswani et al., 2017</xref>) is built on self-attention, in the form of multi-head attention, and contains no recurrent or convolutional structures. A multi-head attention module consists of several Scaled Dot-Product Attention (SDPA) modules running in parallel, whose outputs are concatenated and fed to a linear layer. Given the input query (<italic>Q</italic>), key (<italic>K</italic>), and value (<italic>V</italic>), the output of each SDPA module is defined as:</p>
<disp-formula id="S3.E4">
<label>(4)</label>
<mml:math display="block" id="M4"><mml:mrow><mml:mrow><mml:mtext>Attention</mml:mtext><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>softmax</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mi>Q</mml:mi><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mi>K</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x2062;</mml:mo><mml:mi>V</mml:mi></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>d</italic><sub><italic>k</italic></sub> is the feature dimension of the key matrix <italic>K</italic>.</p>
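<p>As an illustration of Eq. 4, the following minimal Python sketch computes the output of a single SDPA module for small matrices stored as nested lists. It is not the implementation used in this work: the function names <monospace>softmax</monospace> and <monospace>scaled_dot_product_attention</monospace> and the toy values of <italic>Q</italic>, <italic>K</italic>, and <italic>V</italic> are assumptions made only for illustration.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Eq. 4: softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices."""
    d_k = len(K[0])  # feature dimension of the key matrix
    scale = math.sqrt(d_k)
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        output.append([
            sum(w * v[c] for w, v in zip(weights, V))
            for c in range(len(V[0]))
        ])
    return output


# Toy example (hypothetical values): two queries, two keys, two values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(Q, K, V)
```
]]></preformat>
<p>Each output row is a convex combination of the rows of <italic>V</italic>, so a query that matches one key more closely is pulled toward the corresponding value row. A multi-head module would run several such computations on different linear projections of the inputs and concatenate the results.</p>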
</sec>
</sec>
<sec id="S3.SS3">
<title>3.3. Decision-level fusion</title>
<p>After obtaining audio-visual global features extracted by a Bi-LSTM model and a Transformer model, we adopt a linear regression layer to predict the Big-Five personality and interview scores. The linear regression layer is calculated as follows:</p>
<disp-formula id="S3.E5">
<label>(5)</label>
<mml:math display="block" id="M5"><mml:mrow><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>x</italic><sub><italic>i</italic></sub>, <italic>w</italic><sub><italic>i</italic></sub>, and <italic>b</italic> represent the <italic>i</italic>-th input sample, the corresponding weight value, and bias, respectively. <italic>f</italic><sub><italic>i</italic></sub>(<italic>x</italic>) is the <italic>i</italic>-th prediction score value.</p>
<p>As shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, when using the learned audio features, visual scene features, and visual face features as inputs of a linear regression layer, we can obtain six different recognition results. To effectively fuse these six different recognition results, a weighted decision-level fusion strategy is employed, as described below:</p>
<disp-formula id="S3.E6">
<label>(6)</label>
<mml:math display="block" id="M6"><mml:mrow><mml:mrow><mml:mover><mml:mo mathvariant="italic" movablelimits="false">f</mml:mo><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>6</mml:mn></mml:munderover><mml:mrow><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where &#x03B1;<sub><italic>i</italic></sub> is the weight value, <italic>f</italic><sub><italic>i</italic></sub>(<italic>x</italic>) is the predicted value of each type of features, and <inline-formula><mml:math id="INEQ14"><mml:mrow><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>6</mml:mn></mml:msubsup><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula>. The mean squared error (MSE) loss is computed as follows:</p>
<disp-formula id="S3.E7">
<label>(7)</label>
<mml:math display="block" id="M7"><mml:mrow><mml:mrow><mml:mtext>MSE</mml:mtext><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mo mathvariant="italic" movablelimits="false">f</mml:mo><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi>E</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mover><mml:mo mathvariant="italic" movablelimits="false">f</mml:mo><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mi>Y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi>E</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>6</mml:mn></mml:munderover><mml:mrow><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mi>Y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>Y</italic> is the ground-truth score. Our goal is to minimize the MSE loss subject to <inline-formula><mml:math id="INEQ15"><mml:mrow><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>6</mml:mn></mml:msubsup><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></inline-formula>. To this end, the Lagrangian of this problem is written as:</p>
<disp-formula id="S3.E8">
<label>(8)</label>
<mml:math display="block" id="M8"><mml:mrow><mml:mrow><mml:mi>L</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mtext>MSE</mml:mtext><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:mo mathvariant="italic" movablelimits="false">f</mml:mo><mml:mo>&#x223C;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mrow><mml:mi>&#x03BB;</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>6</mml:mn></mml:munderover><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where &#x03BB; is the Lagrange multiplier.</p>
<p>Then, we take the partial derivative of Eq. 8 with respect to &#x03B1;<sub><italic>m</italic></sub> for <italic>m</italic> = 1, 2, &#x22EF;, 6:</p>
<disp-formula id="S3.E9">
<label>(9)</label>
<mml:math display="block" id="M9"><mml:mrow><mml:mfrac><mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mo>&#x2061;</mml:mo><mml:mi>L</mml:mi></mml:mrow><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x2202;</mml:mo><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi>E</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>6</mml:mn></mml:munderover><mml:mrow><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mi>Y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mi>Y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x03BB;</mml:mi></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>Setting the gradient to zero yields:</p>
<disp-formula id="S3.E10">
<label>(10)</label>
<mml:math display="block" id="M10"><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mn>6</mml:mn></mml:munderover><mml:mrow><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mi>E</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mi>Y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mi>Y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x03BB;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x22EF;</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mn>6</mml:mn></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>Let &#x03B1; = [&#x03B1;<sub>1</sub>,&#x03B1;<sub>2</sub>,&#x03B1;<sub>3</sub>,&#x03B1;<sub>4</sub>,&#x03B1;<sub>5</sub>,&#x03B1;<sub>6</sub>]<sup><italic>T</italic></sup> and &#x03A9; = [<italic>w</italic><sub><italic>ij</italic></sub>], where <italic>w</italic><sub><italic>ij</italic></sub> = <italic>E</italic>[(<italic>f</italic><sub><italic>i</italic></sub>(<italic>X</italic>)&#x2212;<italic>Y</italic>)(<italic>f</italic><sub><italic>j</italic></sub>(<italic>X</italic>)&#x2212;<italic>Y</italic>)]. Eq. 10 can then be rewritten as:</p>
<disp-formula id="S3.E11">
<label>(11)</label>
<mml:math display="block" id="M11"><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi mathvariant="normal">&#x03B1;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mfrac><mml:mi mathvariant="normal">&#x03BB;</mml:mi><mml:mn>2</mml:mn></mml:mfrac><mml:mo>&#x2062;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mrow></mml:math>
</disp-formula>
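<p>where 1 on the right-hand side denotes the six-dimensional all-ones vector. Combining Eq. 11 with the constraint 1<sup><italic>T</italic></sup>&#x03B1; = 1 eliminates &#x03BB;, and the optimal weight vector &#x03B1; is obtained as:</p>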
<disp-formula id="S3.E12">
<label>(12)</label>
<mml:math display="block" id="M12"><mml:mrow><mml:mi mathvariant="normal">&#x03B1;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2062;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mn>1</mml:mn><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mi mathvariant="normal">&#x03A9;</mml:mi><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x2062;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow></mml:math>
</disp-formula>
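<p>The closed-form solution in Eq. 12 can be sketched in a few lines of Python. The sketch below is illustrative only and rests on several assumptions not stated in the text: &#x03A9; is estimated as the sample mean of the error products on a held-out set, the linear system is solved by plain Gaussian elimination, and the helper names <monospace>error_covariance</monospace>, <monospace>solve</monospace>, and <monospace>fusion_weights</monospace> are hypothetical. The toy example uses three predictors instead of six to keep the numbers readable.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def error_covariance(preds, targets):
    """Estimate Omega[i][j] as the sample mean of (f_i - y) * (f_j - y).

    preds is a list with one list of predictions per predictor;
    targets is the list of ground-truth scores for the same samples.
    """
    k = len(preds)
    n = len(targets)
    errs = [[p - y for p, y in zip(preds[i], targets)] for i in range(k)]
    return [[sum(a * b for a, b in zip(errs[i], errs[j])) / n
             for j in range(k)] for i in range(k)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("Omega is singular or ill-conditioned")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def fusion_weights(omega):
    """Eq. 12: alpha = Omega^-1 1 / (1^T Omega^-1 1)."""
    z = solve(omega, [1.0] * len(omega))  # z = Omega^-1 1
    total = sum(z)                        # 1^T Omega^-1 1
    return [v / total for v in z]


# Toy example (hypothetical): three predictors with uncorrelated errors
# whose mean squared errors are 1, 2, and 4.
omega = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
alpha = fusion_weights(omega)
```
]]></preformat>
<p>In this toy case the weights are inversely proportional to each predictor's mean squared error, so the most accurate predictor receives the largest weight. Note that Eq. 12 does not by itself force the weights to be non-negative, and &#x03A9; may be close to singular when the individual predictions are strongly correlated.</p>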
</sec>
</sec>
<sec id="S4">
<title>4. Experiments</title>
<sec id="S4.SS1">
<title>4.1. Dataset</title>
<p>To verify the effectiveness of the proposed method, the public ChaLearn First Impression-V2 dataset (<xref ref-type="bibr" rid="B7">Escalante et al., 2017</xref>) is employed for personality and interview prediction. This dataset contains 10,000 video clips collected from more than 3,000 different YouTube videos. All participants in the videos speak English. The video resolution is 1280&#x00D7;720, and each clip lasts about 15 s. The dataset also provides an &#x201C;Interview&#x201D; label for interview analysis. It is divided into a training set, a testing set, and a validation set containing 6,000, 2,000, and 2,000 video clips, respectively. In this work, we use the training and validation sets for experiments because the testing set is only open to competitors. Each video is labeled with Big-Five personality scores in the range [0,1]. <xref ref-type="fig" rid="F2">Figure 2</xref> shows several image samples from the ChaLearn First Impression-V2 dataset.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Image samples with the labeled Big-Five personality score from the ChaLearn First Impression-V2 dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnins-16-1107284-g002.tif"/>
</fig>
</sec>
<sec id="S4.SS2">
<title>4.2. Implementation details</title>
<p>All deep learning models are trained with a batch size of 32 and an initial learning rate of 1&#x00D7;10<sup>&#x2212;4</sup>, which is halved after each epoch. The maximum number of epochs is 30, and the Adam optimizer is used with the MSE loss function. The experimental platform is an NVIDIA Quadro M6000 GPU with 24 GB of memory. To improve the generalization performance of the trained models and avoid overfitting, the early stopping strategy (<xref ref-type="bibr" rid="B30">Prechelt, 1998</xref>) is used.</p>
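<p>For concreteness, the learning rate schedule can be sketched as below. This is only one reading of the description: it assumes that the halving compounds from epoch to epoch and that epochs are counted from zero. The function name <monospace>learning_rate</monospace> and the constant names are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
INITIAL_LR = 1e-4   # initial learning rate stated in the text
MAX_EPOCHS = 30     # maximum number of epochs stated in the text


def learning_rate(epoch):
    """Learning rate used during the given zero-based epoch.

    Assumes the rate is halved after every completed epoch, so the
    halvings compound: lr(epoch) = INITIAL_LR * 0.5 ** epoch.
    """
    if epoch < 0 or epoch >= MAX_EPOCHS:
        raise ValueError("epoch must be in [0, MAX_EPOCHS)")
    return INITIAL_LR * 0.5 ** epoch


schedule = [learning_rate(e) for e in range(MAX_EPOCHS)]
```
]]></preformat>
<p>Under this reading the step size shrinks quickly, so in practice most of the learning happens in the first few epochs and early stopping decides where training ends.</p>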
<p>In this work, we choose a two-layer Bi-LSTM to capture the temporal dynamics of video sequences. Each Bi-LSTM layer has 2048 neurons. The Transformer model uses 6 encoder layers, which gave the best performance, and its last layer outputs 1024-dimensional features. For comparison with these deep learning models, several classical regression models are employed: Support Vector Regression (SVR) with polynomial (poly), radial basis function (RBF), and linear kernels, and Decision Tree Regression (DTR). In the SVR model, the degree of the polynomial kernel is 3, the penalty factor &#x201C;<italic>C</italic>&#x201D; of the RBF kernel is 2, and the parameter &#x201C;gamma&#x201D; is 0.5. The DTR model is used with its default parameters, such as the splitting policy &#x201C;split = best&#x201D; at each node and &#x201C;min_samples_split = 2&#x201D; for splitting an internal node. For these classical regression models, simple average pooling is applied to the extracted audio-visual local features to produce the global features used as their inputs.</p>
<p>The metric used to evaluate the predicted personality trait or interview scores is defined as:</p>
<disp-formula id="S4.E13">
<label>(13)</label>
<mml:math display="block" id="M13"><mml:mrow><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mi>p</mml:mi></mml:msubsup><mml:mo>-</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mi>N</mml:mi></mml:mfrac></mml:mrow></mml:mrow></mml:mrow></mml:math>
</disp-formula>
<p>where <italic>N</italic> is the number of samples, <inline-formula><mml:math id="INEQ21"><mml:msubsup><mml:mi>y</mml:mi><mml:mi>j</mml:mi><mml:mi>p</mml:mi></mml:msubsup></mml:math></inline-formula> is the predicted value, and <italic>y</italic><sub><italic>j</italic></sub> is the ground-truth value. A higher value of <italic>S</italic> indicates better performance on personality or interview prediction tasks.</p>
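<p>Eq. 13 is one minus the mean absolute error between predictions and ground truth. A minimal Python sketch is given below; the function name <monospace>mean_accuracy</monospace> and the sample values are hypothetical and are not taken from the dataset.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def mean_accuracy(predicted, ground_truth):
    """Eq. 13: S = 1 - (1/N) * sum_j |y_j^p - y_j|."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    total_abs_error = sum(abs(p - y) for p, y in zip(predicted, ground_truth))
    return 1.0 - total_abs_error / n


# Toy example (hypothetical scores in [0, 1]).
score = mean_accuracy([0.5, 0.7, 0.4], [0.6, 0.7, 0.2])
```
]]></preformat>
<p>Because all scores lie in [0,1], <italic>S</italic> also lies in [0,1], and a perfect predictor reaches <italic>S</italic> = 1.</p>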
</sec>
<sec id="S4.SS3">
<title>4.3. Experimental results and analysis</title>
<p>In this section, two groups of experiments are carried out on the ChaLearn First Impression-V2 dataset to verify the effectiveness of all methods used. The first addresses single-modal personality trait recognition, and the second addresses multi-modal personality trait recognition.</p>
<sec id="S4.SS3.SSS1">
<title>4.3.1. Results of single-modal personality trait recognition</title>
<p>For single-modal personality recognition, we present the experimental results and analysis for each type of feature separately: the audio features, visual scene features, and visual face features extracted by the pre-trained deep models.</p>
<p><xref ref-type="table" rid="T1">Table 1</xref> shows the prediction results obtained by different methods on deep audio features extracted by the pre-trained VGGish. &#x201C;Transformer + Bi-LSTM&#x201D; denotes that the features learned by the Transformer and the Bi-LSTM are directly concatenated into a single feature vector, which is fed to the subsequent linear regression layer for prediction. It can be seen from <xref ref-type="table" rid="T1">Table 1</xref> that Transformer + Bi-LSTM performs best on deep audio features. More specifically, its average Big-Five personality prediction score is 0.8952 and its interview prediction score is 0.8953, outperforming the other methods. Ranked by average score, the remaining methods are ordered as Bi-LSTM, SVR (linear), SVR (rbf), Transformer, SVR (poly), and DTR. This shows the advantage of Transformer + Bi-LSTM on audio personality trait recognition tasks. Transformer + Bi-LSTM also performs better than either the Transformer or the Bi-LSTM alone, indicating a certain complementarity between the two models.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Prediction results of deep audio features extracted by the pre-trained VGGish for different methods.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Models</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">O</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">C</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">E</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">A</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">N</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Average score</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Interview score</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SVR (poly)</td>
<td valign="top" align="center">0.8540</td>
<td valign="top" align="center">0.8329</td>
<td valign="top" align="center">0.8624</td>
<td valign="top" align="center">0.8402</td>
<td valign="top" align="center">0.8744</td>
<td valign="top" align="center">0.8528</td>
<td valign="top" align="center">0.8319</td>
</tr>
<tr>
<td valign="top" align="left">SVR (rbf)</td>
<td valign="top" align="center">0.8967</td>
<td valign="top" align="center">0.8844</td>
<td valign="top" align="center">0.8932</td>
<td valign="top" align="center">0.9012</td>
<td valign="top" align="center">0.8906</td>
<td valign="top" align="center">0.8932</td>
<td valign="top" align="center">0.8920</td>
</tr>
<tr>
<td valign="top" align="left">SVR (linear)</td>
<td valign="top" align="center">0.8980</td>
<td valign="top" align="center">0.8846</td>
<td valign="top" align="center">0.8935</td>
<td valign="top" align="center">0.9025</td>
<td valign="top" align="center">0.8920</td>
<td valign="top" align="center">0.8941</td>
<td valign="top" align="center">0.8945</td>
</tr>
<tr>
<td valign="top" align="left">DTR</td>
<td valign="top" align="center">0.8541</td>
<td valign="top" align="center">0.8411</td>
<td valign="top" align="center">0.8542</td>
<td valign="top" align="center">0.8610</td>
<td valign="top" align="center">0.8453</td>
<td valign="top" align="center">0.8511</td>
<td valign="top" align="center">0.8511</td>
</tr>
<tr>
<td valign="top" align="left">Transformer</td>
<td valign="top" align="center">0.8972</td>
<td valign="top" align="center">0.8814</td>
<td valign="top" align="center">0.8920</td>
<td valign="top" align="center">0.9035</td>
<td valign="top" align="center">0.8907</td>
<td valign="top" align="center">0.8930</td>
<td valign="top" align="center">0.8915</td>
</tr>
<tr>
<td valign="top" align="left">Bi-LSTM</td>
<td valign="top" align="center">0.8986</td>
<td valign="top" align="center">0.8834</td>
<td valign="top" align="center">0.8932</td>
<td valign="top" align="center">0.9045</td>
<td valign="top" align="center">0.8928</td>
<td valign="top" align="center">0.8945</td>
<td valign="top" align="center">0.8947</td>
</tr>
<tr>
<td valign="top" align="left">Transformer + Bi-LSTM</td>
<td valign="top" align="center"><bold>0.8989</bold></td>
<td valign="top" align="center"><bold>0.8847</bold></td>
<td valign="top" align="center"><bold>0.8938</bold></td>
<td valign="top" align="center"><bold>0.9048</bold></td>
<td valign="top" align="center"><bold>0.8935</bold></td>
<td valign="top" align="center"><bold>0.8952</bold></td>
<td valign="top" align="center"><bold>0.8953</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>Bold values denote the highest performance.</p></fn>
</table-wrap-foot>
</table-wrap>
<p><xref ref-type="table" rid="T2">Tables 2</xref>, <xref ref-type="table" rid="T3">3</xref> separately present the personality prediction results obtained by different methods on deep visual scene features and deep visual face features extracted by the pre-trained VGG-Face. It can be observed from <xref ref-type="table" rid="T2">Tables 2</xref>, <xref ref-type="table" rid="T3">3</xref> that Transformer + Bi-LSTM still obtains better overall performance than the other methods. In particular, with deep visual scene features and face features, Transformer + Bi-LSTM produces average Big-Five personality prediction scores of 0.9039 and 0.9124, respectively, and interview prediction scores of 0.9057 and 0.9163, respectively. Ranked by average score on visual scene features, the remaining methods are ordered as Bi-LSTM, Transformer, SVR (poly), SVR (linear), SVR (rbf), and DTR; on visual face features the order is the same except that SVR (linear) ranks above SVR (poly). This shows the superiority of Transformer + Bi-LSTM on deep visual (scene and face) personality trait recognition tasks. Visual face features outperform visual scene features on personality trait recognition tasks. This may be because face images are more correlated with personality traits than scene images.</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Prediction results of deep visual scene features extracted by the pre-trained VGG-Face for different methods.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Models</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">O</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">C</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">E</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">A</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">N</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Average score</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Interview score</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SVR (poly)</td>
<td valign="top" align="center">0.8921</td>
<td valign="top" align="center">0.8896</td>
<td valign="top" align="center">0.8896</td>
<td valign="top" align="center">0.8962</td>
<td valign="top" align="center">0.8850</td>
<td valign="top" align="center">0.8905</td>
<td valign="top" align="center">0.8890</td>
</tr>
<tr>
<td valign="top" align="left">SVR (rbf)</td>
<td valign="top" align="center">0.8841</td>
<td valign="top" align="center">0.8736</td>
<td valign="top" align="center">0.8804</td>
<td valign="top" align="center">0.8963</td>
<td valign="top" align="center">0.8780</td>
<td valign="top" align="center">0.8825</td>
<td valign="top" align="center">0.8818</td>
</tr>
<tr>
<td valign="top" align="left">SVR (linear)</td>
<td valign="top" align="center">0.8896</td>
<td valign="top" align="center">0.8872</td>
<td valign="top" align="center">0.8867</td>
<td valign="top" align="center">0.8922</td>
<td valign="top" align="center">0.8809</td>
<td valign="top" align="center">0.8873</td>
<td valign="top" align="center">0.8865</td>
</tr>
<tr>
<td valign="top" align="left">DTR</td>
<td valign="top" align="center">0.8636</td>
<td valign="top" align="center">0.8607</td>
<td valign="top" align="center">0.8627</td>
<td valign="top" align="center">0.8711</td>
<td valign="top" align="center">0.8586</td>
<td valign="top" align="center">0.8633</td>
<td valign="top" align="center">0.8639</td>
</tr>
<tr>
<td valign="top" align="left">Transformer</td>
<td valign="top" align="center">0.8941</td>
<td valign="top" align="center">0.8844</td>
<td valign="top" align="center">0.8909</td>
<td valign="top" align="center">0.9021</td>
<td valign="top" align="center">0.8884</td>
<td valign="top" align="center">0.8920</td>
<td valign="top" align="center">0.8920</td>
</tr>
<tr>
<td valign="top" align="left">Bi-LSTM</td>
<td valign="top" align="center">0.9042</td>
<td valign="top" align="center">0.9013</td>
<td valign="top" align="center">0.9012</td>
<td valign="top" align="center">0.9091</td>
<td valign="top" align="center">0.8993</td>
<td valign="top" align="center">0.9030</td>
<td valign="top" align="center">0.9050</td>
</tr>
<tr>
<td valign="top" align="left">Transformer + Bi-LSTM</td>
<td valign="top" align="center"><bold>0.9043</bold></td>
<td valign="top" align="center"><bold>0.9025</bold></td>
<td valign="top" align="center"><bold>0.9035</bold></td>
<td valign="top" align="center"><bold>0.9093</bold></td>
<td valign="top" align="center"><bold>0.9000</bold></td>
<td valign="top" align="center"><bold>0.9039</bold></td>
<td valign="top" align="center"><bold>0.9057</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>Bold values denote the highest performance.</p></fn>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>Prediction results of deep visual face features extracted by the pre-trained VGG-Face for different methods.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Models</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">O</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">C</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">E</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">A</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">N</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Average score</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Interview score</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">SVR (poly)</td>
<td valign="top" align="center">0.8871</td>
<td valign="top" align="center">0.8922</td>
<td valign="top" align="center">0.8923</td>
<td valign="top" align="center">0.8980</td>
<td valign="top" align="center">0.8855</td>
<td valign="top" align="center">0.8910</td>
<td valign="top" align="center">0.8963</td>
</tr>
<tr>
<td valign="top" align="left">SVR (rbf)</td>
<td valign="top" align="center">0.8841</td>
<td valign="top" align="center">0.8736</td>
<td valign="top" align="center">0.8804</td>
<td valign="top" align="center">0.8963</td>
<td valign="top" align="center">0.8780</td>
<td valign="top" align="center">0.8825</td>
<td valign="top" align="center">0.8818</td>
</tr>
<tr>
<td valign="top" align="left">SVR (linear)</td>
<td valign="top" align="center">0.8953</td>
<td valign="top" align="center">0.8922</td>
<td valign="top" align="center">0.8986</td>
<td valign="top" align="center">0.8974</td>
<td valign="top" align="center">0.8913</td>
<td valign="top" align="center">0.8950</td>
<td valign="top" align="center">0.8960</td>
</tr>
<tr>
<td valign="top" align="left">DTR</td>
<td valign="top" align="center">0.8714</td>
<td valign="top" align="center">0.8683</td>
<td valign="top" align="center">0.8702</td>
<td valign="top" align="center">0.8760</td>
<td valign="top" align="center">0.8674</td>
<td valign="top" align="center">0.8706</td>
<td valign="top" align="center">0.8721</td>
</tr>
<tr>
<td valign="top" align="left">Transformer</td>
<td valign="top" align="center">0.9023</td>
<td valign="top" align="center">0.9000</td>
<td valign="top" align="center">0.9029</td>
<td valign="top" align="center">0.9068</td>
<td valign="top" align="center">0.8968</td>
<td valign="top" align="center">0.9017</td>
<td valign="top" align="center">0.9017</td>
</tr>
<tr>
<td valign="top" align="left">Bi-LSTM</td>
<td valign="top" align="center">0.9103</td>
<td valign="top" align="center"><bold>0.9155</bold></td>
<td valign="top" align="center">0.9129</td>
<td valign="top" align="center">0.9135</td>
<td valign="top" align="center">0.9085</td>
<td valign="top" align="center">0.9121</td>
<td valign="top" align="center">0.9161</td>
</tr>
<tr>
<td valign="top" align="left">Transformer + Bi-LSTM</td>
<td valign="top" align="center"><bold>0.9110</bold></td>
<td valign="top" align="center">0.9148</td>
<td valign="top" align="center"><bold>0.9130</bold></td>
<td valign="top" align="center"><bold>0.9143</bold></td>
<td valign="top" align="center"><bold>0.9087</bold></td>
<td valign="top" align="center"><bold>0.9124</bold></td>
<td valign="top" align="center"><bold>0.9163</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>Bold values denote the highest performance.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>In summary, the results in <xref ref-type="table" rid="T1">Tables 1</xref>&#x2013;<xref ref-type="table" rid="T3">3</xref> show that, for single-modal personality recognition, the deep visual face features perform best on both personality trait and interview prediction tasks, followed by the deep visual scene features and then the deep audio features. This suggests that facial images, which carry facial expression cues, contain more discriminative information for personality trait recognition.</p>
</sec>
<sec id="S4.SS3.SSS2">
<title>4.3.2. Results of multimodal personality trait recognition</title>
<p>For multimodal personality recognition tasks, we compare the performance of three typical multimodal information fusion methods, namely feature-level fusion, decision-level fusion, and model-level fusion. In feature-level fusion, the audio-visual global features learned by the Bi-LSTM and Transformer networks are concatenated into a single feature vector, which is used as the input of the linear regression layer for personality trait prediction. In this case, feature-level fusion is also called early fusion (EF). In model-level fusion (MF), the concatenated audio-visual global features are fed into a 4-layer fully connected network (1024-512-256-128) for personality trait prediction. In decision-level fusion, we adopt Eq. 12 to obtain the analytical solution of the optimal weight values in Eq. 6. In this case, decision-level fusion is also called late fusion (LF).</p>
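<p>As a rough illustration of the two concatenation-based schemes described above, the following Python (NumPy) sketch builds an EF head (concatenation followed by one linear regression layer) and an MF head (concatenation followed by the 1024-512-256-128 fully connected stack). The feature size of 256 per modality, the six outputs (five traits plus the interview score), the ReLU activations, and the random untrained weights are assumptions made only for illustration; they are not taken from this paper.</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, out_dim, rng):
    """Randomly initialised fully connected layer (untrained, illustration only)."""
    w = rng.normal(0.0, 0.01, size=(x.shape[1], out_dim))
    b = np.zeros(out_dim)
    return x @ w + b


def early_fusion(features, n_outputs, rng):
    """EF: concatenate modality features, then apply one linear regression layer."""
    fused = np.concatenate(features, axis=1)
    return dense(fused, n_outputs, rng)


def model_level_fusion(features, n_outputs, rng, hidden=(1024, 512, 256, 128)):
    """MF: concatenate modality features, then apply the fully connected stack."""
    h = np.concatenate(features, axis=1)
    for width in hidden:
        h = np.maximum(dense(h, width, rng), 0.0)  # ReLU is an assumption
    return dense(h, n_outputs, rng)  # output layer size is an assumption


batch = 4
audio = rng.normal(size=(batch, 256))  # hypothetical global audio features
scene = rng.normal(size=(batch, 256))  # hypothetical global scene features
face = rng.normal(size=(batch, 256))   # hypothetical global face features

ef_out = early_fusion([audio, scene, face], n_outputs=6, rng=rng)
mf_out = model_level_fusion([audio, scene, face], n_outputs=6, rng=rng)
print(ef_out.shape, mf_out.shape)
```
</preformat>
<p>Both heads receive the same concatenated vector; in this sketch they differ only in the depth of the network that maps it to the prediction.</p>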
<p><xref ref-type="table" rid="T4">Table 4</xref> compares the recognition results obtained by the EF, MF, and LF fusion methods with those of the single-modality methods. The results in <xref ref-type="table" rid="T4">Table 4</xref> show that: (1) Among the three fusion methods, LF combining audio, scene, and face obtains the best performance, with an average score of 0.9167 on personality trait recognition tasks and a score of 0.9200 on interview prediction tasks. For personality trait recognition, the EF method slightly outperforms the MF method, yielding an average score of 0.9154. By contrast, the MF method slightly outperforms the EF method on interview prediction tasks, giving an interview score of 0.9180. (2) All three fusion methods outperform the single-modality methods. This indicates that the audio, scene, and face modalities are complementary to some extent on the target recognition tasks.</p>
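<p>The decision-level (late) fusion step can be sketched in Python as follows. Eq. 6 and Eq. 12 are not reproduced in this section, so the sketch assumes that Eq. 6 is a weighted sum of the per-modality predicted scores and uses an ordinary least-squares fit as a stand-in for the analytical solution; the function names, the synthetic data, and the absence of any constraint on the weights are all assumptions.</p>
<preformat>
```python
import numpy as np


def late_fusion_weights(preds, target):
    """Closed-form least-squares weights for combining per-modality predictions.

    preds:  array of shape (n_samples, n_modalities)
    target: array of shape (n_samples,)
    """
    weights, *_ = np.linalg.lstsq(preds, target, rcond=None)
    return weights


def late_fusion(preds, weights):
    """Weighted sum of the per-modality predictions."""
    return preds @ weights


rng = np.random.default_rng(0)
truth = rng.uniform(0.0, 1.0, size=200)  # synthetic ground-truth scores

# Synthetic audio, scene, and face predictions: truth plus independent noise
noise_levels = (0.15, 0.10, 0.05)
preds = np.stack(
    [truth + rng.normal(0.0, s, size=200) for s in noise_levels], axis=1
)

weights = late_fusion_weights(preds, truth)
fused = late_fusion(preds, weights)

fused_mse = np.mean((fused - truth) ** 2)
single_mse = np.mean((preds - truth[:, None]) ** 2, axis=0)
print("weights:", weights)
print("fused MSE:", fused_mse, "single-modality MSE:", single_mse)
```
</preformat>
<p>On the data used for the fit, the least-squares combination cannot have a larger squared error than any single modality, because using one modality alone is a special case of the weighted sum.</p>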
<table-wrap position="float" id="T4">
<label>TABLE 4</label>
<caption><p>Comparisons of recognition results obtained by different methods.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Modality</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">O</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">C</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">E</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">A</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">N</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Average score</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Interview score</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">A</td>
<td valign="top" align="center">0.8989</td>
<td valign="top" align="center">0.8847</td>
<td valign="top" align="center">0.8938</td>
<td valign="top" align="center">0.9048</td>
<td valign="top" align="center">0.8935</td>
<td valign="top" align="center">0.8952</td>
<td valign="top" align="center">0.8953</td>
</tr>
<tr>
<td valign="top" align="left">S</td>
<td valign="top" align="center">0.9043</td>
<td valign="top" align="center">0.9025</td>
<td valign="top" align="center">0.9035</td>
<td valign="top" align="center">0.9093</td>
<td valign="top" align="center">0.9000</td>
<td valign="top" align="center">0.9039</td>
<td valign="top" align="center">0.9057</td>
</tr>
<tr>
<td valign="top" align="left">F</td>
<td valign="top" align="center">0.9110</td>
<td valign="top" align="center">0.9148</td>
<td valign="top" align="center">0.9130</td>
<td valign="top" align="center">0.9143</td>
<td valign="top" align="center">0.9087</td>
<td valign="top" align="center">0.9124</td>
<td valign="top" align="center">0.9153</td>
</tr>
<tr>
<td valign="top" align="left">A + S + F (EF)</td>
<td valign="top" align="center">0.9145</td>
<td valign="top" align="center"><bold>0.9176</bold></td>
<td valign="top" align="center">0.9171</td>
<td valign="top" align="center">0.9158</td>
<td valign="top" align="center">0.9121</td>
<td valign="top" align="center">0.9154</td>
<td valign="top" align="center">0.9178</td>
</tr>
<tr>
<td valign="top" align="left">A + S + F (MF)</td>
<td valign="top" align="center">0.9151</td>
<td valign="top" align="center">0.9172</td>
<td valign="top" align="center">0.9156</td>
<td valign="top" align="center">0.9150</td>
<td valign="top" align="center">0.9123</td>
<td valign="top" align="center">0.9150</td>
<td valign="top" align="center">0.9180</td>
</tr>
<tr>
<td valign="top" align="left">A + S + F (LF)</td>
<td valign="top" align="center"><bold>0.9167</bold></td>
<td valign="top" align="center">0.9163</td>
<td valign="top" align="center"><bold>0.9176</bold></td>
<td valign="top" align="center"><bold>0.9177</bold></td>
<td valign="top" align="center"><bold>0.9150</bold></td>
<td valign="top" align="center"><bold>0.9167</bold></td>
<td valign="top" align="center"><bold>0.9200</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>A, audio; S, scene; F, face; EF, early fusion; MF, model-level fusion; LF, late fusion. Bold values denote the highest performance.</p></fn>
</table-wrap-foot>
</table-wrap>
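<p>The "Average score" column of Table 4 appears to be the mean of the five trait scores (O, C, E, A, N). The short Python check below recomputes it from the per-trait values copied from the table; the variable names are illustrative, and the small residual differences are assumed to come from rounding the per-trait scores to four decimals.</p>
<preformat>
```python
# Per-trait scores (O, C, E, A, N) and the reported average, copied from Table 4
table4 = {
    "A": ([0.8989, 0.8847, 0.8938, 0.9048, 0.8935], 0.8952),
    "S": ([0.9043, 0.9025, 0.9035, 0.9093, 0.9000], 0.9039),
    "F": ([0.9110, 0.9148, 0.9130, 0.9143, 0.9087], 0.9124),
    "A + S + F (EF)": ([0.9145, 0.9176, 0.9171, 0.9158, 0.9121], 0.9154),
    "A + S + F (MF)": ([0.9151, 0.9172, 0.9156, 0.9150, 0.9123], 0.9150),
    "A + S + F (LF)": ([0.9167, 0.9163, 0.9176, 0.9177, 0.9150], 0.9167),
}


def mean_trait_score(traits):
    """Arithmetic mean of the five per-trait scores."""
    return sum(traits) / len(traits)


# Absolute gap between the recomputed mean and the reported average
deviations = {
    name: abs(mean_trait_score(traits) - reported)
    for name, (traits, reported) in table4.items()
}

for name, (traits, reported) in table4.items():
    recomputed = mean_trait_score(traits)
    print(f"{name}: recomputed {recomputed:.4f}, reported {reported:.4f}")
```
</preformat>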
</sec>
<sec id="S4.SS3.SSS3">
<title>4.3.3. Comparisons with other existing methods</title>
<p>To further verify the effectiveness of the proposed method, <xref ref-type="table" rid="T5">Table 5</xref> compares it with other existing methods. <xref ref-type="table" rid="T5">Table 5</xref> shows that the proposed method obtains an average score of 0.9167, which is higher than the other reported results obtained with audio, visual, and text modalities. This demonstrates the advantage of our method on personality trait recognition tasks. The compared works are described as follows.</p>
<table-wrap position="float" id="T5">
<label>TABLE 5</label>
<caption><p>Comparisons with other existing methods.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">References</td>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Modality</td>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Feature extraction</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Fusion methods</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Average score</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B13">G&#x00FC;&#x00E7;l&#x00FC;t&#x00FC;rk et al., 2016</xref></td>
<td valign="top" align="left">Audio, visual</td>
<td valign="top" align="left">Audio:ResNet-17<break/>Visual:ResNet-17</td>
<td valign="top" align="center">EF</td>
<td valign="top" align="center">0.9109</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B12">G&#x00FC;&#x00E7;l&#x00FC;t&#x00FC;rk et al., 2017</xref></td>
<td valign="top" align="left">Audio, visual, text</td>
<td valign="top" align="left">Audio:ResNet-17<break/>Visual:ResNet-17<break/>Text:skip-thought vectors</td>
<td valign="top" align="center">EF</td>
<td valign="top" align="center">0.9118</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B39">Wei et al., 2017</xref></td>
<td valign="top" align="left">Audio, visual</td>
<td valign="top" align="left">Audio:MFCCs<break/>Visual:DAN</td>
<td valign="top" align="center">LF</td>
<td valign="top" align="center">0.9130</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B31">Principi et al., 2021</xref></td>
<td valign="top" align="left">Audio, visual</td>
<td valign="top" align="left">Audio:1D CNN<break/>Visual:ResNet-50</td>
<td valign="top" align="center">MF</td>
<td valign="top" align="center">0.9160</td>
</tr>
<tr>
<td valign="top" align="left"><xref ref-type="bibr" rid="B8">Escalante et al., 2022</xref></td>
<td valign="top" align="left">Audio, visual, text</td>
<td valign="top" align="left">Audio:ResNet-18<break/>Visual:ResNet-18<break/>Text: skip-thought vectors</td>
<td valign="top" align="center">LF</td>
<td valign="top" align="center">0.9161</td>
</tr>
<tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="left">Audio, visual</td>
<td valign="top" align="left">Audio:VGGish<break/>Visual:VGG-Face</td>
<td valign="top" align="center">LF</td>
<td valign="top" align="center"><bold>0.9167</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>EF, early fusion; MF, model-level fusion; LF, late fusion. Bold values denote the highest performance.</p></fn>
</table-wrap-foot>
</table-wrap>
<p><xref ref-type="bibr" rid="B13">G&#x00FC;&#x00E7;l&#x00FC;t&#x00FC;rk et al. (2016)</xref> proposed an audio-visual personality trait recognition method based on 17-layer deep residual networks (ResNet-17). They concatenated the learned features of the audio and visual streams at the feature level as the input of a fully connected layer, and reported an average score of 0.9109 for final personality trait prediction. This network does not require any feature engineering or visual analysis such as face detection or face landmark alignment. Similarly, they also presented a multimodal personality trait analysis method integrating audio, visual, and text modalities using the 17-layer deep residual networks (<xref ref-type="bibr" rid="B12">G&#x00FC;&#x00E7;l&#x00FC;t&#x00FC;rk et al., 2017</xref>). Here, they extracted skip-thought vectors as text features. They fused these modalities at the feature level and reported an average score of 0.9118. <xref ref-type="bibr" rid="B39">Wei et al. (2017)</xref> presented a deep bimodal regression method for personality traits on short video sequences. For the audio modality, they extracted MFCCs and logfbank features. For the visual modality, they employed a modified CNN model called Descriptor Aggregation Network (DAN) to extract visual features. Finally, they fused the predicted regression scores of the audio and visual modalities at the decision level, and reported an average score of 0.9130. <xref ref-type="bibr" rid="B31">Principi et al. (2021)</xref> presented a multimodal deep learning method integrating the raw audio and visual modalities for personality trait prediction. For the audio modality, a 14-layer 1D CNN was used for audio feature extraction. For the visual modality, they employed a pre-trained ResNet-50 network for visual feature extraction. Finally, they employed a fully connected layer to jointly learn audio-visual feature representations at the model level for final personality trait recognition, and achieved an average score of 0.9160.
<xref ref-type="bibr" rid="B8">Escalante et al. (2022)</xref> proposed a multimodal deep personality trait recognition method based on audio, visual, and text modalities. They adopted a ResNet-18 to extract audio and visual features, and used skip-thought vectors as text features. A late fusion strategy was then used to fuse all three modalities, yielding an average score of 0.9161.</p>
</sec>
</sec>
</sec>
<sec id="S5" sec-type="conclusion">
<title>5. Conclusion</title>
<p>This paper presents a multimodal personality trait recognition method based on a CNN + Bi-LSTM + Transformer network. In this work, CNN, Bi-LSTM, and Transformer are combined to capture high-level audio-visual spatio-temporal feature representations for personality trait recognition. We then compare multimodal personality prediction results obtained with three different fusion methods, namely feature-level fusion, model-level fusion, and decision-level fusion. Experiments on the public ChaLearn First Impression-V2 dataset show that decision-level fusion achieves the best multimodal personality trait recognition results with an average score of 0.9167, outperforming other existing methods.</p>
<p>Note that this work focuses only on integrating audio and visual modalities for multimodal personality trait recognition. Considering the diversity of modal information related to the expression of personality traits, it would be interesting to combine the current audio-visual modalities with other modalities, such as physiological signals and text cues, to further improve the performance of personality trait recognition. In addition, exploring more advanced deep learning models for personality trait recognition is also an important direction for our future work.</p>
</sec>
<sec id="S6" sec-type="data-availability">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. These data can be found here: <ext-link ext-link-type="uri" xlink:href="https://chalearnlap.cvc.uab.cat/dataset/24/description/">https://chalearnlap.cvc.uab.cat/dataset/24/description/</ext-link>.</p>
</sec>
<sec id="S7" sec-type="author-contributions">
<title>Author contributions</title>
<p>XZ contributed to the writing and drafted the article. YL, ZT, YX, XT, DW, and GW contributed to the data preprocessing and analysis, software and experiment simulation. HL contributed to the project administration and writing&#x2014;reviewing and editing. All authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec id="S8" sec-type="funding-information">
<title>Funding</title>
<p>This work was supported by the National Natural Science Foundation of China (NSFC) and the Zhejiang Provincial Natural Science Foundation of China under Grant Nos. 61976149, 62276180, LZ20F020002, and LQ21F020002.</p>
</sec>
<sec id="S9" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="S10" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>An</surname> <given-names>G.</given-names></name> <name><surname>Levitan</surname> <given-names>S. I.</given-names></name> <name><surname>Levitan</surname> <given-names>R.</given-names></name> <name><surname>Rosenberg</surname> <given-names>A.</given-names></name> <name><surname>Levine</surname> <given-names>M.</given-names></name> <name><surname>Hirschberg</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>Automatically classifying self-rated personality scores from speech</article-title>,&#x201D; in <source><italic>Proceedings of the INTERSPEECH Conference 2016</italic></source>, (<publisher-loc>Incheon</publisher-loc>: <publisher-name>ISCA</publisher-name>), <fpage>1412</fpage>&#x2013;<lpage>1416</lpage>. <pub-id pub-id-type="doi">10.21437/Interspeech.2016-1328</pub-id></citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bathurst</surname> <given-names>K.</given-names></name> <name><surname>Gottfried</surname> <given-names>A. W.</given-names></name> <name><surname>Gottfried</surname> <given-names>A. E.</given-names></name></person-group> (<year>1997</year>). <article-title>Normative data for the MMPI-2 in child custody litigation.</article-title> <source><italic>Psychol. Assess.</italic></source> <volume>9</volume>:<issue>205</issue>. <pub-id pub-id-type="doi">10.1037/1040-3590.9.3.205</pub-id> <pub-id pub-id-type="pmid">31681037</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Beyan</surname> <given-names>C.</given-names></name> <name><surname>Zunino</surname> <given-names>A.</given-names></name> <name><surname>Shahid</surname> <given-names>M.</given-names></name> <name><surname>Murino</surname> <given-names>V.</given-names></name></person-group> (<year>2021</year>). <article-title>Personality traits classification using deep visual activity-based nonverbal features of key-dynamic images.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>12</volume> <fpage>1084</fpage>&#x2013;<lpage>1099</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2019.2944614</pub-id></citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Costa</surname> <given-names>P. T.</given-names></name> <name><surname>McCrae</surname> <given-names>R. R.</given-names></name></person-group> (<year>1998</year>). &#x201C;<article-title>Trait theories of personality</article-title>,&#x201D; in <source><italic>Advanced Personality</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Barone</surname> <given-names>D. F.</given-names></name> <name><surname>Hersen</surname> <given-names>M.</given-names></name> <name><surname>Hasselt</surname> <given-names>V. B.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>103</fpage>&#x2013;<lpage>121</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-4419-8580-4_5</pub-id></citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Dong</surname> <given-names>W.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>L.-J.</given-names></name> <name><surname>Li</surname> <given-names>K.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2009</year>). &#x201C;<article-title>Imagenet: a large-scale hierarchical image database</article-title>,&#x201D; in <source><italic>Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition</italic></source>, <publisher-loc>Miami, FL</publisher-loc>, <fpage>248</fpage>&#x2013;<lpage>255</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2009.5206848</pub-id> <pub-id pub-id-type="pmid">26886976</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elman</surname> <given-names>J. L.</given-names></name></person-group> (<year>1990</year>). <article-title>Finding structure in time.</article-title> <source><italic>Cogn. Sci.</italic></source> <volume>14</volume> <fpage>179</fpage>&#x2013;<lpage>211</lpage>. <pub-id pub-id-type="doi">10.1207/s15516709cog1402_1</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Escalante</surname> <given-names>H. J.</given-names></name> <name><surname>Guyon</surname> <given-names>I.</given-names></name> <name><surname>Escalera</surname> <given-names>S.</given-names></name> <name><surname>Jacques</surname> <given-names>J.</given-names></name> <name><surname>Madadi</surname> <given-names>M.</given-names></name> <name><surname>Bar&#x00F3;</surname> <given-names>X.</given-names></name><etal/></person-group> (<year>2017</year>). &#x201C;<article-title>Design of an explainable machine learning challenge for video interviews</article-title>,&#x201D; in <source><italic>Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3688</fpage>&#x2013;<lpage>3695</lpage>. <pub-id pub-id-type="doi">10.1109/IJCNN.2017.7966320</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Escalante</surname> <given-names>H. J.</given-names></name> <name><surname>Kaya</surname> <given-names>H.</given-names></name> <name><surname>Salah</surname> <given-names>A. A.</given-names></name> <name><surname>Escalera</surname> <given-names>S.</given-names></name> <name><surname>G&#x00FC;&#x00E7;l&#x00FC;t&#x00FC;rk</surname> <given-names>Y.</given-names></name> <name><surname>G&#x00FC;&#x00E7;l&#x00FC;</surname> <given-names>U.</given-names></name><etal/></person-group> (<year>2022</year>). <article-title>Modeling, recognizing, and explaining apparent personality from videos.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>13</volume> <fpage>894</fpage>&#x2013;<lpage>911</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2020.2973984</pub-id> <pub-id pub-id-type="pmid">35044134</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Furnham</surname> <given-names>A.</given-names></name></person-group> (<year>1996</year>). <article-title>The big five versus the big four: the relationship between the Myers-Briggs Type Indicator (MBTI) and NEO-PI five factor model of personality.</article-title> <source><italic>Pers. Individ. Diff.</italic></source> <volume>21</volume> <fpage>303</fpage>&#x2013;<lpage>307</lpage>. <pub-id pub-id-type="doi">10.1016/0191-8869(96)00033-5</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>P.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>A survey on deep learning for multimodal data fusion.</article-title> <source><italic>Neural Comput.</italic></source> <volume>32</volume> <fpage>829</fpage>&#x2013;<lpage>864</lpage>. <pub-id pub-id-type="doi">10.1162/neco_a_01273</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gemmeke</surname> <given-names>J. F.</given-names></name> <name><surname>Ellis</surname> <given-names>D. P.</given-names></name> <name><surname>Freedman</surname> <given-names>D.</given-names></name> <name><surname>Jansen</surname> <given-names>A.</given-names></name> <name><surname>Lawrence</surname> <given-names>W.</given-names></name> <name><surname>Moore</surname> <given-names>R. C.</given-names></name><etal/></person-group> (<year>2017</year>). &#x201C;<article-title>Audio set: an ontology and human-labeled dataset for audio events</article-title>,&#x201D; in <source><italic>Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>776</fpage>&#x2013;<lpage>780</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP.2017.7952261</pub-id></citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x00FC;&#x00E7;l&#x00FC;t&#x00FC;rk</surname> <given-names>Y.</given-names></name> <name><surname>G&#x00FC;&#x00E7;l&#x00FC;</surname> <given-names>U.</given-names></name> <name><surname>Baro</surname> <given-names>X.</given-names></name> <name><surname>Escalante</surname> <given-names>H. J.</given-names></name> <name><surname>Guyon</surname> <given-names>I.</given-names></name> <name><surname>Escalera</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2017</year>). <article-title>Multimodal first impression analysis with deep residual networks.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>9</volume> <fpage>316</fpage>&#x2013;<lpage>329</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2017.2751469</pub-id></citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x00FC;&#x00E7;l&#x00FC;t&#x00FC;rk</surname> <given-names>Y.</given-names></name> <name><surname>G&#x00FC;&#x00E7;l&#x00FC;</surname> <given-names>U.</given-names></name> <name><surname>Van Gerven</surname> <given-names>M. A.</given-names></name> <name><surname>Van Lier</surname> <given-names>R.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>Deep impression: audiovisual deep residual networks for multimodal apparent personality trait recognition</article-title>,&#x201D; in <source><italic>Proceedings of the European Conference on Computer Vision</italic></source>, (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>349</fpage>&#x2013;<lpage>358</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-49409-8_28</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guntuku</surname> <given-names>S. C.</given-names></name> <name><surname>Qiu</surname> <given-names>L.</given-names></name> <name><surname>Roy</surname> <given-names>S.</given-names></name> <name><surname>Lin</surname> <given-names>W.</given-names></name> <name><surname>Jakhetiya</surname> <given-names>V.</given-names></name></person-group> (<year>2015</year>). &#x201C;<article-title>Do others perceive you as you want them to? Modeling personality based on selfies</article-title>,&#x201D; in <source><italic>Proceedings of the 1st International Workshop on Affect &#x0026; Sentiment in Multimedia, Association for Computing Machinery</italic></source>, (<publisher-loc>New York, NY</publisher-loc>), <fpage>21</fpage>&#x2013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1145/2813524.2813528</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x00FC;rp&#x0131;nar</surname> <given-names>F.</given-names></name> <name><surname>Kaya</surname> <given-names>H.</given-names></name> <name><surname>Salah</surname> <given-names>A. A.</given-names></name></person-group> (<year>2016</year>). &#x201C;<article-title>Combining deep facial and ambient features for first impression estimation</article-title>,&#x201D; in <source><italic>Proceedings of the European Conference on Computer Vision</italic></source>, (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>372</fpage>&#x2013;<lpage>385</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-49409-8_30</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hayat</surname> <given-names>H.</given-names></name> <name><surname>Ventura</surname> <given-names>C.</given-names></name> <name><surname>Lapedriza</surname> <given-names>&#x00C0;</given-names></name></person-group> (<year>2019</year>). <article-title>On the use of interpretable CNN for personality trait recognition from audio.</article-title> <source><italic>CCIA</italic></source> <volume>319</volume> <fpage>135</fpage>&#x2013;<lpage>144</lpage>.</citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hershey</surname> <given-names>S.</given-names></name> <name><surname>Chaudhuri</surname> <given-names>S.</given-names></name> <name><surname>Ellis</surname> <given-names>D. P.</given-names></name> <name><surname>Gemmeke</surname> <given-names>J. F.</given-names></name> <name><surname>Jansen</surname> <given-names>A.</given-names></name> <name><surname>Moore</surname> <given-names>R. C.</given-names></name><etal/></person-group> (<year>2017</year>). &#x201C;<article-title>CNN architectures for large-scale audio classification</article-title>,&#x201D; in <source><italic>Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>131</fpage>&#x2013;<lpage>135</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP.2017.7952132</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hochreiter</surname> <given-names>S.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>1997</year>). <article-title>Long short-term memory.</article-title> <source><italic>Neural Comput.</italic></source> <volume>9</volume> <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>. <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id> <pub-id pub-id-type="pmid">9377276</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Junior</surname> <given-names>J. C. S. J.</given-names></name> <name><surname>G&#x00FC;&#x00E7;l&#x00FC;t&#x00FC;rk</surname> <given-names>Y.</given-names></name> <name><surname>P&#x00E9;rez</surname> <given-names>M.</given-names></name> <name><surname>G&#x00FC;&#x00E7;l&#x00FC;</surname> <given-names>U.</given-names></name> <name><surname>Andujar</surname> <given-names>C.</given-names></name> <name><surname>Bar&#x00F3;</surname> <given-names>X.</given-names></name><etal/></person-group> (<year>2019</year>). <article-title>First impressions: a survey on vision-based apparent personality trait analysis.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>13</volume> <fpage>75</fpage>&#x2013;<lpage>95</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2019.2930058</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karson</surname> <given-names>S.</given-names></name> <name><surname>O&#x2019;Dell</surname> <given-names>J. W.</given-names></name></person-group> (<year>1976</year>). <source><italic>A Guide to the Clinical Use of the 16 PF.</italic></source> <publisher-loc>Champaign, IL</publisher-loc>: <publisher-name>Institute for Personality and Ability Testing</publisher-name>.</citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2012</year>). &#x201C;<article-title>ImageNet classification with deep convolutional neural networks</article-title>,&#x201D; in <source><italic>Advances in Neural Information Processing Systems</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Pereira</surname> <given-names>F.</given-names></name> <name><surname>Burges</surname> <given-names>C. J. C.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name> <name><surname>Weinberger</surname> <given-names>K. Q.</given-names></name></person-group> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>1097</fpage>&#x2013;<lpage>1105</lpage>.</citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep learning.</article-title> <source><italic>Nature</italic></source> <volume>521</volume> <fpage>436</fpage>&#x2013;<lpage>444</lpage>. <pub-id pub-id-type="doi">10.1038/nature14539</pub-id> <pub-id pub-id-type="pmid">26017442</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Haffner</surname> <given-names>P.</given-names></name></person-group> (<year>1998</year>). <article-title>Gradient-based learning applied to document recognition.</article-title> <source><italic>Proc. IEEE</italic></source> <volume>86</volume> <fpage>2278</fpage>&#x2013;<lpage>2324</lpage>. <pub-id pub-id-type="doi">10.1109/5.726791</pub-id></citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liang</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Yan</surname> <given-names>C.</given-names></name> <name><surname>Li</surname> <given-names>M.</given-names></name> <name><surname>Jiang</surname> <given-names>C.</given-names></name></person-group> (<year>2021</year>). <article-title>Explaining the black-box model: a survey of local interpretation methods for deep neural networks.</article-title> <source><italic>Neurocomputing</italic></source> <volume>419</volume> <fpage>168</fpage>&#x2013;<lpage>182</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2020.08.011</pub-id></citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>McCrae</surname> <given-names>R. R.</given-names></name> <name><surname>John</surname> <given-names>O. P.</given-names></name></person-group> (<year>1992</year>). <article-title>An introduction to the five-factor model and its applications.</article-title> <source><italic>J. Personal.</italic></source> <volume>60</volume> <fpage>175</fpage>&#x2013;<lpage>215</lpage>. <pub-id pub-id-type="doi">10.1111/j.1467-6494.1992.tb00970.x</pub-id> <pub-id pub-id-type="pmid">1635039</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mohammadi</surname> <given-names>G.</given-names></name> <name><surname>Vinciarelli</surname> <given-names>A.</given-names></name></person-group> (<year>2012</year>). <article-title>Automatic personality perception: prediction of trait attribution based on prosodic features.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>3</volume> <fpage>273</fpage>&#x2013;<lpage>284</lpage>. <pub-id pub-id-type="doi">10.1109/T-AFFC.2012.5</pub-id></citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Montag</surname> <given-names>C.</given-names></name> <name><surname>Davis</surname> <given-names>K. L.</given-names></name></person-group> (<year>2018</year>). <article-title>Affective neuroscience theory and personality: an update.</article-title> <source><italic>Personal. Neurosci.</italic></source> <volume>1</volume>:<issue>e12</issue>. <pub-id pub-id-type="doi">10.1017/pen.2018.10</pub-id> <pub-id pub-id-type="pmid">32435731</pub-id></citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Parkhi</surname> <given-names>O. M.</given-names></name> <name><surname>Vedaldi</surname> <given-names>A.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). &#x201C;<article-title>Deep face recognition</article-title>,&#x201D; in <source><italic>Proceedings of the British Machine Vision Conference (BMVC)</italic></source>, <publisher-loc>Swansea</publisher-loc>, <fpage>41.1</fpage>&#x2013;<lpage>41.12</lpage>. <pub-id pub-id-type="doi">10.5244/C.29.41</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ponce-L&#x00F3;pez</surname> <given-names>V.</given-names></name> <name><surname>Chen</surname> <given-names>B.</given-names></name> <name><surname>Oliu</surname> <given-names>M.</given-names></name> <name><surname>Corneanu</surname> <given-names>C.</given-names></name> <name><surname>Clap&#x00E9;s</surname> <given-names>A.</given-names></name> <name><surname>Guyon</surname> <given-names>I.</given-names></name><etal/></person-group> (<year>2016</year>). &#x201C;<article-title>ChaLearn LAP 2016: first round challenge on first impressions - dataset and results</article-title>,&#x201D; in <source><italic>Proceedings of the European Conference on Computer Vision</italic></source>, (<publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>400</fpage>&#x2013;<lpage>418</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-49409-8_32</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Prechelt</surname> <given-names>L.</given-names></name></person-group> (<year>1998</year>). <article-title>Automatic early stopping using cross validation: quantifying the criteria.</article-title> <source><italic>Neural Netw.</italic></source> <volume>11</volume> <fpage>761</fpage>&#x2013;<lpage>767</lpage>. <pub-id pub-id-type="doi">10.1016/S0893-6080(98)00010-0</pub-id> <pub-id pub-id-type="pmid">12662814</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Principi</surname> <given-names>R. D. P.</given-names></name> <name><surname>Palmero</surname> <given-names>C.</given-names></name> <name><surname>Junior</surname> <given-names>J. C.</given-names></name> <name><surname>Escalera</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>On the effect of observed subject biases in apparent personality analysis from audio-visual signals.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>12</volume> <fpage>607</fpage>&#x2013;<lpage>621</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2019.2956030</pub-id></citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Steidl</surname> <given-names>S.</given-names></name> <name><surname>Batliner</surname> <given-names>A.</given-names></name> <name><surname>Vinciarelli</surname> <given-names>A.</given-names></name> <name><surname>Scherer</surname> <given-names>K.</given-names></name> <name><surname>Ringeval</surname> <given-names>F.</given-names></name><etal/></person-group> (<year>2013</year>). &#x201C;<article-title>The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism</article-title>,&#x201D; in <source><italic>Proceedings of the INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association</italic></source>, <publisher-loc>Lyon</publisher-loc>. <pub-id pub-id-type="doi">10.21437/Interspeech.2013-56</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schuster</surname> <given-names>M.</given-names></name> <name><surname>Paliwal</surname> <given-names>K. K.</given-names></name></person-group> (<year>1997</year>). <article-title>Bidirectional recurrent neural networks.</article-title> <source><italic>IEEE Trans. Signal Process.</italic></source> <volume>45</volume> <fpage>2673</fpage>&#x2013;<lpage>2681</lpage>. <pub-id pub-id-type="doi">10.1109/78.650093</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Teijeiro-Mosquera</surname> <given-names>L.</given-names></name> <name><surname>Biel</surname> <given-names>J.-I.</given-names></name> <name><surname>Alba-Castro</surname> <given-names>J. L.</given-names></name> <name><surname>Gatica-Perez</surname> <given-names>D.</given-names></name></person-group> (<year>2014</year>). <article-title>What your face vlogs about: expressions of emotion and big-five traits impressions in YouTube.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>6</volume> <fpage>193</fpage>&#x2013;<lpage>205</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2014.2370044</pub-id></citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name><etal/></person-group> (<year>2017</year>). &#x201C;<article-title>Attention is all you need</article-title>,&#x201D; in <source><italic>Advances in Neural Information Processing Systems</italic></source>, <role>eds</role> <person-group person-group-type="editor"><name><surname>Guyon</surname> <given-names>I.</given-names></name> <name><surname>Von Luxburg</surname> <given-names>U.</given-names></name> <name><surname>Bengio</surname> <given-names>S.</given-names></name> <name><surname>Wallach</surname> <given-names>H.</given-names></name> <name><surname>Fergus</surname> <given-names>R.</given-names></name> <name><surname>Vishwanathan</surname> <given-names>S.</given-names></name><etal/></person-group> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>5998</fpage>&#x2013;<lpage>6008</lpage>.</citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vinciarelli</surname> <given-names>A.</given-names></name> <name><surname>Mohammadi</surname> <given-names>G.</given-names></name></person-group> (<year>2014</year>). <article-title>A survey of personality computing.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>5</volume> <fpage>273</fpage>&#x2013;<lpage>291</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2014.2330816</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name></person-group> (<year>2022</year>). <article-title>Affective video recommender systems: a survey.</article-title> <source><italic>Front. Neurosci.</italic></source> <volume>16</volume>:<issue>984404</issue>. <pub-id pub-id-type="doi">10.3389/fnins.2022.984404</pub-id> <pub-id pub-id-type="pmid">36090291</pub-id></citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>M.</given-names></name> <name><surname>Deng</surname> <given-names>W.</given-names></name></person-group> (<year>2021</year>). <article-title>Deep face recognition: a survey.</article-title> <source><italic>Neurocomputing</italic></source> <volume>429</volume> <fpage>215</fpage>&#x2013;<lpage>244</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2020.10.081</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>X. S.</given-names></name> <name><surname>Zhang</surname> <given-names>C. L.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>J. X.</given-names></name></person-group> (<year>2017</year>). <article-title>Deep bimodal regression of apparent personality traits from short video sequences.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>9</volume> <fpage>303</fpage>&#x2013;<lpage>315</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2017.2762299</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>J.</given-names></name> <name><surname>Tian</surname> <given-names>W.</given-names></name> <name><surname>Lv</surname> <given-names>G.</given-names></name> <name><surname>Liu</surname> <given-names>S.</given-names></name> <name><surname>Fan</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>Prediction of the big five personality traits using static facial images of college students with different academic backgrounds.</article-title> <source><italic>IEEE Access</italic></source> <volume>9</volume> <fpage>76822</fpage>&#x2013;<lpage>76832</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3076989</pub-id></citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yan</surname> <given-names>A.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Peng</surname> <given-names>L.</given-names></name> <name><surname>Yan</surname> <given-names>Q.</given-names></name> <name><surname>Hassan</surname> <given-names>M. U.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>Effective detection of mobile malware behavior based on explainable deep neural network.</article-title> <source><italic>Neurocomputing</italic></source> <volume>453</volume> <fpage>482</fpage>&#x2013;<lpage>492</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2020.09.082</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ye</surname> <given-names>M.</given-names></name> <name><surname>Shen</surname> <given-names>J.</given-names></name> <name><surname>Lin</surname> <given-names>G.</given-names></name> <name><surname>Xiang</surname> <given-names>T.</given-names></name> <name><surname>Shao</surname> <given-names>L.</given-names></name> <name><surname>Hoi</surname> <given-names>S. C.</given-names></name></person-group> (<year>2021</year>). <article-title>Deep learning for person re-identification: a survey and outlook.</article-title> <source><italic>IEEE Trans. Pattern Anal. Mach. Intellig.</italic></source> <volume>44</volume> <fpage>2872</fpage>&#x2013;<lpage>2893</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2021.3054775</pub-id> <pub-id pub-id-type="pmid">33497329</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Qiao</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Joint face detection and alignment using multitask cascaded convolutional networks.</article-title> <source><italic>IEEE Signal Process. Lett.</italic></source> <volume>23</volume> <fpage>1499</fpage>&#x2013;<lpage>1503</lpage>. <pub-id pub-id-type="doi">10.1109/LSP.2016.2603342</pub-id></citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Tian</surname> <given-names>Q.</given-names></name></person-group> (<year>2022</year>). <article-title>Spontaneous speech emotion recognition using multiscale deep convolutional LSTM.</article-title> <source><italic>IEEE Trans. Affect. Comput.</italic></source> <volume>13</volume> <fpage>680</fpage>&#x2013;<lpage>688</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2019.2947464</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>S.</given-names></name> <name><surname>Gholaminejad</surname> <given-names>A.</given-names></name> <name><surname>Ding</surname> <given-names>G.</given-names></name> <name><surname>Gao</surname> <given-names>Y.</given-names></name> <name><surname>Han</surname> <given-names>J.</given-names></name> <name><surname>Keutzer</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>Personalized emotion recognition by personality-aware high-order learning of physiological signals.</article-title> <source><italic>ACM Trans. Multimedia Comput. Commun. Appl.</italic></source> <volume>15</volume> <fpage>1</fpage>&#x2013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1145/3233184</pub-id> <pub-id pub-id-type="pmid">27534393</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Tang</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>Deep personality trait recognition: a survey.</article-title> <source><italic>Front. Psychol.</italic></source> <volume>13</volume>:<issue>839619</issue>. <pub-id pub-id-type="doi">10.3389/fpsyg.2022.839619</pub-id> <pub-id pub-id-type="pmid">35645923</pub-id></citation></ref>
<ref id="B47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>M.</given-names></name> <name><surname>Xie</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). &#x201C;<article-title>Automatic personality perception from speech in Mandarin</article-title>,&#x201D; in <source><italic>Proceedings of the 2018 11th International Symposium on Chinese Spoken Language Processing (ISCSLP)</italic></source>, <publisher-loc>Taipei</publisher-loc>, <fpage>309</fpage>&#x2013;<lpage>313</lpage>. <pub-id pub-id-type="doi">10.1109/ISCSLP.2018.8706692</pub-id></citation></ref>
</ref-list>
</back>
</article>