<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Psychol.</journal-id>
<journal-title>Frontiers in Psychology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Psychol.</abbrev-journal-title>
<issn pub-type="epub">1664-1078</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fpsyg.2021.762795</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Psychology</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>MIFAD-Net: Multi-Layer Interactive Feature Fusion Network With Angular Distance Loss for Face Emotion Recognition</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Cai</surname> <given-names>Weiwei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1304268/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Gao</surname> <given-names>Ming</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1445353/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>Runmin</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1445262/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Mao</surname> <given-names>Jie</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1377753/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>College of Sports Engineering and Information Technology, Wuhan Sports University</institution>, <addr-line>Wuhan</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Logistics and Transportation, Central South University of Forestry and Technology</institution>, <addr-line>Changsha</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>College of Sports Science and Technology, Wuhan Sports University</institution>, <addr-line>Wuhan</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Yaoru Sun, Tongji University, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Yuanpeng Zhang, Nantong University, China; Yufeng Yao, Changshu Institute of Technology, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jie Mao <email>maojie&#x00040;whsu.edu.cn</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Emotion Science, a section of the journal Frontiers in Psychology</p></fn></author-notes>
<pub-date pub-type="epub">
<day>22</day>
<month>10</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>12</volume>
<elocation-id>762795</elocation-id>
<history>
<date date-type="received">
<day>22</day>
<month>08</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>09</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2021 Cai, Gao, Liu and Mao.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Cai, Gao, Liu and Mao</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Understanding human emotions and psychology is a critical step toward realizing artificial intelligence, and correct recognition of facial expressions is essential for judging emotions. However, the differences caused by changes in facial expression are very subtle, and different expression features are hard to distinguish, making it difficult for computers to recognize human facial emotions accurately. Therefore, this paper proposes a novel multi-layer interactive feature fusion network model with angular distance loss. First, a multi-layer and multi-scale module is designed to extract global and local features of facial emotions and to capture part of the feature relationships between different scales, thereby improving the model&#x00027;s ability to discriminate subtle features of facial emotions. Second, a hierarchical interactive feature fusion module is designed to address the loss of useful feature information caused by layer-by-layer convolution and pooling in convolutional neural networks. In addition, an attention mechanism is used between convolutional layers at different levels to improve the network&#x00027;s discriminative ability by increasing the saliency of informative features at each layer and suppressing irrelevant information. Finally, we use the angular distance loss function to improve the proposed model&#x00027;s inter-class feature separation and intra-class feature clustering, addressing the large intra-class differences and high inter-class similarity in facial emotion recognition. We conducted comparison and ablation experiments on the FER2013 dataset. The results show that the proposed MIFAD-Net outperforms the compared methods by 1.02&#x02013;4.53% and is strongly competitive.</p></abstract>
<kwd-group>
<kwd>face emotion</kwd>
<kwd>emotion recognition</kwd>
<kwd>multi-layer interactive</kwd>
<kwd>feature fusion</kwd>
<kwd>deep learning</kwd>
<kwd>neural networks</kwd>
</kwd-group>
<counts>
<fig-count count="10"/>
<table-count count="6"/>
<equation-count count="12"/>
<ref-count count="41"/>
<page-count count="11"/>
<word-count count="6696"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>Introduction</title>
<p>Emotions are extremely important in everyday life. Human communication and behavioral judgment often depend on correctly understanding other people&#x00027;s emotions, and facial expressions carry a great deal of information about emotions and mental states. Recognizing facial expressions (Crivelli et al., <xref ref-type="bibr" rid="B8">2017</xref>; Chengeta and Viriri, <xref ref-type="bibr" rid="B6">2019</xref>; Gonz&#x000E1;lez-Lozoya et al., <xref ref-type="bibr" rid="B11">2020</xref>) is therefore key to understanding emotions. According to psychologists&#x00027; research, only 7% of the information in human communication comes from the words themselves, 38% from vocal cues such as speech pitch, and 55% from visual cues such as facial expressions. As a result, accurate recognition of facial expressions is critical for understanding information in human communication.</p>
<p>Facial emotion recognition (Sreedharan et al., <xref ref-type="bibr" rid="B28">2018</xref>; Jain et al., <xref ref-type="bibr" rid="B14">2019</xref>) can be used in a variety of situations. In human-computer interaction, accurately recognizing facial expressions to determine human emotions lets machines respond more appropriately, accurately, and effectively, resulting in more natural interaction and exchange with humans. In security scenarios, accurately identifying facial expressions and subtle expressions makes it possible to identify suspects with criminal intent in public places. In transportation, recognizing a driver&#x00027;s facial expressions makes it possible to better judge whether the driver is fatigued (Theagarajan et al., <xref ref-type="bibr" rid="B32">2017</xref>; Zepf et al., <xref ref-type="bibr" rid="B38">2020</xref>). Furthermore, facial expression recognition has received considerable attention in advertising (Hamelin et al., <xref ref-type="bibr" rid="B12">2017</xref>), marketing, automation, and communications.</p>
<p>In recent years, facial emotion recognition based on deep learning (Cai and Wei, <xref ref-type="bibr" rid="B5">2020</xref>; Cai et al., <xref ref-type="bibr" rid="B4">2021</xref>; Zhang et al., <xref ref-type="bibr" rid="B39">2021</xref>) has made great progress, but many problems remain. For example, recognition accuracy in real scenes is still not ideal. Among the basic emotion categories, negative emotions such as anger, disgust, and disappointment have no relatively uniform standard of facial expression, and their feature differences are minimal, which hinders feature learning and makes them difficult to recognize correctly. Furthermore, because the face occupies a relatively small area of an image, the data used to train facial emotion recognition models have a small input size, whereas current convolutional neural network (CNN) models (Bendjoudi et al., <xref ref-type="bibr" rid="B1">2020</xref>; Kollias and Zafeiriou, <xref ref-type="bibr" rid="B17">2020</xref>; Kwon, <xref ref-type="bibr" rid="B19">2021</xref>) require a relatively large input image. Enlarging the images by interpolation and similar methods increases the amount of computation without significantly improving recognition.</p>
<p>Based on the above observations, this paper first designs a multi-layer and multi-scale module that extracts the global and local features of facial expressions and captures part of the feature relationships between different scales, thereby enhancing the model&#x00027;s ability to discriminate subtle features of facial expressions. Second, to address the loss of useful feature information caused by layer-by-layer convolution and pooling in CNNs, a hierarchical interactive feature fusion module is designed. The attention mechanism (Gao et al., <xref ref-type="bibr" rid="B10">2021</xref>; Liu et al., <xref ref-type="bibr" rid="B22">2021</xref>) is used between convolutional layers at different levels to strengthen the salient information of different features in the network and suppress irrelevant information, thereby improving the discriminative ability of the network. Finally, to address the large intra-class differences and high inter-class similarity in facial expression recognition, we use the angular distance loss function to improve the proposed algorithm&#x00027;s ability to separate features between classes and cluster features within classes.</p>
<p>The main innovations of this paper are as follows:</p>
<list list-type="simple">
<list-item><p>(1) To address the problem of subtle differences in facial emotion causing difficulty in classification, we designed a multi-layer and multi-scale module to extract global and local facial emotion features to capture partial feature relationships between different scales, thereby improving the model&#x00027;s ability to discriminate subtle facial emotion features.</p></list-item>
<list-item><p>(2) To address the loss of useful feature information caused by layer-by-layer convolution and pooling in convolutional neural networks, we designed a hierarchical interactive feature fusion module that applies the attention mechanism between convolutional layers at different levels, increasing the saliency of informative features and suppressing irrelevant information.</p></list-item>
<list-item><p>(3) We use the angular distance loss function to improve the capabilities of the proposed algorithm for feature separation between classes and clustering of features within classes, with the goal of addressing the problem of large intra-class differences and high inter-class similarity in face emotion recognition.</p></list-item>
</list>
<p>The remainder of this article is organized as follows. We introduce relevant work in section Related work, describe the proposed algorithm in section Methodology, and present the experimental results in section Experiments and Results. This paper&#x00027;s research conclusions are presented in section Conclusion.</p>
</sec>
<sec id="s2">
<title>Related work</title>
<sec>
<title>Emotion Recognition Based on Traditional Machine Learning</title>
<p>Emotion recognition methods based on traditional machine learning (Bota et al., <xref ref-type="bibr" rid="B2">2019</xref>; Kerkeni et al., <xref ref-type="bibr" rid="B15">2019</xref>; Dom&#x000ED;nguez-Jim&#x000E9;nez et al., <xref ref-type="bibr" rid="B9">2020</xref>) manually extract features from emotion images and then apply a suitable classification algorithm. Specifically, hand-picked feature extraction operators are used to extract facial features, the extracted features are reduced in dimension, and a classifier is then applied to the reduced features. Kumar et al. (<xref ref-type="bibr" rid="B18">2016</xref>) argued that different regions of the face contribute to expression recognition to different extents and that key locations, such as the mouth and eyes, carry important feature information. They therefore proposed a weighted projection LBP feature extraction algorithm for different informative regions and improved the accuracy of expression recognition by cascading the weighted features of those regions. As a linear filter, the Gabor filter is robust to illumination changes, and its frequency and orientation can be varied to analyze texture features. Zhang et al. (<xref ref-type="bibr" rid="B40">2014</xref>) proposed an emotion recognition method for occluded faces. The method used a Monte Carlo algorithm to extract features based on Gabor templates in the image, and the resulting features were robust to occlusion. Harit et al. (<xref ref-type="bibr" rid="B13">2018</xref>) proposed an automatic expression recognition algorithm that constructed an expression feature space using multiple Gabor filters at reference points before sending it to a neural network for classification. Wang et al. (<xref ref-type="bibr" rid="B35">2017</xref>) developed a multi-scale geometric feature extraction method that mapped the original expression information to geometric feature functions and then used those functions for further analysis. Tarannum et al. (<xref ref-type="bibr" rid="B31">2016</xref>) took the Euclidean distances between various facial regions as features and then used the Canberra distance to classify them.</p>
<p>Once the expression features have been extracted, a classifier assigns them to an expression category. Common facial expression classifiers include the SVM algorithm (Xu et al., <xref ref-type="bibr" rid="B36">2012</xref>) and the KNN algorithm. Liew and Yairi (<xref ref-type="bibr" rid="B21">2015</xref>) proposed using an SVM to classify previously extracted HOG expression features, and this method achieved good results on the JAFFE data set. Ouellet (<xref ref-type="bibr" rid="B24">2014</xref>) replaced the softmax classification layer of the AlexNet network with SVM multi-classifiers and achieved better recognition results on the CK&#x0002B; expression library. Rieger et al. (<xref ref-type="bibr" rid="B26">2014</xref>) used a pattern recognition paradigm with spectral feature extraction and a set of KNN classifiers to investigate speech-based emotion recognition, and found that using two KNNs yielded the best results.</p>
</sec>
<sec>
<title>Emotion Recognition Based on Deep Learning</title>
<p>In contrast to traditional machine learning methods, CNNs can automatically extract deep-level features of facial expressions by stacking multiple convolutional layers. This avoids the errors introduced by manual feature extraction and provides strong robustness and generalization ability, so CNNs have gradually become the mainstream method, and researchers have begun applying deep learning to facial expression recognition tasks. Mollahosseini et al. (<xref ref-type="bibr" rid="B23">2016</xref>) proposed a 7-layer convolutional neural network that combined the AlexNet and GoogLeNet models and verified it on seven public expression data sets; it was faster and more accurate than a traditional convolutional neural network. Zhang et al. (<xref ref-type="bibr" rid="B41">2019</xref>) used a stacked hybrid autoencoder to recognize expressions. Three encoders were used in the network structure: a denoising autoencoder, a sparse autoencoder, and an autoencoder. Feature extraction was done with the denoising autoencoder, and the sparse autoencoder was cascaded to extract more abstract sparse features. Tang (<xref ref-type="bibr" rid="B30">2013</xref>) combined CNNs with an SVM and used the hinge loss instead of the cross-entropy loss commonly used in convolutional neural networks. On the FER2013 data set, the recognition rate was 71.2%, and the team won the Kaggle facial expression recognition competition in 2013. After analyzing the network structures used for expression recognition with deep CNNs, Pramerdorfer and Kampel (<xref ref-type="bibr" rid="B25">2016</xref>) improved the classic ResNet, taking a single face image as input to extract facial expression features and achieving an average recognition rate of 72.4% on the FER2013 data set. Lee et al. (<xref ref-type="bibr" rid="B20">2019</xref>) proposed a deep network for context-aware emotion recognition that uses not only human facial expressions but also context information in a joint and enhanced manner, which effectively improves the performance of emotion recognition.</p>
<p>Although deep learning has achieved excellent results in face emotion recognition tasks, the differences caused by facial expression changes are very subtle, and different facial expression features are hard to distinguish, which still limits face emotion recognition accuracy.</p>
</sec>
</sec>
<sec sec-type="methods" id="s3">
<title>Methodology</title>
<p>The proposed MIFAD-Net model is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. It consists of three columns, each a network operating at a different coarse-to-fine scale with convolution kernel sizes of 7, 5, and 3, so that more refined facial emotion features can be extracted. Each column has six convolutional layers and five BN layers, and the three columns share a common facial emotion image as input. The feature maps of the last three convolutional layers at the same depth in the three columns then interact through a feature splicing strategy, integrating cross-layer features within the same network and across different networks and capturing the deep connections between different levels to facilitate the subsequent classification of facial emotion features. In addition, the three columns also interact through an addition strategy, and the attention mechanism is used to focus on the effective features. Finally, we employ the angular distance loss function to improve the model&#x00027;s ability to separate features between classes and cluster features within classes, addressing the large intra-class variance and high inter-class similarity in facial expression recognition, and Softmax is then used to determine the facial expression category. The proposed model is described in more detail below.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Schematic diagram of the proposed MIFAD-Net algorithm.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0001.tif"/>
</fig>
<sec>
<title>Multi-Scale and Multi-Layer Interactive Feature Fusion</title>
<p>This paper proposes a multi-scale and multi-layer interactive feature fusion module, shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, to make better use of the features of facial emotion images at different scales. The module first merges the feature maps of the three coarse-to-fine networks (kernel sizes 3 &#x000D7; 3, 5 &#x000D7; 5, and 7 &#x000D7; 7) through a 3 &#x000D7; 3 convolution. One branch is then activated by a Sigmoid to generate feature weights, which are multiplied element-wise with the three feature maps to obtain the re-calibrated feature maps, and the final output is obtained through the feature splicing strategy. The module updates itself through back propagation and automatically selects the multi-scale features that each branch needs to fuse.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Schematic diagram of multi-scale and multi-layer interactive feature fusion.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0002.tif"/>
</fig>
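<p>The gating idea in this module can be illustrated with a minimal sketch in plain Python. It is not the authors&#x00027; implementation: the function name <italic>gated_fusion</italic> is hypothetical, the learned merge step is replaced by an element-wise sum so that no trained weights are needed, and feature splicing is read here as channel-wise concatenation. All three choices are assumptions made for illustration only.</p>
<preformat>
```python
import math
from typing import List

FeatureMap = List[List[float]]  # one H x W channel


def sigmoid(v: float) -> float:
    """Logistic function used by the gate branch."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(maps: List[FeatureMap]) -> List[FeatureMap]:
    """Re-calibrate same-sized feature maps with a shared sigmoid gate.

    Assumption: the learned merge convolution is replaced by a plain
    element-wise sum so that the sketch needs no trained weights.
    """
    rows, cols = len(maps[0]), len(maps[0][0])
    # Merge step (hypothetical stand-in for the learned merge).
    merged = [
        [sum(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)
    ]
    # Gate branch: the sigmoid turns the merged response into weights in (0, 1).
    gate = [[sigmoid(v) for v in row] for row in merged]
    # Multiply every input map element-wise by the gate. The returned list
    # of maps plays the role of channel-wise splicing (concatenation).
    return [
        [[gate[r][c] * m[r][c] for c in range(cols)] for r in range(rows)]
        for m in maps
    ]


# Toy 2 x 2 responses standing in for the 3 x 3, 5 x 5 and 7 x 7 branches.
f3 = [[1.0, 0.0], [0.0, 0.0]]
f5 = [[-1.0, 0.0], [0.0, 0.0]]
f7 = [[2.0, 0.0], [0.0, 0.0]]
fused = gated_fusion([f3, f5, f7])
```
</preformat>
<p>Because the gate is computed from all three branches, a position that responds strongly at any scale is emphasized in every branch, which is one plausible reading of how the module selects the multi-scale features to be fused.</p>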
<sec>
<title>Multi-Scale Convolution</title>
<p>Using multi-scale convolution kernels has two major advantages. First, convolution kernels of different sizes extract information from facial emotion images at multiple scales, allowing the filters to extract and learn richer high-dimensional features. Second, a convolutional neural network is trained by learning the filters&#x00027; parameters (weights and biases), i.e., by continuously updating them so that the output approaches the label as closely as possible. This article employs multi-scale convolution kernels so that a single convolutional layer has several filters, diversifying the weight and bias learning and thereby extracting and learning the semantic features of facial emotion images fully and efficiently. A schematic diagram of multi-scale convolution is shown in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Schematic diagram of multi-scale convolution.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0003.tif"/>
</fig>
<p>Multi-scale inference is commonly used in computer vision models to achieve the best results. Fine details are better predicted at larger sizes, whereas larger objects are better predicted at smaller sizes, where the network&#x00027;s receptive field can interpret the scene better. Convolution kernels of size 3 &#x000D7; 3, 5 &#x000D7; 5, and 7 &#x000D7; 7 are employed in this article&#x00027;s multi-scale convolution, which is computed as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>&#x003D5;</mml:mi><mml:mrow><mml:mo stretchy="true">{</mml:mo><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msubsup><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002A;</mml:mo><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">j</mml:mtext></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="true">}</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>H</italic> &#x000D7; <italic>W</italic> represents the size of the convolution kernel.</p>
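<p>As an illustration of Equation (1), the following sketch applies one kernel per scale to a single-channel input in plain Python. Several simplifications are assumptions rather than details from the paper: the sum over input channels is reduced to one channel, mean filters stand in for learned weights, zero padding keeps the output the same size as the input, and ReLU is used as the activation. The helper names <italic>conv2d_same</italic> and <italic>multi_scale_features</italic> are hypothetical.</p>
<preformat>
```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def conv2d_same(x: Matrix, kernel: Matrix, bias: float = 0.0) -> Matrix:
    """Zero-padded 'same' cross-correlation of one channel with a square kernel."""
    k = len(kernel)
    pad = k // 2
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = bias
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    # Positions outside the image contribute zero (zero padding).
                    if rr in range(rows) and cc in range(cols):
                        acc += kernel[i][j] * x[rr][cc]
            out[r][c] = acc
    return out


def relu(x: Matrix) -> Matrix:
    """Activation applied after the weighted sum (assumed to be ReLU here)."""
    return [[max(0.0, v) for v in row] for row in x]


def multi_scale_features(
    x: Matrix, sizes: Sequence[int] = (3, 5, 7)
) -> Dict[int, Matrix]:
    """Apply one kernel per scale; mean filters stand in for learned weights."""
    features = {}
    for k in sizes:
        kernel = [[1.0 / (k * k)] * k for _ in range(k)]
        features[k] = relu(conv2d_same(x, kernel))
    return features


# An 8 x 8 constant image: every scale returns a map of the same size.
image = [[1.0] * 8 for _ in range(8)]
feats = multi_scale_features(image)
```
</preformat>
<p>With learned kernels in place of the mean filters, the three outputs would correspond to the fine, medium, and coarse responses that the three columns of the network pass on to the fusion stage.</p>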
</sec>
<sec>
<title>Attention Mechanism</title>
<p>Different regions of a facial emotion image carry different weights for different tasks: the more relevant a region is to the task, the more important it is. The attention module designed in this article is composed of cascaded channel attention and spatial attention. A schematic diagram of the channel attention mechanism is shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Schematic diagram of the channel attention module.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0004.tif"/>
</fig>
<p>The given input feature <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is fed into the proposed channel attention module. First, global average pooling (GAP) and max-pooling are applied in parallel to compress the feature map along the spatial axes, generating two <italic>C</italic> &#x000D7; 1 &#x000D7; 1 dimensional feature vectors <inline-formula><mml:math id="M3"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, which are summed element-wise to obtain the aggregated feature <inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. This aggregated feature then passes through a convolutional layer with a kernel size of 1 &#x000D7; 1, followed by PReLU and BatchNorm, to give the intermediate feature map <inline-formula><mml:math id="M6"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mtext class="textrm" mathvariant="normal">onv</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02295;</mml:mo><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x02295;</mml:mo><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x02295; denotes element-wise summation and &#x003D5; denotes the convolution operation. Then, <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is reshaped and transposed to obtain two feature maps with dimensions <italic>C</italic>&#x000D7;1 and 1&#x000D7;<italic>C</italic>, and matrix multiplication followed by a softmax operation yields the channel attention matrix <italic>A</italic><sub><italic>c</italic></sub>. The calculation is as follows:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M9"><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x000A0;</mml:mo><mml:mtext class="textrm" mathvariant="normal">max</mml:mtext><mml:mo stretchy='true'>(</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:msubsup><mml:mo>&#x02297;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='true'>(</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mi>p</mml:mi><mml:mi>c</mml:mi></mml:msubsup><mml:mo stretchy='true'>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy='true'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula>
<p>where &#x02297; represents matrix multiplication, and the following equation can be obtained:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext class="textrm" mathvariant="normal">c</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">i</mml:mtext><mml:mo>,</mml:mo><mml:mtext class="textrm" mathvariant="normal">j</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">exp</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">i</mml:mtext><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msubsup><mml:mtext class="textrm" mathvariant="normal">exp</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo 
stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>A</italic><sub><italic>c</italic><sub><italic>i,j</italic></sub></sub> represents the influence of the <italic>i</italic>-th channel on the <italic>j</italic>-th channel. Finally, the input feature <italic>F</italic><sub><italic>in</italic></sub> is multiplied by the channel attention matrix <italic>A</italic><sub><italic>c</italic></sub> to obtain the channel-refined feature <inline-formula><mml:math id="M11"><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02295;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02297;</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1; is a learnable parameter whose initial value is generally set to 0 to ease convergence during the first few training cycles. In this way, the channel attention matrix <italic>A</italic><sub><italic>c</italic></sub> can be regarded as a kernel selector that selects the filters used to describe the emotional characteristics of the face.</p>
<p>In addition to channel attention, we also cascade a spatial attention module to learn the relationships within the spatial structure of the intermediate feature maps. The spatial attention module, which can be used in conjunction with the channel attention module, generates a spatial attention matrix that focuses attention on the parts that best represent facial feature information. Following the same strategy as the channel attention module, average pooling and maximum pooling are applied along the channel axis and concatenated to form an effective feature descriptor; a convolution operation followed by matrix multiplication and a softmax layer then yields the final spatial attention matrix. The spatial attention module is depicted schematically in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Schematic diagram of spatial attention module.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0005.tif"/>
</fig>
<p>Given a channel-refined feature <inline-formula><mml:math id="M13"><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, the spatial attention module first applies GAP and max-pooling in parallel to compress the feature map along the channel axis, yielding two feature maps <inline-formula><mml:math id="M14"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M15"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> with a dimension of 1 &#x000D7; <italic>H</italic> &#x000D7; <italic>W</italic>. The two maps are then concatenated along the channel axis to merge them into the aggregated feature <inline-formula><mml:math id="M16"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. After concatenation, a convolutional layer with a kernel size of 3&#x000D7;3 is applied, followed by the PReLU and BatchNorm operations, to obtain the intermediate feature map <inline-formula><mml:math id="M17"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. In the convolution, the stride is set to 1 and the padding is also 1 so that the size of the feature map remains unchanged. Then:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>v</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C6;</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn><mml:mo>&#x000D7;</mml:mo><mml:mn>3</mml:mn></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003C6; represents the convolution operation. The intermediate feature map <inline-formula><mml:math id="M19"><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is reshaped and transposed to obtain two feature maps of size <italic>HW</italic> &#x000D7; 1 and 1 &#x000D7; <italic>HW</italic>. Matrix multiplication, followed by a softmax operation on each row, then yields the spatial attention matrix <italic>A</italic><sub><italic>s</italic></sub>:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M20"><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:msub><mml:mi>A</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mo>&#x000A0;</mml:mo><mml:mtext class="textrm" mathvariant="normal">max</mml:mtext><mml:mo stretchy='true'>(</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mi>p</mml:mi><mml:mi>s</mml:mi></mml:msubsup><mml:mo>&#x02297;</mml:mo><mml:msup><mml:mrow><mml:mo stretchy='true'>(</mml:mo><mml:msubsup><mml:mi>F</mml:mi><mml:mi>p</mml:mi><mml:mi>s</mml:mi></mml:msubsup><mml:mo stretchy='true'>)</mml:mo></mml:mrow><mml:mi>T</mml:mi></mml:msup><mml:mo stretchy='true'>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula>
<p>where &#x02297; represents matrix multiplication, and the following equation can be obtained:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">exp</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">i</mml:mtext><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>H</mml:mi><mml:mi>W</mml:mi></mml:mrow></mml:msubsup><mml:mtext class="textrm" mathvariant="normal">exp</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo 
stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>A</italic><sub><italic>s</italic><sub><italic>i,j</italic></sub></sub> represents the influence of the <italic>i</italic>-th spatial position on the <italic>j</italic>-th spatial position. Finally, the channel-refined feature <italic>F</italic><sub><italic>C</italic></sub> is multiplied by the spatial attention matrix <italic>A</italic><sub><italic>s</italic></sub> to obtain the refined feature <inline-formula><mml:math id="M22"><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>H</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02295;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02297;</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B2; is a learnable parameter whose initial value is generally set to 0 to ease convergence during the first few training cycles. The spatial attention matrix <italic>A</italic><sub><italic>s</italic></sub> can be regarded as a position mask that focuses on the most important parts of the facial feature map. The final structure of the attention module in the proposed algorithm is shown in <xref ref-type="fig" rid="F6">Figure 6</xref>.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Schematic diagram of attention module.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0006.tif"/>
</fig>
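<p>The spatial attention steps in Equations 6&#x02013;9 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors&#x00027; implementation: the function names, the PReLU slope of 0.25, the per-map standardization used as a stand-in for BatchNorm, the single output channel of the convolution, and the way the <italic>HW</italic> &#x000D7; <italic>HW</italic> matrix is applied to the flattened <italic>C</italic> &#x000D7; <italic>HW</italic> feature are all choices made for the example.</p>
<preformat>
```python
import numpy as np


def softmax_rows(x):
    # Numerically stable softmax applied independently to each row.
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def conv3x3_same(x, kernel):
    # x: (2, H, W); kernel: (2, 3, 3). Stride 1 and zero padding 1 keep H and W.
    # Assumption: the convolution produces a single output map.
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            out += np.tensordot(kernel[:, dy, dx], window, axes=1)
    return out


def spatial_attention(f_c, kernel, beta=0.0, prelu_slope=0.25):
    # f_c: channel-refined feature of shape (C, H, W).
    c, h, w = f_c.shape
    f_avg = f_c.mean(axis=0, keepdims=True)          # average over channels, (1, H, W)
    f_max = f_c.max(axis=0, keepdims=True)           # maximum over channels, (1, H, W)
    f_cat = np.concatenate([f_avg, f_max], axis=0)   # channel concatenation, (2, H, W)

    f_p = conv3x3_same(f_cat, kernel)                                  # Eq. 6
    f_p = np.maximum(f_p, 0.0) + prelu_slope * np.minimum(f_p, 0.0)    # PReLU
    f_p = (f_p - f_p.mean()) / (f_p.std() + 1e-5)    # stand-in for BatchNorm (assumption)

    col = f_p.reshape(h * w, 1)                      # HW x 1
    a_s = softmax_rows(col @ col.T)                  # Eq. 7, HW x HW, row-wise softmax

    # Assumption: each position receives a weighted sum over all positions.
    f_flat = f_c.reshape(c, h * w)
    attended = (f_flat @ a_s.T).reshape(c, h, w)
    return f_c + beta * attended, a_s                # Eq. 9
```
</preformat>
<p>With <italic>beta</italic> set to 0, as at initialization, the function returns the input feature unchanged, so the attention branch contributes only as &#x003B2; is learned.</p>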
</sec>
<sec>
<title>Interaction Mechanism</title>
<p><xref ref-type="fig" rid="F7">Figure 7</xref> depicts a schematic representation of multi-layer feature interaction and flow. The orange features are extracted by the 7 &#x000D7; 7 convolution kernel, the blue features by the 5 &#x000D7; 5 convolution kernel, and the green features by the 3 &#x000D7; 3 convolution kernel. The multi-layer feature interaction module captures the feature information between layers of different scales and, through the spatial attention mechanism, also extracts the feature relationships between those layers.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Schematic diagram of multi-layer feature interaction and feature flow.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0007.tif"/>
</fig>
</sec>
</sec>
<sec>
<title>Angular Distance Loss</title>
<p>In order to further deal with the problems of large intra-class differences and high similarity between classes in facial emotion image recognition, we use the angular distance loss function to improve the capabilities of the proposed algorithm for feature separation between classes and clustering of features within classes. The calculation equation of angular distance loss is as follows:</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>L</mml:mi></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">ads</mml:mtext></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mtext class="textrm" mathvariant="normal">-</mml:mtext><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">m</mml:mtext></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">i</mml:mtext><mml:mo>=</mml:mo><mml:mtext class="textrm" mathvariant="normal">1</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">m</mml:mtext></mml:mrow></mml:munderover></mml:mstyle><mml:mtext class="textrm" mathvariant="normal">log</mml:mtext><mml:mfrac><mml:mrow><mml:mtext class="textrm" mathvariant="normal">exp</mml:mtext><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext class="textrm" mathvariant="normal">s</mml:mtext><mml:mo>&#x000D7;</mml:mo><mml:mtext class="textrm" mathvariant="normal">cos</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext class="textrm" mathvariant="normal">y</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">i</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mtext class="textrm" mathvariant="normal">m</mml:mtext></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">exp</mml:mtext><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtext class="textrm" 
mathvariant="normal">s</mml:mtext><mml:mo>&#x000D7;</mml:mo><mml:mtext class="textrm" mathvariant="normal">cos</mml:mtext><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext class="textrm" mathvariant="normal">y</mml:mtext></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">i</mml:mtext></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mtext class="textrm" mathvariant="normal">m</mml:mtext></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">j</mml:mtext><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">n</mml:mtext></mml:mrow></mml:munderover></mml:mstyle><mml:mtext class="textrm" mathvariant="normal">exp</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext class="textrm" mathvariant="normal">s</mml:mtext><mml:mo>&#x000D7;</mml:mo><mml:mtext class="textrm" mathvariant="normal">cos</mml:mtext><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mtext class="textrm" mathvariant="normal">j</mml:mtext></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>s</italic> is the scaling factor, cos(&#x003B8;<sub>y<sub>i</sub></sub> &#x0002B; m) is the angular distance, and m is the margin that determines the size of the distance. The decision boundaries of softmax and of the angular distance loss function in the binary classification case are shown in <xref ref-type="fig" rid="F8">Figure 8</xref>, where the blue dashed line represents the classification decision boundary. Softmax classifies by angle, whereas the angular distance loss directly controls the distance between the classification decision boundaries in the angular space through the decision margin m, thereby increasing the distance between classes, which benefits the classification decision.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Schematic diagram of angular distance loss.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0008.tif"/>
</fig>
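<p>Equation 10 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors&#x00027; implementation: the function and argument names are hypothetical, the default values of <italic>s</italic> and the margin are illustrative only, and the sum in the denominator is taken over the non-target classes, as is common for additive angular margin losses (Equation 10 does not state this explicitly). The margin is named <italic>margin</italic> in the code to keep it distinct from the <italic>m</italic> in the averaging term of Equation 10.</p>
<preformat>
```python
import numpy as np


def angular_distance_loss(features, weights, labels, s=30.0, margin=0.5):
    # features: (batch, d) embeddings; weights: (n_classes, d) class vectors;
    # labels: (batch,) integer class ids. s and margin are illustrative values.
    labels = np.asarray(labels)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)

    cos = np.clip(f @ w.T, -1.0, 1.0)        # cos(theta_j) for every class j
    theta = np.arccos(cos)

    rows = np.arange(len(labels))
    logits = s * cos
    # Add the angular margin only to the target-class angle.
    logits[rows, labels] = s * np.cos(theta[rows, labels] + margin)

    # Log-softmax over classes, then average the negative target log-probability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```
</preformat>
<p>With the margin set to 0, the function reduces to a softmax cross-entropy on scaled cosine similarities; a positive margin lowers the target-class logit and therefore requires a larger angular gap between classes to reach the same loss.</p>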
</sec>
</sec>
<sec id="s4">
<title>Experiments and Results</title>
<sec>
<title>Experimental Setup</title>
<p>All experiments in this section are run on the same server to ensure a fair evaluation of the proposed algorithm. The server&#x00027;s configuration is as follows: the operating system is Windows 10, the GPU is an NVIDIA GTX1080 (11G), the memory is 16G, and the CPU is an AMD Ryzen 7 1700X. The deep learning development framework is Keras 2.1.5 with CUDA 9 and cuDNN 7, and the programming language is Python 3.6.5. Adam is used as the optimizer, with a batch size of 16, a learning rate of 0.001, and 300 epochs.</p>
</sec>
<sec>
<title>Experimental Data Set</title>
<p>The FER2013 data set is the official data set of Kaggle&#x00027;s 2013 facial expression recognition competition. Because the majority of the images were collected by web crawlers, they cover faces of various ages and angles, partially obscured faces, and other variations. The data set therefore contains some inaccuracies, but it is close to natural facial expressions. FER2013 contains 35,887 images in total, divided into a training set, a public test set, and a private test set. The training set contains 28,709 images, and the public and private test sets each contain 3,589 images. <xref ref-type="table" rid="T1">Table 1</xref> shows the data distribution in this data set, with the seven expressions labeled 0&#x02013;6. Because most of the data come from web crawlers, the backgrounds are also complex, which makes recognition difficult.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Data distribution of FER2013 dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th/>
<th valign="top" align="center"><bold>Angry</bold></th>
<th valign="top" align="center"><bold>Fear</bold></th>
<th valign="top" align="center"><bold>Disgust</bold></th>
<th valign="top" align="center"><bold>Happy</bold></th>
<th valign="top" align="center"><bold>Sad</bold></th>
<th valign="top" align="center"><bold>Surprise</bold></th>
<th valign="top" align="center"><bold>Neutral</bold></th>
<th valign="top" align="center"><bold>Total</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Label</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">5</td>
<td valign="top" align="center">6</td>
<td valign="top" align="center">&#x02013;</td>
</tr>
<tr>
<td valign="top" align="left">Train set</td>
<td valign="top" align="center">3,995</td>
<td valign="top" align="center">4,097</td>
<td valign="top" align="center">436</td>
<td valign="top" align="center">7,215</td>
<td valign="top" align="center">4,830</td>
<td valign="top" align="center">3,171</td>
<td valign="top" align="center">4,965</td>
<td valign="top" align="center">28,709</td>
</tr>
<tr>
<td valign="top" align="left">Validation set</td>
<td valign="top" align="center">467</td>
<td valign="top" align="center">496</td>
<td valign="top" align="center">56</td>
<td valign="top" align="center">895</td>
<td valign="top" align="center">653</td>
<td valign="top" align="center">415</td>
<td valign="top" align="center">607</td>
<td valign="top" align="center">3,589</td>
</tr>
<tr>
<td valign="top" align="left">Test set</td>
<td valign="top" align="center">491</td>
<td valign="top" align="center">528</td>
<td valign="top" align="center">55</td>
<td valign="top" align="center">879</td>
<td valign="top" align="center">594</td>
<td valign="top" align="center">416</td>
<td valign="top" align="center">626</td>
<td valign="top" align="center">3,589</td>
</tr>
<tr>
<td valign="top" align="left">Total</td>
<td valign="top" align="center">4,953</td>
<td valign="top" align="center">5,121</td>
<td valign="top" align="center">547</td>
<td valign="top" align="center">8,989</td>
<td valign="top" align="center">6,077</td>
<td valign="top" align="center">4,002</td>
<td valign="top" align="center">6,198</td>
<td valign="top" align="center">35,887</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>Evaluation Method</title>
<p>Facial emotion recognition research mainly uses accuracy and confusion matrix as the evaluation indicators of the model. The accuracy rate represents the ratio of the number of correctly identified samples to the total number of samples, which can reveal the overall recognition ability of the model. The calculation equation is as follows:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M26"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>A</mml:mi><mml:mi>c</mml:mi><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>c</mml:mi><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>T</mml:mi><mml:msup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>TP</italic><sup><italic>i</italic></sup> represents the number of correctly classified samples in the <italic>i</italic>-th category, <italic>C</italic> represents the number of categories, and <italic>N</italic> represents the total number of samples.</p>
<p>The confusion matrix is a square matrix of size (<italic>Z, Z</italic>), where <italic>Z</italic> is the number of categories. The element <italic>CP</italic><sub><italic>ij</italic></sub> in the <italic>i</italic>-th row and <italic>j</italic>-th column is the probability that a sample whose true label is the <italic>i</italic>-th category is predicted as the <italic>j</italic>-th category. The calculation equation is as follows:</p>
<disp-formula id="E12"><label>(12)</label><mml:math id="M27"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>C</mml:mi><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>n</italic><sup><italic>ij</italic></sup> represents the number of samples whose true label is the <italic>i</italic>-th class and whose predicted label is the <italic>j</italic>-th class, and <italic>n</italic><sup><italic>i</italic></sup> represents the total number of samples in the <italic>i</italic>-th class. By analyzing the confusion matrix, the accuracy of the model in each category can be measured.</p>
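<p>Equations 11 and 12 can be computed from true and predicted labels as in the following NumPy sketch. The function names are hypothetical and chosen for this example only; rows with no samples are left as zeros, which is an assumption the equations do not address.</p>
<preformat>
```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes):
    # counts[i, j] = number of samples with true class i and predicted class j.
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return counts


def accuracy(counts):
    # Eq. 11: correctly classified samples (the diagonal) over all samples.
    return np.trace(counts) / counts.sum()


def confusion_probabilities(counts):
    # Eq. 12: divide each row by the number of samples in that true class.
    row_totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_totals, 1)
```
</preformat>
<p>The diagonal of the row-normalized matrix gives the per-class recognition rate, which is what a confusion matrix plot of this kind reports for each expression.</p>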
</sec>
<sec>
<title>Experimental Results</title>
<p>To better evaluate the effectiveness of the proposed algorithm, we compared it with several well-known methods on the FER2013 data set; the results are shown in <xref ref-type="table" rid="T2">Table 2</xref>. Turan et al. (<xref ref-type="bibr" rid="B33">2018</xref>) proposed the Soft Locality Preserving Map (SLPM), a manifold learning method that aims to control the diffusion level of different classes, effectively reducing the dimensionality of feature vectors and enhancing the extracted features. However, its improvement in discriminative ability for facial expression recognition is not ideal. Yang et al. (<xref ref-type="bibr" rid="B37">2018</xref>) proposed a facial expression recognition method based on residual expressions, in which residual learning is used to generate the residuals of the middle layers of the model. These residuals contain the expression components of the input expression image recorded in the generative model, but the feature connections between levels are not captured, and the classification effect is not good. The expression recognition rate of Shao and Qian (<xref ref-type="bibr" rid="B27">2019</xref>) on the two data sets is not high because of insufficient expression feature extraction. In addition, we also compared with InceptionV4, DNNRL, Multi-scale CNN, and Hybrid CNN-SIFT aggregator.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparison of recognition rates of different algorithms on FER2013 dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">InceptionV4 (Szegedy et al., <xref ref-type="bibr" rid="B29">2017</xref>)</td>
<td valign="top" align="center">0.7080</td>
</tr>
<tr>
<td valign="top" align="left">DNNRL (Kim et al., <xref ref-type="bibr" rid="B16">2016</xref>)</td>
<td valign="top" align="center">0.7082</td>
</tr>
<tr>
<td valign="top" align="left">Multi-scale CNN (Wang and Yuan, <xref ref-type="bibr" rid="B34">2016</xref>)</td>
<td valign="top" align="center">0.7282</td>
</tr>
<tr>
<td valign="top" align="left">SLPM (Turan et al., <xref ref-type="bibr" rid="B33">2018</xref>)</td>
<td valign="top" align="center">0.7091</td>
</tr>
<tr>
<td valign="top" align="left">DeRL (Yang et al., <xref ref-type="bibr" rid="B37">2018</xref>)</td>
<td valign="top" align="center">0.7264</td>
</tr>
<tr>
<td valign="top" align="left">Shao (Shao and Qian, <xref ref-type="bibr" rid="B27">2019</xref>)</td>
<td valign="top" align="center">0.7114</td>
</tr>
<tr>
<td valign="top" align="left">Hybrid CNN-SIFT aggregator (Connie et al., <xref ref-type="bibr" rid="B7">2017</xref>)</td>
<td valign="top" align="center">0.7340</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Ours</bold></td>
<td valign="top" align="center"><bold>0.7416</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>It can be seen from <xref ref-type="table" rid="T2">Table 2</xref> that the proposed algorithm achieves the best classification accuracy among the compared methods on the FER2013 data set, which was collected in a complex environment. This is because the proposed method makes full use of the multi-scale features of the multi-layer interactive feature fusion network, integrates the cross-layer deep feature representation, captures the subtle deep-level changes in expression, and, through multi-layer feature fusion, gradually restores the useful feature information lost as the expression image passes from layer to layer. This addresses the problem of interaction between model layers and multi-layer feature fusion, and improves the network&#x00027;s ability to distinguish facial expressions caused by subtle changes in the corners of the mouth, eyebrows, and eyes. In addition, the angular distance loss function effectively alleviates the problems of large intra-class differences and high inter-class similarity in facial expression recognition, and is more suitable for subtle facial expression classification. The confusion matrix of the proposed algorithm on the test set is shown in <xref ref-type="fig" rid="F9">Figure 9</xref>. Moreover, although the training data are unbalanced, as shown in <xref ref-type="fig" rid="F10">Figure 10</xref>, the proposed algorithm still achieves competitive classification performance.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Confusion matrix of MIFAD-Net on FER2013 testing set.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0009.tif"/>
</fig>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>FER2013 training data distribution.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fpsyg-12-762795-g0010.tif"/>
</fig>
</sec>
<sec>
<title>Ablation Experiment for Different Loss Functions</title>
<p>To further verify the effectiveness of the angular distance loss function in the proposed algorithm, ablation experiments are set up in this section. We introduce Island loss and center loss for ablation studies. In addition, for fair comparison, all experiments are performed in the same environment. The results of the ablation experiment are shown in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Results of ablation experiments for different loss functions on FER2013 dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Softmax</td>
<td valign="top" align="center">0.7211</td>
</tr>
<tr>
<td valign="top" align="left">Island&#x0002B;Softmax</td>
<td valign="top" align="center">0.7385</td>
</tr>
<tr>
<td valign="top" align="left">Center loss&#x0002B;Softmax</td>
<td valign="top" align="center">0.7294</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Ours</bold></td>
<td valign="top" align="center"><bold>0.7416</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="T3">Table 3</xref> shows that the angular distance loss yields the best classification performance (0.7416). Island loss (0.7385) outperforms center loss (0.7294), and softmax alone performs worst (0.7211). Softmax only predicts the probability of each category; it does not explicitly optimize the distances between or within classes, so the learned features are not sufficiently discriminative. Center loss reduces the differences between features within a class, and Island loss (Cai et al., <xref ref-type="bibr" rid="B3">2018</xref>) improves on it by adding a constraint on the facial expression features that enlarges the distance between classes, thereby improving classification performance.</p>
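<p>To make the difference between the two baseline losses concrete, the following minimal NumPy sketch computes the center loss term and the additional pairwise center penalty of Island loss in their commonly published forms. It covers only the comparison baselines, not the angular distance loss of the proposed algorithm. The weight <monospace>lam</monospace> and the toy features and centers are assumptions chosen for illustration.</p>
<preformat><![CDATA[
```python
import numpy as np


def center_loss(features, labels, centers):
    """0.5 * sum of squared distances between each feature and its class center."""
    diffs = features - centers[labels]
    return 0.5 * np.sum(diffs ** 2)


def island_penalty(centers, eps=1e-12):
    """Sum of (cosine similarity + 1) over all ordered pairs of distinct centers."""
    norms = np.linalg.norm(centers, axis=1, keepdims=True)
    unit = centers / np.maximum(norms, eps)
    cos = unit @ unit.T
    off_diag = ~np.eye(len(centers), dtype=bool)
    return np.sum(cos[off_diag] + 1.0)


def island_loss(features, labels, centers, lam=0.5):
    """Center loss plus a weighted penalty that pushes class centers apart.

    lam is a hypothetical weight chosen only for this example.
    """
    return center_loss(features, labels, centers) + lam * island_penalty(centers)


# Toy 2-D features for two classes (illustrative values only).
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
features = np.array([[1.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 1])

lc = center_loss(features, labels, centers)
pen = island_penalty(centers)
il = island_loss(features, labels, centers)

# Centers pointing in opposite directions incur no penalty.
pen_opposite = island_penalty(np.array([[1.0, 0.0], [-1.0, 0.0]]))
```
]]></preformat>
<p>The center loss term only pulls features toward their own class center, whereas the penalty term is smallest when the class centers point in opposite directions. This matches the observation above that Island loss additionally enlarges the distance between classes.</p>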
</sec>
<sec>
<title>Ablation Experiment for Multi-Layer and Multi-Scale</title>
<p>A parallel three-branch network is used in the proposed algorithm. An ablation experiment is set up in this section to further prove its effectiveness. For comparison, we add two-branch and four-branch networks. The ablation experiment&#x00027;s results are shown in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Results of ablation experiments for multi-layer and multi-scale on FER2013 dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Two-branch</td>
<td valign="top" align="center">0.7105</td>
</tr>
<tr>
<td valign="top" align="left">Four-branch</td>
<td valign="top" align="center">0.7391</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Ours</bold></td>
<td valign="top" align="center"><bold>0.7416</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="T4">Table 4</xref> shows that using a two-branch network reduces the model&#x00027;s classification performance substantially (0.7105 vs. 0.7416), whereas using a four-branch network does not improve it (0.7391). The three-branch design used in the proposed algorithm therefore gives the best accuracy of the configurations tested.</p>
</sec>
<sec>
<title>Ablation Experiment for Attention Mechanism</title>
<p>This section sets up an ablation experiment to verify the effect of the attention mechanism on the classification performance of the proposed algorithm. &#x0201C;No-attention&#x0201D; denotes the model without the attention mechanism. <xref ref-type="table" rid="T5">Table 5</xref> shows the results of the ablation experiment.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Results of ablation experiments for attention mechanism on FER2013 dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">No-attention</td>
<td valign="top" align="center">0.7262</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Ours</bold></td>
<td valign="top" align="center"><bold>0.7416</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="T5">Table 5</xref> shows that without the attention mechanism the accuracy of the model drops from 0.7416 to 0.7262, a decrease of 1.54 percentage points. Facial expressions consist of muscle movements in specific parts of the face, and the features produced by these local areas contain the information that best describes expressions. Therefore, using the attention mechanism to quantify the importance of each spatial position in the feature map and to focus on the areas with rich emotional information is beneficial to the recognition task.</p>
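<p>The idea of weighting each spatial position of a feature map by its importance can be sketched as follows. This is a generic illustration rather than the attention module of the proposed algorithm: the channel-mean saliency score, the sigmoid gating, and the function names are all assumptions made for the example.</p>
<preformat><![CDATA[
```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(feature_map):
    """Reweight a (C, H, W) feature map with one weight per spatial position.

    Assumption: the saliency score of a position is the mean over channels,
    squashed to (0, 1) by a sigmoid. A trained module would learn this mapping.
    """
    saliency = feature_map.mean(axis=0)           # shape (H, W)
    mask = sigmoid(saliency)                      # one weight per position
    attended = feature_map * mask[None, :, :]     # broadcast over channels
    return attended, mask


# Toy feature map: 2 channels on a 2x2 grid with one strongly activated position.
feature_map = np.zeros((2, 2, 2))
feature_map[:, 0, 0] = 4.0

attended, mask = spatial_attention(feature_map)
strongest = np.unravel_index(mask.argmax(), mask.shape)
```
]]></preformat>
<p>Positions with strong activations receive weights close to 1, while uninformative positions are scaled down, so later layers see the emotionally informative regions more prominently.</p>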
</sec>
<sec>
<title>Ablation Experiment for Feature Fusion Strategy</title>
<p>To further verify the influence of the feature fusion strategy on the experimental results, an ablation experiment is carried out in this section. &#x0201C;Add&#x0201D; stands for the addition strategy, &#x0201C;C&#x0201D; stands for the concatenation (Concat) strategy, and &#x0201C;Mul&#x0201D; stands for the multiplication strategy. The results of the ablation experiment are shown in <xref ref-type="table" rid="T6">Table 6</xref>.</p>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Results of ablation experiments for feature fusion strategy on FER2013 dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Add</td>
<td valign="top" align="center">0.7325</td>
</tr>
<tr>
<td valign="top" align="left">Mul</td>
<td valign="top" align="center">0.7298</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Ours (C)</bold></td>
<td valign="top" align="center"><bold>0.7416</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="table" rid="T6">Table 6</xref> shows that the Concat strategy used by the proposed algorithm achieves the best result (0.7416). The addition strategy (0.7325) in turn performs better than the multiplication strategy (0.7298). This indicates that the way in which the extracted multi-scale features are fused affects the classification performance of the proposed algorithm, and it supports the choice of the Concat strategy.</p>
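<p>The three fusion strategies compared in <xref ref-type="table" rid="T6">Table 6</xref> differ in how two feature maps are combined. The following minimal NumPy sketch shows the three operations on toy feature maps of equal shape. The function name, the channel-first layout, and the tensor sizes are assumptions made for illustration and do not reproduce the network used in this paper.</p>
<preformat><![CDATA[
```python
import numpy as np


def fuse(a, b, strategy):
    """Fuse two (C, H, W) feature maps with the named strategy."""
    if strategy == "add":
        return a + b                             # element-wise sum, shape unchanged
    if strategy == "mul":
        return a * b                             # element-wise product, shape unchanged
    if strategy == "concat":
        return np.concatenate([a, b], axis=0)    # stack along the channel axis
    raise ValueError(f"unknown fusion strategy: {strategy}")


# Two toy feature maps with 4 channels on an 8x8 grid.
a = np.ones((4, 8, 8))
b = np.full((4, 8, 8), 2.0)

fused_add = fuse(a, b, "add")
fused_mul = fuse(a, b, "mul")
fused_cat = fuse(a, b, "concat")
```
]]></preformat>
<p>Addition and multiplication merge the two inputs into a single map of the same size, whereas concatenation keeps both inputs intact and doubles the number of channels, leaving it to the following layers to learn how to combine them.</p>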
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>Conclusion</title>
<p>In this paper, we propose a novel multi-layer interactive feature fusion network model with angular distance loss. First, a multi-layer and multi-scale module is designed to extract the global and local features of facial expressions and to capture part of the feature relationships between different scales, thereby enhancing the model&#x00027;s ability to discriminate subtle features of facial expressions. Second, to address the loss of useful feature information caused by the layer-by-layer convolution and pooling of convolutional neural networks, a hierarchical interactive feature fusion module is designed. An attention mechanism is used between convolutional layers at different levels to strengthen the salient information of different features in the network and to suppress irrelevant information, thereby improving the discriminative ability of the network. Finally, to address the large intra-class differences and high inter-class similarity in facial expression recognition, we use the angular distance loss function to improve the separation of features between classes and the clustering of features within classes. We conducted comparison and ablation experiments on the FER2013 data set. The results show that the proposed MIFAD-Net outperforms a number of well-known methods and is highly competitive.</p>
</sec>
<sec sec-type="data-availability" id="s6">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data/">https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data/</ext-link>.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>WC: conceptualization, methodology, software, and writing. JM: investigation. MG: data curation, software, and validation. RL: data curation and investigation. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This research was funded by the Scientific Research Program of Education Department of Hubei Province, China (D20184101), Higher Education Reform Project of Hubei Province, China (201707), East Lake Scholar of Wuhan Sport University Fund, China and Hubei Provincial University Specialty subject group construction Special fund, China.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bendjoudi</surname> <given-names>I.</given-names></name> <name><surname>Vanderhaegen</surname> <given-names>F.</given-names></name> <name><surname>Hamad</surname> <given-names>D.</given-names></name> <name><surname>Dornaika</surname> <given-names>F.</given-names></name></person-group> (<year>2020</year>). <article-title>Multi-label, multi-task CNN approach for context-based emotion recognition</article-title>. <source>Inf. Fusion</source> <volume>76</volume>, <fpage>422</fpage>&#x02013;<lpage>428</lpage>. <pub-id pub-id-type="doi">10.1016/j.inffus.2020.11.007</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bota</surname> <given-names>P. J.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Fred</surname> <given-names>A. L.</given-names></name> <name><surname>Da Silva</surname> <given-names>H. P.</given-names></name></person-group> (<year>2019</year>). <article-title>A review, current challenges, and future possibilities on emotion recognition using machine learning and physiological signals</article-title>. <source>IEEE Access</source> <volume>7</volume>, <fpage>140990</fpage>&#x02013;<lpage>141020</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2944001</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>J.</given-names></name> <name><surname>Meng</surname> <given-names>Z.</given-names></name> <name><surname>Khan</surname> <given-names>A. S.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>O&#x00027;Reilly</surname> <given-names>J.</given-names></name> <name><surname>Tong</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Island loss for learning discriminative features in facial expression recognition</article-title>, in <source>2018 13th IEEE International Conference on Automatic Face &#x00026; Gesture Recognition (FG 2018)</source> (<publisher-loc>Xi&#x00027;an</publisher-loc>), <fpage>302</fpage>&#x02013;<lpage>309</lpage>. <pub-id pub-id-type="doi">10.1109/FG.2018.00051</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>W.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Wei</surname> <given-names>Z.</given-names></name></person-group> (<year>2021</year>). <article-title>Multimodal data guided spatial feature fusion and grouping strategy for E-commerce commodity demand forecasting</article-title>. <source>Mobile Inf. Syst.</source> <volume>2021</volume>:<fpage>5541298</fpage>. <pub-id pub-id-type="doi">10.1155/2021/5568208</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>W.</given-names></name> <name><surname>Wei</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>). <article-title>PiiGAN: generative adversarial networks for pluralistic image inpainting</article-title>. <source>IEEE Access</source> <volume>8</volume>, <fpage>48451</fpage>&#x02013;<lpage>48463</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2020.2979348</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chengeta</surname> <given-names>K.</given-names></name> <name><surname>Viriri</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>A review of local, holistic and deep learning approaches in facial expressions Recognition</article-title>, in <source>2019 Conference on Information Communications Technology and Society (ICTAS)</source> (<publisher-loc>Durban</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>7</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Connie</surname> <given-names>T.</given-names></name> <name><surname>Al-Shabi</surname> <given-names>M.</given-names></name> <name><surname>Cheah</surname> <given-names>W. P.</given-names></name> <name><surname>Goh</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Facial expression recognition using a hybrid CNN-SIFT aggregator</article-title>, in <source>International Workshop on Multi-disciplinary Trends in Artificial Intelligence</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>139</fpage>&#x02013;<lpage>149</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Crivelli</surname> <given-names>C.</given-names></name> <name><surname>Russell</surname> <given-names>J. A.</given-names></name> <name><surname>Jarillo</surname> <given-names>S.</given-names></name> <name><surname>Fern&#x000E1;ndez-Dols</surname> <given-names>J. M.</given-names></name></person-group> (<year>2017</year>). <article-title>Recognizing spontaneous facial expressions of emotion in a small-scale society of Papua New Guinea</article-title>. <source>Emotion</source> <volume>17</volume>:<fpage>337</fpage>. <pub-id pub-id-type="doi">10.1037/emo0000236</pub-id><pub-id pub-id-type="pmid">27736108</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dom&#x000ED;nguez-Jim&#x000E9;nez</surname> <given-names>J. A.</given-names></name> <name><surname>Campo-Landines</surname> <given-names>K. C.</given-names></name> <name><surname>Mart&#x000ED;nez-Santos</surname> <given-names>J. C.</given-names></name> <name><surname>Delahoz</surname> <given-names>E. J.</given-names></name> <name><surname>Contreras-Ortiz</surname> <given-names>S. H.</given-names></name></person-group> (<year>2020</year>). <article-title>A machine learning model for emotion recognition from physiological signals</article-title>. <source>Biomed. Signal Process. Control</source> <volume>55</volume>:<fpage>101646</fpage>. <pub-id pub-id-type="doi">10.1016/j.bspc.2019.101646</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>M.</given-names></name> <name><surname>Cai</surname> <given-names>W.</given-names></name> <name><surname>Liu</surname> <given-names>R.</given-names></name></person-group> (<year>2021</year>). <article-title>AGTH-Net: attention-based graph convolution-guided third-order hourglass network for sports video classification</article-title>. <source>J. Healthc Eng.</source> <volume>2021</volume>:<fpage>8517161</fpage>. <pub-id pub-id-type="doi">10.1155/2021/8517161</pub-id><pub-id pub-id-type="pmid">34306600</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gonz&#x000E1;lez-Lozoya</surname> <given-names>S. M.</given-names></name> <name><surname>de la Calleja</surname> <given-names>J.</given-names></name> <name><surname>Pellegrin</surname> <given-names>L.</given-names></name> <name><surname>Escalante</surname> <given-names>H. J.</given-names></name> <name><surname>Medina</surname> <given-names>M. A.</given-names></name> <name><surname>Benitez-Ruiz</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Recognition of facial expressions based on CNN features</article-title>. <source>Multimed. Tools Appl.</source> <volume>79</volume>, <fpage>13987</fpage>&#x02013;<lpage>14007</lpage>. <pub-id pub-id-type="doi">10.1007/s11042-020-08681-4</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hamelin</surname> <given-names>N.</given-names></name> <name><surname>El Moujahid</surname> <given-names>O.</given-names></name> <name><surname>Thaichon</surname> <given-names>P.</given-names></name></person-group> (<year>2017</year>). <article-title>Emotion and advertising effectiveness: a novel facial expression analysis approach</article-title>. <source>J. Retail. Consum. Serv.</source> <volume>36</volume>, <fpage>103</fpage>&#x02013;<lpage>111</lpage>. <pub-id pub-id-type="doi">10.1016/j.jretconser.2017.01.001</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Harit</surname> <given-names>A.</given-names></name> <name><surname>Joshi</surname> <given-names>J. C.</given-names></name> <name><surname>Gupta</surname> <given-names>K. K.</given-names></name></person-group> (<year>2018</year>). <article-title>Facial emotions recognition using gabor transform and facial animation parameters with neural networks</article-title>. <source>IOP Conf. Ser. Mater. Sci. Eng.</source> <volume>331</volume>:<fpage>012013</fpage>. <pub-id pub-id-type="doi">10.1088/1757-899X/331/1/012013</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jain</surname> <given-names>D. K.</given-names></name> <name><surname>Shamsolmoali</surname> <given-names>P.</given-names></name> <name><surname>Sehdev</surname> <given-names>P.</given-names></name></person-group> (<year>2019</year>). <article-title>Extended deep neural network for facial emotion recognition</article-title>. <source>Pattern Recognit. Lett.</source> <volume>120</volume>, <fpage>69</fpage>&#x02013;<lpage>74</lpage>. <pub-id pub-id-type="doi">10.1016/j.patrec.2019.01.008</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kerkeni</surname> <given-names>L.</given-names></name> <name><surname>Serrestou</surname> <given-names>Y.</given-names></name> <name><surname>Raoof</surname> <given-names>K.</given-names></name> <name><surname>Mbarki</surname> <given-names>M.</given-names></name> <name><surname>Mahjoub</surname> <given-names>M. A.</given-names></name> <name><surname>Cleder</surname> <given-names>C.</given-names></name></person-group> (<year>2019</year>). <article-title>Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO</article-title>. <source>Speech Commun.</source> <volume>114</volume>, <fpage>22</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1016/j.specom.2019.09.002</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>B. K.</given-names></name> <name><surname>Roh</surname> <given-names>J.</given-names></name> <name><surname>Dong</surname> <given-names>S. Y.</given-names></name> <name><surname>Lee</surname> <given-names>S. Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Hierarchical committee of deep convolutional neural networks for robust facial expression recognition</article-title>. <source>J. Multimodal User Interf.</source> <volume>10</volume>, <fpage>173</fpage>&#x02013;<lpage>189</lpage>. <pub-id pub-id-type="doi">10.1007/s12193-015-0209-0</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kollias</surname> <given-names>D.</given-names></name> <name><surname>Zafeiriou</surname> <given-names>S. P.</given-names></name></person-group> (<year>2020</year>). <article-title>Exploiting multi-cnn features in cnn-rnn based dimensional emotion recognition on the omg in-the-wild dataset</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>12</volume>, <fpage>595</fpage>&#x02013;<lpage>606</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2020.3014171</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kumar</surname> <given-names>S.</given-names></name> <name><surname>Bhuyan</surname> <given-names>M. K.</given-names></name> <name><surname>Chakraborty</surname> <given-names>B. K.</given-names></name></person-group> (<year>2016</year>). <article-title>Extraction of informative regions of a face for facial expression recognition</article-title>. <source>IET Comput. Vis.</source> <volume>10</volume>, <fpage>567</fpage>&#x02013;<lpage>576</lpage>. <pub-id pub-id-type="doi">10.1049/iet-cvi.2015.0273</pub-id></citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kwon</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>MLT-DNet: speech emotion recognition using 1D dilated CNN based on multi-learning trick approach</article-title>. <source>Expert Syst. Appl.</source> <volume>167</volume>:<fpage>114177</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2020.114177</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>J.</given-names></name> <name><surname>Kim</surname> <given-names>S.</given-names></name> <name><surname>Kim</surname> <given-names>S.</given-names></name> <name><surname>Park</surname> <given-names>J.</given-names></name> <name><surname>Sohn</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). <article-title>Context-aware emotion recognition networks</article-title>, in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source> (<publisher-loc>Seoul</publisher-loc>), <fpage>10143</fpage>&#x02013;<lpage>10152</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liew</surname> <given-names>C. F.</given-names></name> <name><surname>Yairi</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <article-title>Facial expression recognition and analysis: a comparison study of feature descriptors</article-title>. <source>IPSJ Trans. Comput. Vis. Appl.</source> <volume>7</volume>, <fpage>104</fpage>&#x02013;<lpage>120</lpage>. <pub-id pub-id-type="doi">10.2197/ipsjtcva.7.104</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>R.</given-names></name> <name><surname>Ning</surname> <given-names>X.</given-names></name> <name><surname>Cai</surname> <given-names>W.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name></person-group> (<year>2021</year>). <article-title>Multiscale dense cross-attention mechanism with covariance pooling for hyperspectral image scene classification</article-title>. <source>Mobile Inf. Syst.</source> <volume>2021</volume>:<fpage>9962057</fpage>. <pub-id pub-id-type="doi">10.1155/2021/9962057</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mollahosseini</surname> <given-names>A.</given-names></name> <name><surname>Chan</surname> <given-names>D.</given-names></name> <name><surname>Mahoor</surname> <given-names>M. H.</given-names></name></person-group> (<year>2016</year>). <article-title>Going deeper in facial expression recognition using deep neural networks</article-title>, in <source>2016 IEEE Winter Conference on Applications of Computer Vision (WACV)</source> (<publisher-loc>Lake Placid, NY</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>10</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ouellet</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>Real-time emotion recognition for gaming using deep convolutional network features</article-title>. <source>arXiv:1408.3750 [arXiv preprint]</source>.</citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pramerdorfer</surname> <given-names>C.</given-names></name> <name><surname>Kampel</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Facial expression recognition using convolutional neural networks: state of the art</article-title>. <source>arXiv:1612.02903 [arXiv preprint]</source>.</citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rieger</surname> <given-names>S. A.</given-names></name> <name><surname>Muraleedharan</surname> <given-names>R.</given-names></name> <name><surname>Ramachandran</surname> <given-names>R. P.</given-names></name></person-group> (<year>2014</year>). <article-title>Speech based emotion recognition using spectral feature extraction and an ensemble of kNN classifiers</article-title>, in <source>The 9th International Symposium on Chinese Spoken Language Processing</source> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>589</fpage>&#x02013;<lpage>593</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shao</surname> <given-names>J.</given-names></name> <name><surname>Qian</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>Three convolutional neural network models for facial expression recognition in the wild</article-title>. <source>Neurocomputing</source> <volume>355</volume>, <fpage>82</fpage>&#x02013;<lpage>92</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2019.05.005</pub-id></citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sreedharan</surname> <given-names>N. P. N.</given-names></name> <name><surname>Ganesan</surname> <given-names>B.</given-names></name> <name><surname>Raveendran</surname> <given-names>R.</given-names></name> <name><surname>Sarala</surname> <given-names>P.</given-names></name> <name><surname>Dennis</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>Grey Wolf optimisation-based feature selection and classification for facial emotion recognition</article-title>. <source>IET Biometr.</source> <volume>7</volume>, <fpage>490</fpage>&#x02013;<lpage>499</lpage>. <pub-id pub-id-type="doi">10.1049/iet-bmt.2017.0160</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Szegedy</surname> <given-names>C.</given-names></name> <name><surname>Ioffe</surname> <given-names>S.</given-names></name> <name><surname>Vanhoucke</surname> <given-names>V.</given-names></name> <name><surname>Alemi</surname> <given-names>A. A.</given-names></name></person-group> (<year>2017</year>). <article-title>Inception-v4, inception-resnet and the impact of residual connections on learning</article-title>, in <source>Thirty-First AAAI Conference on Artificial Intelligence</source> (<publisher-loc>San Francisco, CA</publisher-loc>).</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>Y.</given-names></name></person-group> (<year>2013</year>). <article-title>Deep learning using linear support vector machines</article-title>. <source>arXiv:1306.0239 [arXiv preprint]</source>.</citation>
</ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tarannum</surname> <given-names>T.</given-names></name> <name><surname>Paul</surname> <given-names>A.</given-names></name> <name><surname>Talukder</surname> <given-names>K. H.</given-names></name></person-group> (<year>2016</year>). <article-title>Human expression recognition based on facial features</article-title>, in <source>2016 5th International Conference on Informatics, Electronics and Vision (ICIEV)</source> (<publisher-loc>Dhaka</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>990</fpage>&#x02013;<lpage>994</lpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Theagarajan</surname> <given-names>R.</given-names></name> <name><surname>Bhanu</surname> <given-names>B.</given-names></name> <name><surname>Cruz</surname> <given-names>A.</given-names></name> <name><surname>Le</surname> <given-names>B.</given-names></name> <name><surname>Tambo</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Novel representation for driver emotion recognition in motor vehicle videos</article-title>, in <source>2017 IEEE International Conference on Image Processing (ICIP)</source> (<publisher-loc>Beijing</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>810</fpage>&#x02013;<lpage>814</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Turan</surname> <given-names>C.</given-names></name> <name><surname>Lam</surname> <given-names>K. M.</given-names></name> <name><surname>He</surname> <given-names>X.</given-names></name></person-group> (<year>2018</year>). <article-title>Soft locality preserving map (SLPM) for facial expression recognition</article-title>. <source>arXiv:1801.03754 [arXiv preprint]</source>.</citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Yuan</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>Facial expression recognition with multi-scale convolution neural network</article-title>, in <source>Pacific Rim Conference on Multimedia</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>376</fpage>&#x02013;<lpage>385</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Q.</given-names></name> <name><surname>Lu</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>P.</given-names></name> <name><surname>Gao</surname> <given-names>Z.</given-names></name> <name><surname>Piao</surname> <given-names>Y.</given-names></name></person-group> (<year>2017</year>). <article-title>An information geometry-based distance between high-dimensional covariances for scalable classification</article-title>. <source>IEEE Trans. Circ. Syst. Video Technol.</source> <volume>28</volume>, <fpage>2449</fpage>&#x02013;<lpage>2459</lpage>. <pub-id pub-id-type="doi">10.1109/TCSVT.2017.2712704</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Dong</surname> <given-names>C.</given-names></name> <name><surname>Feng</surname> <given-names>Z.</given-names></name> <name><surname>Cao</surname> <given-names>T.</given-names></name></person-group> (<year>2012</year>). <article-title>Facial expression pervasive analysis based on Haar-like features and SVM</article-title>, in <source>International Conference on E-business Technology and Strategy</source> (<publisher-loc>Berlin; Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>521</fpage>&#x02013;<lpage>529</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>H.</given-names></name> <name><surname>Ciftci</surname> <given-names>U.</given-names></name> <name><surname>Yin</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>Facial expression recognition by de-expression residue learning</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2168</fpage>&#x02013;<lpage>2177</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zepf</surname> <given-names>S.</given-names></name> <name><surname>Hernandez</surname> <given-names>J.</given-names></name> <name><surname>Schmitt</surname> <given-names>A.</given-names></name> <name><surname>Minker</surname> <given-names>W.</given-names></name> <name><surname>Picard</surname> <given-names>R. W.</given-names></name></person-group> (<year>2020</year>). <article-title>Driver emotion recognition for intelligent vehicles: a survey</article-title>. <source>ACM Comput. Surv.</source> <volume>53</volume>, <fpage>1</fpage>&#x02013;<lpage>30</lpage>. <pub-id pub-id-type="doi">10.1145/3388790</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Sun</surname> <given-names>L.</given-names></name> <name><surname>Yu</surname> <given-names>L.</given-names></name> <name><surname>Dong</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Cai</surname> <given-names>W.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>ARFace: attention-aware and regularization for face recognition with reinforcement learning</article-title>, in <source>IEEE Transactions on Biometrics, Behavior, and Identity Science</source> (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/TBIOM.2021.3104014</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Tjondronegoro</surname> <given-names>D.</given-names></name> <name><surname>Chandran</surname> <given-names>V.</given-names></name></person-group> (<year>2014</year>). <article-title>Random Gabor based templates for facial expression recognition in images with facial occlusion</article-title>. <source>Neurocomputing</source> <volume>145</volume>, <fpage>451</fpage>&#x02013;<lpage>464</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2014.05.008</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z. Y.</given-names></name> <name><surname>Wang</surname> <given-names>R. Q.</given-names></name> <name><surname>Wei</surname> <given-names>M. M.</given-names></name></person-group> (<year>2019</year>). <article-title>Stack hybrid self-encoder facial expression recognition method</article-title>. <source>Comput. Eng. Appl.</source> <volume>55</volume>, <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.3778/j.issn.1002-8331.1803-0398</pub-id></citation>
</ref>
</ref-list>
</back>
</article>