<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="review-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Sig. Proc.</journal-id>
<journal-title>Frontiers in Signal Processing</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Sig. Proc.</abbrev-journal-title>
<issn pub-type="epub">2673-8198</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">861469</article-id>
<article-id pub-id-type="doi">10.3389/frsip.2022.861469</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Signal Processing</subject>
<subj-group>
<subject>Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Spatiotemporal Features Fusion From Local Facial Regions for Micro-Expressions Recognition</article-title>
<alt-title alt-title-type="left-running-head">Aouayeb et al.</alt-title>
<alt-title alt-title-type="right-running-head">STFF for Micro-Expression Recognition</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Aouayeb</surname>
<given-names>Mouath</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1595706/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Soladie</surname>
<given-names>Catherine</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1749115/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hamidouche</surname>
<given-names>Wassim</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1181828/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Kpalma</surname>
<given-names>Kidiyo</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1375840/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Seguier</surname>
<given-names>Renaud</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1749135/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Univ. Rennes</institution>, <institution>INSA Rennes</institution>, <institution>CNRS</institution>, <institution>IETR&#x2014;UMR</institution>, <addr-line>Rennes</addr-line>, <country>France</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Univ. Rennes</institution>, <institution>CentraleSup&#xe9;lec</institution>, <institution>CNRS</institution>, <institution>IETR&#x2014;UMR</institution>, <addr-line>Rennes</addr-line>, <country>France</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1126800/overview">Yuming Fang</ext-link>, Jiangxi University of Finance and Economics, China</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1289999/overview">Xianye Ben</ext-link>, Shandong University, China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/551367/overview">Sze Teng Liong</ext-link>, Feng Chia University, Taiwan</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Mouath Aouayeb, <email>aouayeb.mouath@insa-rennes.fr</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Image Processing, a section of the journal Frontiers in Signal Processing</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>13</day>
<month>04</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>2</volume>
<elocation-id>861469</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>01</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>14</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Aouayeb, Soladie, Hamidouche, Kpalma and Seguier.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Aouayeb, Soladie, Hamidouche, Kpalma and Seguier</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>The analysis of facial micro-expressions (MiEs) has applications in various fields, including emotional intelligence, psychotherapy, and police investigation. However, because MiEs are fast, subtle, and local reactions, both humans and machines find them difficult to detect and recognize. In this article, we propose a deep learning approach that addresses the locality and the temporal aspects of MiEs by learning spatiotemporal features from local facial regions. The distinctive feature of our method is the use of two fusion-based squeeze and excitation (SE) strategies that drive the model to learn the optimal combination of the spatiotemporal features extracted from each area. The proposed architecture enhances a previous automatic system for micro-expression recognition (MER) from local facial regions, which uses a composite deep learning model of a convolutional neural network (CNN) and long short-term memory (LSTM). Experiments on three spontaneous MiE datasets show that the proposed solution outperforms state-of-the-art approaches. Our code is available as open source at <ext-link ext-link-type="uri" xlink:href="https://github.com/MouathAb/AnalyseMiE-CNN_LSTM_SE">https://github.com/MouathAb/AnalyseMiE-CNN_LSTM_SE</ext-link>.</p>
</abstract>
<kwd-group>
<kwd>micro-expression recognition</kwd>
<kwd>squeeze and excitation</kwd>
<kwd>CNN</kwd>
<kwd>LSTM</kwd>
<kwd>active patches</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Analysis of MiEs plays an important role in several disciplines such as psychology, human&#x2013;machine interaction, and security because MiEs are universal, spontaneous, local, and low-intensity expressions (<xref ref-type="bibr" rid="B8">Ekman and Friesen, 1969</xref>). However, analyzing them is challenging because they are subtle and fast reflexes that last only from 1/25 to 1/5 s.</p>
<p>Since then, numerous researchers have proposed automated approaches for MER. Various strategies, ranging from handcrafted to deep learning, have been used to handle issues such as the low intensity of MiEs, the limited number of MiE samples, and the imbalance of the available data.</p>
<p>Our proposed solution relies on a recent and efficient region-based deep learning approach presented by <xref ref-type="bibr" rid="B1">Aouayeb et al. (2019</xref>). This method (<xref ref-type="bibr" rid="B1">Aouayeb et al., 2019</xref>) is unique in using an updated label vector based on emotion and its related action units (AUs) for each location in the spatial domain to learn more robust features. The main disadvantage of that method is the static selection of regions of interest (ROIs), with no guarantee that all areas of a region are essential for MER. Another drawback is that the spatiotemporal features from all regions are fused by a simple concatenation block, although each region may contribute with a different weight to different MiEs.</p>
<p>In this study, we aim to overcome these two issues. The proposed solution addresses the first issue by learning the active patches on each region and the second issue by learning the active region for each MiE sequence through time. Its novelty is to combine a CNN-LSTM deep learning architecture for spatiotemporal features extraction with a fusion attention block called squeeze and excitation (SE) (<xref ref-type="bibr" rid="B12">Hu et al., 2018</xref>) to learn more local features. As a result, the CNN is trained efficiently on more local areas, and the model learns the attention of each region&#x2019;s features extracted by the LSTM, which helps a fully connected layer (FCL) classify them. The resulting model outperforms state-of-the-art methods on three MiE datasets.</p>
<p>The principal contribution of this study concerns extracting more local characteristics of each ROI, identified by <xref ref-type="bibr" rid="B1">Aouayeb et al. (2019</xref>), using CNN and SE. By training the CNN with very local regions (patches), the model focuses on learning more local features and avoids those unnecessary for MER (e.g., edges, shapes, and textures). However, this could increase the redundancy of the spatial features extracted from different patches and harm the model&#x2019;s training. To alleviate this issue, we employ SE as an attention block to learn the active patches. The originality is that, for the first time, a deep learning model is trained on tiny regions to extract the very local features that different handcrafted approaches (<xref ref-type="bibr" rid="B36">Zhao and Xu, 2018</xref>; <xref ref-type="bibr" rid="B35">Zhao and Xu, 2019</xref>; <xref ref-type="bibr" rid="B34">Zhao et al., 2021</xref>) pointed out as essential for MER. The second contribution is to employ another SE block to learn the attention of the spatiotemporal features and identify the principal regions during a micro-expression sequence. As a result, a classifier could learn more efficiently.</p>
<p>The rest of this article is organized as follows. <xref ref-type="sec" rid="s2">Section 2</xref> presents the state-of-the-art solutions for MiE recognition. <xref ref-type="sec" rid="s3">Section 3</xref> describes the proposed spatiotemporal architecture for MiE recognition. The performance of the proposed solution is assessed and compared to the best-performing solutions in <xref ref-type="sec" rid="s4">Section 4</xref>. Finally, <xref ref-type="sec" rid="s5">Section 5</xref> concludes the article.</p>
</sec>
<sec id="s2">
<title>2 Related Work</title>
<p>In this section, we review different approaches for MER. The state-of-the-art solutions are grouped into four categories: handcrafted, deep learning, hybrid, and region-based solutions. For further details, a complete survey of micro-expression databases, features, and algorithms is provided by <xref ref-type="bibr" rid="B2">Ben et al. (2021)</xref>.</p>
<sec id="s2-1">
<title>2.1 Handcrafted Solutions</title>
<p>The pioneering works on MiE recognition are handcrafted solutions. <xref ref-type="bibr" rid="B33">Zhao and Pietikainen (2007</xref>) proposed Local Binary Pattern on Three Orthogonal Planes (LBP-TOP) for features extraction to detect the appearance of face information that describes the variation of pixel intensity. Subsequently, many variants of LBP-TOP were proposed for MER. <xref ref-type="bibr" rid="B28">Wang et al. (2014</xref>) proposed Local Binary Pattern (LBP) with six intersection points of the planes (<italic>x</italic>, <italic>y</italic>), (<italic>x</italic>, <italic>t</italic>), and (<italic>y</italic>, <italic>t</italic>) to reduce redundancy in LBP-TOP. <xref ref-type="bibr" rid="B11">Guo et al. (2019</xref>) proposed Extended LBP-TOP (ELBP-TOP), which computes three components (the LBP-TOP, the radial difference LBP-TOP, and the angular difference LBP-TOP) to explore the second order of local information in angular and radial directions. Different from these methods, <xref ref-type="bibr" rid="B23">Polikovsky et al. (2009</xref>) used the Histogram of Oriented Gradient (HOG) as a descriptor on particular regions of the face to recognize MiE. In addition, <xref ref-type="bibr" rid="B7">Duque et al. (2020</xref>) proposed the Mean Oriented Riesz Features (MORF) descriptor, which uses a Riesz pyramid to create an image pair and then extracts spatiotemporal features from it. Despite the progress in handcrafted solutions for MER and other computer vision tasks, their performance remains limited. In contrast, given the good results of deep learning methods on different computer vision problems, many researchers have turned to those methods for MER.</p>
</sec>
<sec id="s2-2">
<title>2.2 Deep Learning Solutions</title>
<p>Deep learning has been widely used for computer vision tasks such as face recognition, object detection, image segmentation, and tracking. Recently, deep learning architectures have been proposed to classify MiE videos/clips. <xref ref-type="bibr" rid="B22">Patel et al. (2016</xref>) used a model pre-trained on the ImageNet dataset and then fine-tuned its weights to classify macro- and micro-expressions. <xref ref-type="bibr" rid="B25">Reddy et al. (2019</xref>) proposed a 3D-CNN for spatiotemporal features extraction and then performed the classification using an FCL. <xref ref-type="bibr" rid="B24">Quang et al. (2019</xref>) adapted CapsuleNet (<xref ref-type="bibr" rid="B26">Sabour et al., 2017</xref>) for MER. Furthermore, <xref ref-type="bibr" rid="B4">Choi and Song (2020</xref>) created a 2D feature map based on the time variation of the distances between facial landmarks. Then, they fed the sequence of 2D feature maps to a combined architecture of CNN and LSTM to extract spatiotemporal features and classify them.</p>
<p>The main challenge for deep learning solutions in MiE analysis is twofold: the available datasets of spontaneous MiE sequences are limited, and their classes are imbalanced. To overcome these problems, <xref ref-type="bibr" rid="B31">Yu et al. (2020</xref>) used an improved architecture of conditional Generative Adversarial Nets (cGAN) (<xref ref-type="bibr" rid="B21">Mirza and Osindero, 2014</xref>) called Identity-aware and Capsule-Enhanced GAN (ICE-GAN) to synthesize and augment data. Their solution consists of a conditional encoder-decoder to generate synthesized MiE and a discriminator based on CapsuleNet (<xref ref-type="bibr" rid="B26">Sabour et al., 2017</xref>) to discriminate the real from the fake and identify the corresponding MiE class.</p>
<p>The results of the different deep learning solutions show an improvement over handcrafted solutions. However, the performance is still insufficient compared to other computer vision tasks. Hence, there is a need for other solutions.</p>
</sec>
<sec id="s2-3">
<title>2.3 Hybrid Solutions</title>
<p>Instead of choosing between handcrafted and deep learning approaches, some researchers benefit from both. Typically, handcrafted descriptors such as optical flow (OF) or LBP-TOP are computed first, and their output is fed to a CNN or a combination of a CNN and a recurrent neural network (RNN).</p>
<p>
<xref ref-type="bibr" rid="B17">Liong et al. (2019</xref>) proposed Shallow Triple Stream Three-dimensional CNN (STSTNet): the model used only the onset and apex frames to generate optical flow images (optical strain, horizontal flow, and vertical flow). The optical flow images are stacked with the raw image, followed by three CNNs and a fusion layer. <xref ref-type="bibr" rid="B37">Zhou et al. (2019</xref>) considered another approach: instead of extracting deep features from a mix of handcrafted features, they mixed the deep features extracted from the handcrafted ones separately. Precisely, they used a dual CNN model: one for the horizontal component and the other for the vertical component of OF calculated from a mid-position frame that represents the apex and onset frame. The two outputs are merged by FCL to perform the classification. <xref ref-type="bibr" rid="B29">Xia et al. (2020</xref>) studied the effect of lower-resolution data on shallow architecture models. They proposed an OF map as input for a recurrent convolutional network with shallow architectures and used a neural architecture search (NAS) (<xref ref-type="bibr" rid="B18">Liu H et al., 2019</xref>) strategy to find an optimal combination of wide extension, short connection, and attention units for strong features with low learning complexity.</p>
<p>By mixing the handcrafted and deep learning approaches so that each covers the other&#x2019;s flaws, hybrid solutions gained a significant performance improvement over previous approaches. However, the results are still limited.</p>
</sec>
<sec id="s2-4">
<title>2.4 Region-Based Solutions</title>
<p>MiE video classification has evolved from handcrafted models (<xref ref-type="bibr" rid="B33">Zhao and Pietikainen, 2007</xref>; <xref ref-type="bibr" rid="B6">Davison A et al., 2018</xref>; <xref ref-type="bibr" rid="B7">Duque et al., 2020</xref>) to deep spatiotemporal networks (<xref ref-type="bibr" rid="B22">Patel et al., 2016</xref>; <xref ref-type="bibr" rid="B25">Reddy et al., 2019</xref>; <xref ref-type="bibr" rid="B31">Yu et al., 2020</xref>) and hybrid solutions (<xref ref-type="bibr" rid="B10">Gan et al., 2019</xref>; <xref ref-type="bibr" rid="B17">Liong et al., 2019</xref>; <xref ref-type="bibr" rid="B29">Xia et al., 2020</xref>). However, the improvements in MiE analysis are more modest than those in other computer vision tasks such as human action recognition (<xref ref-type="bibr" rid="B13">Ji et al., 2013</xref>). This observation reveals the challenge of MER and invites researchers to address the characteristic of MiE as a spatially local expression. Previous works focused on the time and movement specificities of MiE. Recently, some researchers (<xref ref-type="bibr" rid="B36">Zhao and Xu, 2018</xref>, <xref ref-type="bibr" rid="B35">2019</xref>; <xref ref-type="bibr" rid="B1">Aouayeb et al., 2019</xref>) have proposed applying the previous approaches to selected regions of interest (ROI) instead of the whole face to address the locality aspect of MiE. Such solutions lead to significant improvements over state-of-the-art works. The current work also follows a region-based approach: it extracts robust spatiotemporal features from local regions using a deep learning architecture for efficient MER. Inspired by existing works (<xref ref-type="bibr" rid="B12">Hu et al., 2018</xref>; <xref ref-type="bibr" rid="B1">Aouayeb et al., 2019</xref>; <xref ref-type="bibr" rid="B3">Chen et al., 2019</xref>), we integrate fusion units to learn active patches on each region and active regions along each MiE temporal sequence.</p>
</sec>
</sec>
<sec id="s3">
<title>3 Proposed Solution</title>
<p>In this section, the proposed approach is presented in more detail. The overall flow of the proposed system for automatic MER is illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>. The framework integrates a preprocessing step to normalize the input data. It also includes two processing streams. The first uses a CNN to extract the spatial structures of each region. The second extracts spatiotemporal structures and classifies them. To sum up, our ultimate goal is to reduce the features extracted from the whole face that are not useful for MER. This is achieved by extracting features from ROIs only and integrating a double fusion system in both space and time to add attention to the most relevant spatiotemporal features.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Overview of the proposed solution. Images reproduced from the SAMM database with permission from <xref ref-type="bibr" rid="B5">Davison A. K et al. (2018</xref>).</p>
</caption>
<graphic xlink:href="frsip-02-861469-g001.tif"/>
</fig>
<sec id="s3-1">
<title>3.1 Preprocessing: ROI Extraction</title>
<p>The selected ROIs are based on the Necessary Morphological Patches (NMPs) presented by <xref ref-type="bibr" rid="B35">Zhao and Xu (2019</xref>). First, an automatic technique (<xref ref-type="bibr" rid="B14">Kazemi and Sullivan, 2014</xref>) based on HOG and a linear classifier (the algorithm is provided in the dlib<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref> library) is used to detect the 68 facial landmarks. Second, we align and crop the face based on these landmarks. Third, we identify the ROIs that must contain the AUs responsible for a MiE.</p>
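<p>The alignment and cropping method is not detailed in this section. As a rough sketch under stated assumptions, and not the implementation used in this work, one common choice is to estimate the in-plane rotation of the face from the two eye centres and to crop an axis-aligned box around the landmarks. The function names <italic>roll_angle_degrees</italic> and <italic>bounding_box</italic> and the <italic>margin</italic> parameter are hypothetical; the eye index ranges assume the usual 0-based ordering of the 68-point landmark scheme.</p>
<preformat>
```python
import math

# Assumed 0-based index ranges of the two eyes in the 68-point landmark scheme.
IMAGE_LEFT_EYE = range(36, 42)
IMAGE_RIGHT_EYE = range(42, 48)


def centroid(points):
    """Mean (x, y) of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def roll_angle_degrees(landmarks):
    """Angle, in degrees, of the line joining the two eye centres.

    Rotating the image by the opposite angle makes the eye line horizontal
    (a hypothetical alignment rule, not confirmed by the article).
    """
    lx, ly = centroid([landmarks[i] for i in IMAGE_LEFT_EYE])
    rx, ry = centroid([landmarks[i] for i in IMAGE_RIGHT_EYE])
    return math.degrees(math.atan2(ry - ly, rx - lx))


def bounding_box(landmarks, margin=0):
    """Axis-aligned crop box (left, top, right, bottom) around all landmarks."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)
```
</preformat>
<p>Here <italic>landmarks</italic> is expected to be a list of 68 (x, y) pairs, for instance converted from the output of a landmark detector.</p>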
<p>According to <xref ref-type="bibr" rid="B9">Ekman and Friesen (1978</xref>), a facial MiE can be represented with the Facial Action Coding System (FACS) by a combination of AUs. These AUs are mainly distributed in six regions (the left and the right eyes &#x2b; eyebrows, the nose, the left and the right cheeks, and the mouth), as shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Facial regions and corresponding AUs: we focus on the local region where characteristics of MiE appear.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Regions of interest (ROI)</th>
<th align="center">AUs</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">1 &#x26; 2: eyes &#x2b; eyebrows</td>
<td align="center">1,2,4,7</td>
</tr>
<tr>
<td align="left">3: nose</td>
<td align="center">9</td>
</tr>
<tr>
<td align="left">4 &#x26; 5: cheeks</td>
<td align="center">6</td>
</tr>
<tr>
<td align="left">6: mouth</td>
<td align="center">10,12,14,15,25</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To find the active location of the MiEs and their corresponding emotion label, <xref ref-type="bibr" rid="B35">Zhao and Xu (2019)</xref> used a random forest algorithm on the combination of optical flow&#x2019;s histogram with LBP-TOP&#x2019;s histogram. The result is depicted in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Regions of interest and the corresponding emotions. Image reproduced from the SAMM database with permission from <xref ref-type="bibr" rid="B5">Davison A. K et al. (2018</xref>).</p>
</caption>
<graphic xlink:href="frsip-02-861469-g002.tif"/>
</fig>
<p>After the localization of the ROIs, they are cropped from the entire face. Then, their size is normalized to a predefined size for each region. <xref ref-type="table" rid="T2">Table 2</xref> shows the size by region on each dataset and the average size among the different databases used in our experiments.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>The dimension for each region on each dataset and the mean between the three databases (CASME II, SAMM, SMIC).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">ROI</th>
<th align="center">SMIC</th>
<th align="center">CASME II</th>
<th align="center">SAMM</th>
<th align="center">Normalized Size</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">1 &#x26; 2</td>
<td align="center">68 &#xd7; 72</td>
<td align="center">80 &#xd7; 100</td>
<td align="char" char="&#xd7;">98 &#xd7; 134</td>
<td align="center">81 &#xd7; 102</td>
</tr>
<tr>
<td align="left">3</td>
<td align="center">68 &#xd7; 82</td>
<td align="center">80 &#xd7; 120</td>
<td align="char" char="&#xd7;">98 &#xd7; 160</td>
<td align="center">81 &#xd7; 120</td>
</tr>
<tr>
<td align="left">4 &#x26; 5</td>
<td align="center">48 &#xd7; 40</td>
<td align="center">60 &#xd7; 60</td>
<td align="char" char="&#xd7;">74 &#xd7; 80</td>
<td align="center">60 &#xd7; 60</td>
</tr>
<tr>
<td align="left">6</td>
<td align="center">50 &#xd7; 106</td>
<td align="center">60 &#xd7; 160</td>
<td align="char" char="&#xd7;">72 &#xd7; 214</td>
<td align="center">60 &#xd7; 160</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Next, each region is divided into <italic>m</italic> equal patches. Note that our method differs from those of <xref ref-type="bibr" rid="B36">Zhao and Xu (2018</xref>) and <xref ref-type="bibr" rid="B35">Zhao and Xu (2019</xref>) in that we get the patches from the six ROIs, not from the entire face. Precisely, we have <italic>m</italic>&#x2217;6 patches, whose sizes differ depending on the size of the region. Thus, a reshape is applied to fit the CNN input architecture. An ablation study of the number of patches is presented in <xref ref-type="table" rid="T3">Table 3</xref>. It tests the performance of the model using different numbers of patches on the mixed dataset of SAMM and CASME II for the five-AU classification task. Further details on the mixed database are presented in the <xref ref-type="sec" rid="s9">Supplementary Material</xref>. The study shows that <italic>m</italic> &#x3d; 9 is the best choice and outperforms the other choices on four different metrics: accuracy, F1-score, UAR, and UF1. For additional proof of concept, the confusion matrices are presented in the <xref ref-type="sec" rid="s9">Supplementary Material</xref>.</p>
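<p>As a minimal sketch of this splitting step, and not the implementation used in this work, the following Python snippet cuts one region into a grid of equal, non-overlapping patches. The function name <italic>split_into_patches</italic> is hypothetical, reading the 81 &#xd7; 102 size of Table 2 as rows &#xd7; columns is an assumption, and pixels left over when a side is not divisible by the grid size are simply dropped here.</p>
<preformat>
```python
import math


def split_into_patches(region, m):
    """Split a 2D region (list of rows) into m equal, non-overlapping patches.

    m must be a perfect square so that the region is cut into a
    sqrt(m) x sqrt(m) grid; rows and columns that do not fit are dropped.
    """
    grid = math.isqrt(m)
    if grid * grid != m:
        raise ValueError("m must be a perfect square")
    height, width = len(region), len(region[0])
    patch_h, patch_w = height // grid, width // grid
    patches = []
    for gy in range(grid):
        for gx in range(grid):
            rows = region[gy * patch_h:(gy + 1) * patch_h]
            patches.append([row[gx * patch_w:(gx + 1) * patch_w] for row in rows])
    return patches


# Normalized eyes + eyebrows region of Table 2 (81 x 102), filled with dummy values.
region = [[y * 102 + x for x in range(102)] for y in range(81)]
patches = split_into_patches(region, 9)
print(len(patches), len(patches[0]), len(patches[0][0]))
```
</preformat>
<p>With <italic>m</italic> &#x3d; 9, this region yields nine patches of 27 &#xd7; 34 values each; repeating the call for the six ROIs gives the <italic>m</italic>&#x2217;6 patches mentioned above, each of which would still have to be reshaped to the CNN input size.</p>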
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Ablation study of the number of patches. The proposed model is trained and evaluated using the LOSO-CV protocol on a mixed dataset of SAMM and CASME II for the 5-AU classification task. <italic>m</italic> is restricted to perfect squares, and its maximum is 16 because of memory limitations. &#x2a; The batch size is reduced to 32 instead of the 128 used in the rest of the experiments.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">m</th>
<th align="center">Accuracy</th>
<th align="center">F1-score</th>
<th align="center">UAR</th>
<th align="center">UF1</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">1</td>
<td align="char" char=".">0.8493</td>
<td align="char" char=".">0.8389</td>
<td align="char" char=".">0.8190</td>
<td align="char" char=".">0.8110</td>
</tr>
<tr>
<td align="left">4</td>
<td align="char" char=".">0.8503</td>
<td align="char" char=".">0.8400</td>
<td align="char" char=".">0.8193</td>
<td align="char" char=".">0.8126</td>
</tr>
<tr>
<td align="left">9</td>
<td align="char" char=".">
<bold>0.8954</bold>
</td>
<td align="char" char=".">
<bold>0.8916</bold>
</td>
<td align="char" char=".">
<bold>0.8317</bold>
</td>
<td align="char" char=".">
<bold>0.8399</bold>
</td>
</tr>
<tr>
<td align="left">16&#x2a;</td>
<td align="char" char=".">0.8646</td>
<td align="char" char=".">0.8621</td>
<td align="char" char=".">0.8276</td>
<td align="char" char=".">0.8210</td>
</tr>
</tbody>
</table>
</table-wrap>
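<p>Because the MiE classes are imbalanced, UAR and UF1 complement the accuracy by giving every class the same weight. The sketch below assumes the usual definitions (UAR as the mean of the per-class recalls, UF1 as the mean of the per-class F1-scores); the function name <italic>uar_and_uf1</italic> and the toy labels are hypothetical and are not taken from the experiments.</p>
<preformat>
```python
def uar_and_uf1(y_true, y_pred):
    """Return (UAR, UF1) for two equal-length label sequences.

    UAR is the mean of the per-class recalls and UF1 is the mean of the
    per-class F1-scores, so every class counts equally whatever its size.
    Classes are taken from y_true, so each class has at least one sample.
    """
    classes = sorted(set(y_true))
    recalls, f1_scores = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        recalls.append(tp / (tp + fn))
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(recalls) / len(classes), sum(f1_scores) / len(classes)


# Toy example with a majority class "a" and a minority class "b".
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b", "b", "a"]
uar, uf1 = uar_and_uf1(y_true, y_pred)
print(uar, uf1)
```
</preformat>
<p>In this toy example, four of the six samples are correct (accuracy of about 0.67), whereas UAR and UF1 are both 0.625 because the weaker minority class weighs as much as the majority class.</p>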
</sec>
<sec id="s3-2">
<title>3.2 Spatial Features Extraction</title>
<p>With the preprocessing step complete and the data ready to be fed into the network, we introduce the spatial model for features extraction from each region. The proposed model is visualized in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>The proposed model for spatial features extraction from the left eye &#x2b; eyebrow region. The model contains two main parts: the extraction of features from patch <italic>P</italic>
<sub>
<italic>j</italic>
</sub>(<italic>r</italic>) using CNN and the fusion of features using SE. Image reproduced from the SAMM database with permission from <xref ref-type="bibr" rid="B5">Davison A. K et al. (2018</xref>).</p>
</caption>
<graphic xlink:href="frsip-02-861469-g003.tif"/>
</fig>
<p>The proposed network first encodes each patch spatially using the CNN model. This provides a deep, local, and low-resolution feature representation. Then, the following SE network fuses the features with an attention process to learn the activated patches and feeds the output to an FCL, which classifies them while reducing the dimension of the spatial features.</p>
<p>For the CNN model (<xref ref-type="fig" rid="F4">Figure 4</xref>), we used the same architecture proposed by <xref ref-type="bibr" rid="B1">Aouayeb et al. (2019</xref>) with the adaption of the input to the size of the patches. The model has a convolution layer of four filters with a size of 5 &#xd7; 5 followed by a second convolution layer of eight filters with a size of 3 &#xd7; 3. Then, a max-pooling layer with a pooling size of 2 &#xd7; 2 is employed in parallel with four convolution layers of 16 filters with sizes of 1 &#xd7; 1, 3 &#xd7; 3, 5 &#xd7; 5, and 7 &#xd7; 7, respectively. A Rectified Linear Unit (ReLU) as an activation function and a max-pooling layer with a size of 2 &#xd7; 2, to reduce the spatial dimensions, are employed after convolution operations. After that, we concatenate the output of the last parallel max-pooling layers. This model is formulated by <xref ref-type="disp-formula" rid="e1">Eq. 1</xref>. Let us denote <inline-formula id="inf1">
<mml:math id="m1">
<mml:mi>O</mml:mi>
<mml:mi>u</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> as the output of each patch <italic>P</italic>
<sub>
<italic>j</italic>
</sub>(<italic>r</italic>) from the region <italic>r</italic> of the frame <italic>F</italic>
<sub>
<italic>j</italic>
</sub>, <inline-formula id="inf2">
<mml:math id="m2">
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:msubsup>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> as the convolution operation with <italic>&#x201c;a&#x201d;</italic> filters of size <italic>b</italic> &#xd7; <italic>b</italic> followed by ReLU <inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:msubsup>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>I</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and <italic>maxP</italic> as the max-pooling layer:<disp-formula id="e1">
<mml:math id="m4">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>H</mml:mi>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x3d;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:msubsup>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>8</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:msubsup>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>4</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>O</mml:mi>
<mml:mi>u</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x3d;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:msubsup>
<mml:mrow>
<mml:mi>v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>16</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mn>0,1,3,5,7</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(1)</label>
</disp-formula>
</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>The CNN architecture proposed by <xref ref-type="bibr" rid="B1">Aouayeb et al. (2019</xref>). It is employed to extract very local features from patches.</p>
</caption>
<graphic xlink:href="frsip-02-861469-g004.tif"/>
</fig>
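<p>As a rough illustration of Eq. 1, the following Python sketch tracks only the tensor shapes through the patch CNN. It assumes "same" padding, non-overlapping 2 &#xd7; 2 max-pooling, and a hypothetical 32 &#xd7; 32 single-channel patch; none of these values is stated in the text, and the function names are illustrative. Under these assumptions, the identity branch keeps the 8 channels of <italic>H</italic> and the four convolutional branches contribute 16 channels each.</p>
<preformat>
```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def conv(shape: Shape, filters: int, kernel: int) -> Shape:
    """Conv with `filters` kernels of size kernel x kernel, 'same' padding (assumed).

    A kernel size of 0 stands for the identity branch of Eq. 1.
    """
    h, w, c = shape
    if kernel == 0:
        return (h, w, c)  # identity keeps the input channels
    return (h, w, filters)


def max_pool(shape: Shape, pool: int = 2) -> Shape:
    """Non-overlapping max-pooling; the pool size of 2 is an assumption."""
    h, w, c = shape
    return (h // pool, w // pool, c)


def patch_cnn_shape(patch: Shape) -> Shape:
    """Output shape of Eq. 1 for a single patch."""
    # H = maxP(Conv(8 filters, 3x3)(maxP(Conv(4 filters, 5x5)(patch))))
    hidden = max_pool(conv(max_pool(conv(patch, 4, 5)), 8, 3))
    # Five parallel branches with kernel sizes 0 (identity), 1, 3, 5, 7
    branches: List[Shape] = [max_pool(conv(hidden, 16, k)) for k in (0, 1, 3, 5, 7)]
    channels = sum(b[2] for b in branches)  # concatenation along the channel axis
    return (branches[0][0], branches[0][1], channels)


# Hypothetical 32 x 32 grayscale patch
example_shape = patch_cnn_shape((32, 32, 1))
```
</preformat>
<p>The sketch is meant only to make the nesting order of Eq. 1 explicit; the actual spatial sizes depend on the patch size and padding used in the original implementation.</p>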
<p>The outputs of the nine patches are concatenated and fused using SE (<xref ref-type="bibr" rid="B12">Hu et al., 2018</xref>), as depicted in <xref ref-type="fig" rid="F3">Figure 3</xref>. A detailed illustration of the SE network is shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. The squeeze and excitation block mainly contains two operations:<list list-type="simple">
<list-item>
<p>1) The squeeze operation (<xref ref-type="disp-formula" rid="e2">Eq. 2</xref>): the input is compressed by global average pooling from (<italic>H</italic>, <italic>W</italic>, <italic>F</italic>) to (1, 1, <italic>F</italic>), and the result is fed to an FCL (or, equivalently, a 1 &#xd7; 1 convolutional layer). The FCL has <italic>se</italic>.<italic>F</italic> neurons, where <italic>se</italic> &#x3c; 1 is the SE parameter, and uses ReLU as its activation function:</p>
</list-item>
</list>
<disp-formula id="e2">
<mml:math id="m5">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>G</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mo>.</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>W</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>u</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>s</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>z</mml:mi>
<mml:mi>e</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x3d;</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>U</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
<mml:mi>G</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(2)</label>
</disp-formula>where <inline-formula id="inf4">
<mml:math id="m6">
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>t</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>W</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>F</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, <italic>A</italic>
<sub>1</sub> and <italic>B</italic>
<sub>1</sub> are, respectively, the weight matrix and the bias vector of the FCL, and <italic>GAP</italic> denotes the global average pooling layer.<list list-type="simple">
<list-item>
<p>2) The excitation operation (<xref ref-type="disp-formula" rid="e3">Eq. 3</xref>), which is a simple FCL (or 1 &#xd7; 1 convolutional layer) with <italic>F</italic> neurons followed by a sigmoid activation: the purpose of the excitation is to generate a weight for each feature channel. In our case, the feature channels represent the spatial features extracted from each patch <italic>P</italic>
<sub>
<italic>j</italic>
</sub>(<italic>r</italic>):</p>
</list-item>
</list>
<disp-formula id="e3">
<mml:math id="m7">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>e</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>z</mml:mi>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>A</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>z</mml:mi>
<mml:mi>e</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(3)</label>
</disp-formula>where <italic>A</italic>
<sub>2</sub> and <italic>B</italic>
<sub>2</sub> are the weight matrix and the bias vector of this FCL. Finally, the feature maps <italic>FM</italic> are multiplied channel-wise by the weights generated by the excitation:<disp-formula id="e4">
<mml:math id="m8">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>F</mml:mi>
<mml:mi>M</mml:mi>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x3d;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mi>u</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2026;</mml:mo>
<mml:mn>9</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>SE</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mo>&#x3d;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mi>q</mml:mi>
<mml:mi>u</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>z</mml:mi>
<mml:mi>e</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x22c5;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>M</mml:mi>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Squeeze and excitation (SE) block structure.</p>
</caption>
<graphic xlink:href="frsip-02-861469-g005.tif"/>
</fig>
<p>A more thorough description of the SE architecture and its effectiveness can be found in <xref ref-type="bibr" rid="B12">Hu et al. (2018</xref>).</p>
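<p>The squeeze and excitation steps of Eqs 2&#x2013;4 can be sketched in a few lines of NumPy. This is a minimal illustration with random weights, not the trained block: the feature-map size, the value <italic>se</italic> = 0.25, and all variable names are hypothetical choices made for the example.</p>
<preformat>
```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block(fm, a1, b1, a2, b2):
    """Apply Eqs. 2-4 to a feature map fm of shape (H, W, F)."""
    gap = fm.mean(axis=(0, 1))                # Eq. 2, first row: (H, W, F) to (F,)
    squeeze = relu(a1 @ gap + b1)             # Eq. 2, second row: (se * F,)
    excitation = sigmoid(a2 @ squeeze + b2)   # Eq. 3: one weight per channel, (F,)
    return excitation * fm                    # Eq. 4: channel-wise rescaling


# Hypothetical sizes and SE parameter, chosen only for illustration
rng = np.random.default_rng(0)
H, W, F, se = 4, 4, 72, 0.25
hidden = int(se * F)
fm = rng.normal(size=(H, W, F))
a1, b1 = rng.normal(size=(hidden, F)), np.zeros(hidden)
a2, b2 = rng.normal(size=(F, hidden)), np.zeros(F)
out = se_block(fm, a1, b1, a2, b2)            # same shape as fm
```
</preformat>
<p>Because the sigmoid output lies between 0 and 1, each channel of the feature map is attenuated or kept, never amplified, which is how the block expresses the relative importance of each channel.</p>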
<p>After the SE operations, we integrate a global average pooling layer and two FCLs with 2048 and 256 neurons, respectively, and <italic>ReLU</italic> as the activation, to reduce the dimension of the spatial features. A final FCL with the <italic>softmax</italic> function performs the classification. Furthermore, a dropout of 0.5 is used after each FCL to reduce overfitting.</p>
<p>After training the spatial model, we save the output of the last <italic>ReLU</italic> function applied on the FCL with 256 neurons, as the spatial features <italic>SF</italic>
<sub>
<italic>j</italic>
</sub>(<italic>r</italic>) (<xref ref-type="disp-formula" rid="e5">Eq. 5</xref>) extracted from region <italic>r</italic> at frame <italic>F</italic>
<sub>
<italic>j</italic>
</sub>. At this point, each MiE sequence is transformed into six sequences of local spatial features (one for each ROI):<disp-formula id="e5">
<mml:math id="m9">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>S</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>F</mml:mi>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>256</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2048</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>G</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>E</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
</sec>
<sec id="s3-3">
<title>3.3 Spatiotemporal Features Extraction and Classification</title>
<p>The temporal aspect of MiE is important for automatic MER systems. This section describes the temporal model shown in <xref ref-type="fig" rid="F6">Figure 6</xref>. First, zero-padding is applied so that all sequences of spatial features in a batch have a given standard length <italic>N</italic>. Then, an LSTM with 64 units is applied on each sequence {<italic>SF</italic>
<sub>
<italic>j</italic>
</sub>(<italic>r</italic>), <italic>j</italic> &#x2208; [1 ... <italic>N</italic>]}, followed by a leaky Rectified Linear Unit (LeakyReLU) as activation and a dropout of 0.2. For each region <italic>r</italic>, the output of the LSTM is taken as the spatiotemporal features <italic>STF</italic>(<italic>r</italic>), given by <disp-formula id="e6">
<mml:math id="m10">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>64</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2026;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Temporal model architecture: the LSTM is employed to extract spatiotemporal features and SE to learn attention of active regions.</p>
</caption>
<graphic xlink:href="frsip-02-861469-g006.tif"/>
</fig>
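<p>The zero-padding step that precedes the LSTM can be sketched as follows. This is a minimal NumPy illustration, assuming that zeros are appended after the last frame and that longer sequences are truncated; the text does not specify either choice, and the function name is hypothetical.</p>
<preformat>
```python
import numpy as np


def pad_sequences(sequences, length):
    """Zero-pad (or truncate) each (T_i, D) sequence to (length, D)."""
    dim = sequences[0].shape[1]
    batch = np.zeros((len(sequences), length, dim), dtype=np.float32)
    for i, seq in enumerate(sequences):
        steps = min(len(seq), length)
        batch[i, :steps] = seq[:steps]   # remaining time steps stay at zero
    return batch


# Two hypothetical sequences of 256-dimensional spatial features
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(3, 256)), rng.normal(size=(5, 256))]
batch = pad_sequences(sequences, length=5)   # shape (2, 5, 256)
```
</preformat>
<p>The resulting array has one row per sequence and a fixed number of time steps, which is the layout a recurrent layer typically expects for batched input.</p>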
<p>After that, we integrate another SE block to fuse the spatiotemporal features of the six regions and to learn which regions to activate for each MiE sequence. The output <italic>STF</italic> of the SE block is given by <disp-formula id="e7">
<mml:math id="m11">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>SE</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>t</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2026;</mml:mo>
<mml:mn>6</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(7)</label>
</disp-formula>
</p>
<p>The final step is classification. In this model, a simple neural network (NN) is applied. It contains an FCL with 256 neurons and LeakyReLU as an activation function, followed by a dropout of 0.5 and then another FCL with <italic>K</italic> neurons and softmax as an activation function, where <italic>K</italic> represents the number of classes. The system thus provides, for each MiE sequence <italic>S</italic>, a set of <italic>K</italic> probabilities <italic>P</italic>(<italic>S</italic>) given by <disp-formula id="e8">
<mml:math id="m12">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>256</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>F</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(8)</label>
</disp-formula>
</p>
</sec>
<sec id="s3-4">
<title>3.4 Architecture Details</title>
<p>This section provides details on the input, hyperparameters, and loss function used in the proposed solution. The input image for the spatial model has pixel values in the [0, 255] range, which are rescaled to the range [0, 1]. The input sequence of spatial features for the temporal model is standardized so that it has a mean of 0 and a standard deviation of 1. Moreover, all layers are initialized with random values drawn from a normal distribution with a mean of 0 and a standard deviation of 1.</p>
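<p>The two input transformations described above can be illustrated with a short NumPy sketch. It assumes that the standardization statistics are computed per feature dimension over the rows of the given array, and it adds a small constant <italic>eps</italic> to avoid division by zero; both are assumptions made for the example, as are the function names.</p>
<preformat>
```python
import numpy as np


def rescale_image(image):
    """Map 8-bit pixel values from [0, 255] to [0, 1]."""
    return image.astype(np.float32) / 255.0


def standardize_features(features, eps=1e-8):
    """Shift and scale each feature dimension to zero mean and unit std.

    features: array of shape (num_samples, num_features).
    eps is a hypothetical safeguard against a zero standard deviation.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)


# Hypothetical inputs: a tiny image and 100 spatial feature vectors of size 256
image = np.array([[0, 255], [51, 102]], dtype=np.uint8)
scaled = rescale_image(image)
features = np.random.default_rng(0).normal(3.0, 2.0, size=(100, 256))
standardized = standardize_features(features)
```
</preformat>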
<p>To train the spatial model, or the temporal model with the classification network, a focal loss (<xref ref-type="bibr" rid="B16">Lin et al., 2018</xref>) is used. It is given by <disp-formula id="e9">
<mml:math id="m13">
<mml:mtable class="aligned">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3b1;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2061;</mml:mo>
<mml:mi>log</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<label>(9)</label>
</disp-formula>where <italic>L</italic>
<sub>
<italic>FL</italic>
</sub> denotes the focal loss, <italic>&#x3b1;</italic>
<sub>
<italic>i</italic>
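</sub> &#x2208; [0, 1] is a weighting factor for class <italic>i</italic>, set by inverse class frequency to compensate for the imbalance between classes, and <italic>&#x3b3;</italic> &#x2265; 0 is the focusing parameter, often set to 2. Here, <italic>y</italic><sub><italic>i</italic></sub> is the ground-truth indicator and <italic>p</italic><sub><italic>i</italic></sub> the predicted probability for class <italic>i</italic>. The role of the <inline-formula id="inf5">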
<mml:math id="m14">
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> factor is to balance the loss between hard and easy samples by down-weighting those that are already well classified.</p>
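<p>Eq. 9 can be sketched in NumPy as follows. The normalization of the inverse-frequency weights so that they sum to 1, the averaging over the batch, the clipping constant, and the class counts in the example are assumptions made for illustration; the text only states that the weights are set by inverse class frequency.</p>
<preformat>
```python
import numpy as np


def inverse_frequency_weights(class_counts):
    """alpha_i proportional to 1 / n_i, normalised to sum to 1 (assumed scheme)."""
    inv = 1.0 / np.asarray(class_counts, dtype=np.float64)
    return inv / inv.sum()


def focal_loss(y_true, p_pred, alpha, gamma=2.0, eps=1e-7):
    """Focal loss of Eq. 9, averaged over the batch (averaging is an assumption).

    y_true: one-hot labels, shape (B, K).
    p_pred: predicted class probabilities, shape (B, K).
    alpha:  per-class weights, shape (K,).
    """
    p = np.clip(p_pred, eps, 1.0)  # avoid log(0)
    per_class = alpha * y_true * (1.0 - p) ** gamma * np.log(p)
    return float(-per_class.sum(axis=1).mean())


# Illustrative class counts for a three-class problem
alpha = inverse_frequency_weights([250, 109, 83])
y = np.array([[1.0, 0.0, 0.0]])
easy = focal_loss(y, np.array([[0.9, 0.05, 0.05]]), alpha)
hard = focal_loss(y, np.array([[0.3, 0.35, 0.35]]), alpha)
```
</preformat>
<p>With <italic>&#x3b3;</italic> = 0 and unit weights the expression reduces to the usual cross-entropy; with <italic>&#x3b3;</italic> = 2, a confidently correct prediction contributes far less to the loss than an uncertain one.</p>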
<p>The optimizer is Adam, with a learning rate of 1<italic>e</italic> &#x2212; 4 for training the spatial model and 5<italic>e</italic> &#x2212; 5 for training the temporal model with the classification network. The implementation uses the Tensorflow-gpu 1.12.0 library, and all experiments are performed on a GPU cluster (GeForce GTX 1080 Ti GPU 32&#xa0;GB memory).</p>
</sec>
</sec>
<sec id="s4">
<title>4 Experiments and Comparison</title>
<p>In this section, we experimentally evaluate our contributions. We start with a brief introduction of the datasets and the evaluation methodology used in the 2nd Micro-expression Grand Challenge (MEGC) (Section 4.1). Then, we ablate the various design choices in the proposed architecture to assess the contribution of each (Sections 4.2.1 and 4.2.2). Finally, we compare our solution to state-of-the-art solutions (Section 4.2.3).</p>
<sec id="s4-1">
<title>4.1 Databases and Evaluation Methodology</title>
<sec id="s4-1-1">
<title>4.1.1 Databases</title>
<p>The three datasets used are CASME II (<xref ref-type="bibr" rid="B30">Yan et al., 2014</xref>), SAMM (<xref ref-type="bibr" rid="B5">Davison A. K et al., 2018</xref>), and SMIC (<xref ref-type="bibr" rid="B15">Li et al., 2013</xref>). In addition to these three databases, a fourth one called FULL is introduced in MEGC (<xref ref-type="bibr" rid="B27">See et al., 2019</xref>) by merging the three of them.</p>
</sec>
<sec id="s4-1-2">
<title>4.1.2 SMIC</title>
<p>The Spontaneous Micro-Expression (SMIC) dataset comes in three versions recorded with three different cameras: a high-speed (HS) camera at 100 frames per second (fps) and two cameras at 25 fps covering the visual (VIS) and near-infrared (NIR) light ranges, respectively. In all experiments, we used the SMIC-HS version, which features 164 clips from 16 distinct persons. SMIC-HS provides sequences with a face resolution of (190 &#xd7; 230) that fall into only three categories: negative, positive, and surprise.</p>
</sec>
<sec id="s4-1-3">
<title>4.1.3 CASME II</title>
<p>The Chinese Academy of Sciences Micro-Expressions II (CASME II) dataset contains 247 sequences<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref> of spontaneous MiE from 35 people, comprising five emotion categories (happiness, disgust, repression, surprise, and sadness) and an Other category. The sequences have high temporal and spatial resolutions of 200 fps and (280 &#xd7; 340), respectively.</p>
</sec>
<sec id="s4-1-4">
<title>4.1.4 SAMM</title>
<p>The Spontaneous Micro-Facial Movement (SAMM) dataset offers the greatest ethnic diversity (13 ethnicities) and the widest age range. Disgust, surprise, happiness, fear, anger, contempt, and sadness are the seven main types of emotion depicted in the video sequences, which are captured with a high-resolution camera at 200 fps. The database includes a total of 159 spontaneous facial MiE sequences from 32 people. Among the three datasets, it has the highest spatial resolution (400 &#xd7; 400 pixels). Furthermore, the focus of this dataset is on the objective AU labels rather than the emotional labels. Therefore, all of the sequences are FACS-coded and include the Onset, Apex, and Offset frames.</p>
</sec>
<sec id="s4-1-5">
<title>4.1.5 FULL</title>
<p>The FULL dataset contains 442 sequences with three classes: &#x201c;negative,&#x201d; &#x201c;positive,&#x201d; and &#x201c;surprise.&#x201d; It is introduced as a form of data augmentation.</p>
<p>All datasets used in the experiments are summarized in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Distribution of classes according to the MEGC conditions (<xref ref-type="bibr" rid="B27">See et al., 2019</xref>).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Emotion class</th>
<th align="center">SMIC</th>
<th align="center">CASME II</th>
<th align="center">SAMM</th>
<th align="center">FULL</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Negative</td>
<td align="char" char=".">70</td>
<td align="char" char=".">88<sup>&#x2020;</sup>
</td>
<td align="char" char=".">92<sup>&#x2295;</sup>
</td>
<td align="char" char=".">250</td>
</tr>
<tr>
<td align="left">Positive</td>
<td align="char" char=".">51</td>
<td align="char" char=".">32</td>
<td align="char" char=".">26</td>
<td align="char" char=".">109</td>
</tr>
<tr>
<td align="left">Surprise</td>
<td align="char" char=".">43</td>
<td align="char" char=".">25</td>
<td align="char" char=".">15</td>
<td align="char" char=".">83</td>
</tr>
<tr>
<td align="left">Total</td>
<td align="char" char=".">164</td>
<td align="char" char=".">145</td>
<td align="char" char=".">133</td>
<td align="char" char=".">442</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<sup>&#x2020;</sup>Negative class of CASME II consists of samples from its original emotion classes of disgust and repression.</p>
</fn>
<fn>
<p>
<sup>&#x2295;</sup>Negative class of SAMM consists of samples from its original emotion classes of anger, contempt, disgust, fear, and sadness.</p>
</fn>
</table-wrap-foot>
</table-wrap>
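<p>As a quick consistency check on Table 4, the FULL column can be rebuilt by summing the per-class counts of the three source datasets. The short Python sketch below does this and computes the share of each class; the variable and function names are illustrative.</p>
<preformat>
```python
# Class counts copied from Table 4 (three-class MEGC setting)
COUNTS = {
    "SMIC": {"negative": 70, "positive": 51, "surprise": 43},
    "CASME II": {"negative": 88, "positive": 32, "surprise": 25},
    "SAMM": {"negative": 92, "positive": 26, "surprise": 15},
}


def merge_counts(counts):
    """Sum per-class counts over all datasets to obtain the FULL column."""
    full = {}
    for per_class in counts.values():
        for label, n in per_class.items():
            full[label] = full.get(label, 0) + n
    return full


full = merge_counts(COUNTS)
total = sum(full.values())
shares = {label: n / total for label, n in full.items()}
```
</preformat>
<p>The negative class accounts for more than half of the merged samples, which illustrates the class imbalance of the three-class setting.</p>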
</sec>
<sec id="s4-1-6">
<title>4.1.6 Evaluation Methodology</title>
<p>To evaluate the proposed solution, Leave-One-Subject-Out Cross-Validation (LOSO-CV) is used as the protocol for splitting the data into training and test sets. Data are divided per subject: in each fold, training is conducted on Z-1 subjects and testing is run on the remaining subject, where Z is the total number of subjects. The metrics used to evaluate the system are the accuracy, the Unweighted Average Recall (UAR), and the Unweighted F1-score (UF1). The UF1 and UAR are computed by <xref ref-type="disp-formula" rid="e10">Eq. 10</xref> and <xref ref-type="disp-formula" rid="e11">Eq. 11</xref>, respectively. Both metrics are used with LOSO-CV because they are better suited to an imbalanced classification problem (<xref ref-type="bibr" rid="B27">See et al., 2019</xref>):<disp-formula id="e10">
<mml:math id="m15">
<mml:mi>U</mml:mi>
<mml:mi>F</mml:mi>
<mml:mn>1</mml:mn>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>F</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>T</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>T</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>F</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:math>
<label>(10)</label>
</disp-formula>where <italic>TP</italic>
<sub>
<italic>c</italic>
</sub>, <italic>FP</italic>
<sub>
<italic>c</italic>
</sub>, and <italic>FN</italic>
<sub>
<italic>c</italic>
</sub> are, respectively, the numbers of true positives, false positives, and false negatives of class <italic>c</italic>, and <italic>C</italic> is the number of classes:<disp-formula id="e11">
<mml:math id="m16">
<mml:mi>U</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>R</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>A</mml:mi>
<mml:mi>C</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mfrac>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:math>
<label>(11)</label>
</disp-formula>where <italic>ACC</italic>
<sub>
<italic>c</italic>
</sub> and <italic>N</italic>
<sub>
<italic>c</italic>
</sub> are, respectively, the accuracy rate and the number of samples of class <italic>c</italic>.</p>
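<p>The LOSO-CV protocol and the two metrics of Eqs 10 and 11 can be sketched in plain Python as follows. The sketch assumes that the predictions of all folds are pooled before the per-class counts are computed, and that a class with no samples contributes 0 to both averages; the function names and the toy labels are hypothetical.</p>
<preformat>
```python
from collections import defaultdict


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for held_out, test_idx in by_subject.items():
        train_idx = [i for s, idx in by_subject.items() if s != held_out for i in idx]
        yield held_out, train_idx, test_idx


def uf1_uar(y_true, y_pred, num_classes):
    """Unweighted F1-score (Eq. 10) and unweighted average recall (Eq. 11)."""
    f1_sum, recall_sum = 0.0, 0.0
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        n_c = tp + fn                      # number of samples of class c
        denom = 2 * tp + fp + fn
        f1_sum += 2 * tp / denom if denom else 0.0
        recall_sum += tp / n_c if n_c else 0.0
    return f1_sum / num_classes, recall_sum / num_classes


# Toy example: five clips from three subjects, six pooled predictions
folds = list(loso_splits(["s1", "s1", "s2", "s3", "s3"]))
uf1, uar = uf1_uar([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
```
</preformat>
<p>Because both metrics average per-class scores with equal weight, a classifier that favors the majority class is penalized even when its overall accuracy is high.</p>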
</sec>
</sec>
<sec id="s4-2">
<title>4.2 Results and Analysis</title>
<sec id="s4-2-1">
<title>4.2.1 Contribution of Spatial and Temporal Models</title>
<p>The proposed method involves two stages of fusion, in space and in time. To validate the use of the two fusion blocks, we test the solution with and without them under the MEGC conditions with LOSO-CV. The performance in terms of UAR, UF1, and accuracy is summarized in <xref ref-type="table" rid="T5">Table 5</xref>. These results demonstrate the efficiency of each fusion unit. The model with the two SE fusion blocks outperforms the base solution without any fusion, as well as the models with only a spatial or only a temporal fusion, with 3% more in UAR and 1% to 4% more in UF1 depending on the baseline (almost 3% on average). One can observe a gain of 3% in UF1 and 4% in accuracy with the spatial fusion compared to the basic solution, which clearly supports the use of small patches instead of the regions or the whole face. The spatial fusion also has a more positive impact on the result than the temporal fusion, with 2% more in UF1 and 4% more in accuracy. This can be explained by the fact that the basic solution already contains a fusion of the LSTM outputs, through a simple concatenation followed by an FCL, but no fusion of spatial features.</p>
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Performance of the spatial and temporal fusion blocks.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">(MEGC, LOSO-CV)</th>
<th align="center">UAR</th>
<th align="center">UF1</th>
<th align="center">Accuracy</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<xref ref-type="bibr" rid="B1">Aouayeb et al. (2019)</xref>
</td>
<td align="char" char=".">0.90</td>
<td align="char" char=".">0.90</td>
<td align="char" char=".">0.92</td>
</tr>
<tr>
<td align="left">Spatial fusion</td>
<td align="char" char=".">0.90</td>
<td align="char" char=".">0.93</td>
<td align="char" char=".">0.96</td>
</tr>
<tr>
<td align="left">Temporal fusion</td>
<td align="char" char=".">0.90</td>
<td align="char" char=".">0.91</td>
<td align="char" char=".">0.92</td>
</tr>
<tr>
<td align="left">Spatiotemporal fusion</td>
<td align="char" char=".">
<bold>0.93</bold>
</td>
<td align="char" char=".">
<bold>0.94</bold>
</td>
<td align="char" char=".">
<bold>0.96</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-2-2">
<title>4.2.2 Impact of Learning With ROI Labels</title>
<p>
<xref ref-type="bibr" rid="B1">Aouayeb et al. (2019</xref>) suggested using a customized label for each region to train the spatial model. To demonstrate the effectiveness of this contribution, we compare the proposed model trained with the labels provided for the whole face against the same model trained with the label assigned to each region following <xref ref-type="bibr" rid="B1">Aouayeb et al. (2019</xref>). <xref ref-type="table" rid="T6">Table 6</xref> shows that the solution with customized labels for each region performs better, because region-specific labels help the spatial model to train more efficiently by focusing on a local region.</p>
<table-wrap id="T6" position="float">
<label>TABLE 6</label>
<caption>
<p>Performance of using the customized label for each region to train the spatial model.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">(MEGC, LOSO-CV)</th>
<th align="center">UAR</th>
<th align="center">UF1</th>
<th align="center">Accuracy</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Label of the whole face</td>
<td align="char" char=".">0.82</td>
<td align="char" char=".">0.82</td>
<td align="char" char=".">0.83</td>
</tr>
<tr>
<td align="left">Label based on region</td>
<td align="char" char=".">
<bold>0.93</bold>
</td>
<td align="char" char=".">
<bold>0.94</bold>
</td>
<td align="char" char=".">
<bold>0.96</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-2-3">
<title>4.2.3 Comparison to the State-of-the-Art</title>
<p>
<xref ref-type="table" rid="T7">Table 7</xref> shows that, on the FULL database, the proposed model improves on the basic architecture by 3% in UAR and 4% in UF1. A closer look shows that the SAMM subset benefits the most, with gains of 8% in UAR and 14% in UF1.</p>
<table-wrap id="T7" position="float">
<label>TABLE 7</label>
<caption>
<p>Performance on the three-class MEGC protocol with LOSO-CV.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th colspan="2" align="center">
<xref ref-type="bibr" rid="B1">Aouayeb et al. (2019)</xref>
</th>
<th colspan="2" align="center">Spatiotemporal fusion</th>
</tr>
<tr>
<th align="left">Data/metrics</th>
<th align="center">UAR</th>
<th align="center">UF1</th>
<th align="center">UAR</th>
<th align="center">UF1</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">SMIC</td>
<td align="center">0.88</td>
<td align="center">0.88</td>
<td align="center">
<bold>0.91</bold>
</td>
<td align="center">
<bold>0.91</bold>
</td>
</tr>
<tr>
<td align="left">CASME II</td>
<td align="center">
<bold>0.98</bold>
</td>
<td align="center">
<bold>0.98</bold>
</td>
<td align="center">
<bold>0.98</bold>
</td>
<td align="center">
<bold>0.99</bold>
</td>
</tr>
<tr>
<td align="left">SAMM</td>
<td align="center">0.81</td>
<td align="center">0.78</td>
<td align="center">
<bold>0.89</bold>
</td>
<td align="center">
<bold>0.92</bold>
</td>
</tr>
<tr>
<td align="left">FULL</td>
<td align="center">0.90</td>
<td align="center">0.90</td>
<td align="center">
<bold>0.93</bold>
</td>
<td align="center">
<bold>0.94</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
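<p>The gains quoted for this comparison can be checked with a few lines of arithmetic. The sketch below copies the UAR and UF1 values from the table and reports the absolute difference in points between the spatiotemporal fusion model and the basic architecture. The dictionary layout and the function name are illustrative choices only.</p>
<preformat preformat-type="code">
```python
# (UAR, UF1) per database, copied from the table: basic architecture vs. fusion.
table7 = {
    "SMIC": {"base": (0.88, 0.88), "fusion": (0.91, 0.91)},
    "CASME II": {"base": (0.98, 0.98), "fusion": (0.98, 0.99)},
    "SAMM": {"base": (0.81, 0.78), "fusion": (0.89, 0.92)},
    "FULL": {"base": (0.90, 0.90), "fusion": (0.93, 0.94)},
}


def gains_in_points(table):
    """Absolute gain of the fusion model, in points (1 point = 0.01)."""
    gains = {}
    for name, row in table.items():
        (uar_base, uf1_base), (uar_new, uf1_new) = row["base"], row["fusion"]
        # round() removes floating-point noise such as 7.999999 instead of 8.
        gains[name] = (round((uar_new - uar_base) * 100),
                       round((uf1_new - uf1_base) * 100))
    return gains


gains = gains_in_points(table7)
most_improved = max(gains, key=lambda name: sum(gains[name]))
for name, (d_uar, d_uf1) in gains.items():
    print(name, "UAR +" + str(d_uar), "UF1 +" + str(d_uf1))
print("most improved:", most_improved)
```
</preformat>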
<p>As shown in <xref ref-type="table" rid="T8">Table 8</xref>, the proposed solution outperforms or matches all listed state-of-the-art works. The gain is largest over handcrafted solutions, where the UAR and UF1 metrics improve by roughly 30% to 40% in most cases, and a slight improvement is also observed over recent deep learning-based solutions. The main drawback of our solution is the complexity of the algorithm, which makes tuning the hyperparameters of the model harder.</p>
<table-wrap id="T8" position="float">
<label>TABLE 8</label>
<caption>
<p>LOSO-CV performance of the proposed method, baselines, and recent methods (&#x2a; references from the MEGC 2019 challenge). Bold: score <inline-formula id="inf6">
<mml:math id="m17">
<mml:mo>&#x2265;</mml:mo>
</mml:math>
</inline-formula> 0.90.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Models</th>
<th colspan="2" align="center">FULL</th>
<th colspan="2" align="center">SMIC-HS</th>
<th colspan="2" align="center">CASME II</th>
<th colspan="2" align="center">SAMM</th>
</tr>
<tr>
<th align="center">UF1</th>
<th align="center">UAR</th>
<th align="center">UF1</th>
<th align="center">UAR</th>
<th align="center">UF1</th>
<th align="center">UAR</th>
<th align="center">UF1</th>
<th align="center">UAR</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<xref ref-type="bibr" rid="B27">See et al. (2019)</xref>
<sup>&#x22c4;</sup>
</td>
<td align="center">0.58</td>
<td align="center">0.57</td>
<td align="center">0.20</td>
<td align="center">0.52</td>
<td align="center">0.70</td>
<td align="center">0.74</td>
<td align="center">0.39</td>
<td align="center">0.41</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B11">Guo et al. (2019)</xref>
<sup>&#x22c4;</sup>
</td>
<td align="center">0.62</td>
<td align="center">0.62</td>
<td align="center">0.57</td>
<td align="center">0.58</td>
<td align="center">0.78</td>
<td align="center">0.80</td>
<td align="center">0.52</td>
<td align="center">0.51</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B37">Zhou et al. (2019)</xref>&#x2a;<sup>&#x2020;</sup>
</td>
<td align="center">0.73</td>
<td align="center">0.72</td>
<td align="center">0.66</td>
<td align="center">0.67</td>
<td align="center">0.86</td>
<td align="center">0.85</td>
<td align="center">0.58</td>
<td align="center">0.56</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B17">Liong et al. (2019)</xref>&#x2a;<sup>&#x2020;</sup>
</td>
<td align="center">0.73</td>
<td align="center">0.76</td>
<td align="center">0.68</td>
<td align="center">0.70</td>
<td align="center">0.83</td>
<td align="center">0.86</td>
<td align="center">0.65</td>
<td align="center">0.68</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B20">Liu Y et al. (2019)</xref>&#x2a;<sup>&#x2020;</sup>
</td>
<td align="center">0.78</td>
<td align="center">0.78</td>
<td align="center">0.74</td>
<td align="center">0.75</td>
<td align="center">0.82</td>
<td align="center">0.82</td>
<td align="center">0.77</td>
<td align="center">0.71</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B4">Choi and Song (2020)</xref>
<sup>&#x2020;</sup>
</td>
<td align="center">0.77</td>
<td align="center">0.75</td>
<td align="center">0.72</td>
<td align="center">0.71</td>
<td align="center">0.87</td>
<td align="center">0.84</td>
<td align="center">0.67</td>
<td align="center">0.60</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B19">Liu et al. (2021)</xref>
<sup>&#x2020;</sup>
</td>
<td align="center">0.83</td>
<td align="center">0.83</td>
<td align="center">0.81</td>
<td align="center">0.81</td>
<td align="center">0.88</td>
<td align="center">0.89</td>
<td align="center">0.80</td>
<td align="center">0.79</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B32">Zhang et al. (2021)</xref>
<sup>&#x2020;</sup>
</td>
<td align="center">0.81</td>
<td align="center">0.79</td>
<td align="center">0.72</td>
<td align="center">0.70</td>
<td align="center">
<bold>0.90</bold>
</td>
<td align="center">0.88</td>
<td align="center">0.71</td>
<td align="center">0.64</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B34">Zhao et al. (2021)</xref>
<sup>&#x2020;</sup>
</td>
<td align="center">
<bold>0.91</bold>
</td>
<td align="center">
<bold>0.90</bold>
</td>
<td align="center">0.85</td>
<td align="center">0.85</td>
<td align="center">
<bold>0.97</bold>
</td>
<td align="center">
<bold>0.97</bold>
</td>
<td align="center">
<bold>0.91</bold>
</td>
<td align="center">0.89</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B1">Aouayeb et al. (2019)</xref> <sup>&#x2295;</sup>
</td>
<td align="center">
<bold>0.90</bold>
</td>
<td align="center">
<bold>0.90</bold>
</td>
<td align="center">0.88</td>
<td align="center">0.88</td>
<td align="center">
<bold>0.98</bold>
</td>
<td align="center">
<bold>0.98</bold>
</td>
<td align="center">0.78</td>
<td align="center">0.81</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B31">Yu et al. (2020)</xref>
<sup>&#x2295;</sup>
</td>
<td align="center">0.85</td>
<td align="center">0.84</td>
<td align="center">0.79</td>
<td align="center">0.79</td>
<td align="center">0.87</td>
<td align="center">0.86</td>
<td align="center">0.85</td>
<td align="center">0.82</td>
</tr>
<tr>
<td align="left">Ours <sup>&#x2295;</sup>
</td>
<td align="center">
<bold>0.94</bold>
</td>
<td align="center">
<bold>0.93</bold>
</td>
<td align="center">
<bold>0.91</bold>
</td>
<td align="center">
<bold>0.91</bold>
</td>
<td align="center">
<bold>0.99</bold>
</td>
<td align="center">
<bold>0.98</bold>
</td>
<td align="center">
<bold>0.92</bold>
</td>
<td align="center">0.89</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<sup>&#x22c4;</sup>Handcrafted approach.</p>
</fn>
<fn>
<p>
<sup>&#x2020;</sup>Hybrid approach.</p>
</fn>
<fn>
<p>
<sup>&#x2295;</sup>Deep learning approach.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5 Conclusion</title>
<p>In this study, we have proposed a region-based solution for MER. The proposed solution extracts spatiotemporal features using a combined CNN and LSTM architecture supported by an SE fusion unit in space and in time. The effectiveness of the architecture, of the SE fusion, and of the ROI labels has been validated. Experiments on different databases demonstrate the potential of this model, which outperforms the first solution in the MEGC and other recently proposed solutions. In future work, we will explore less complex architectures for MER that address the local nature of micro-expressions automatically.</p>
</sec>
</body>
<back>
<sec id="s6">
<title>Author Contributions</title>
<p>MA: software, methodology and conceptualization, and writing&#x2014;original draft. CS: conceptualization, methodology, and writing&#x2014;review and editing. WH: supervision and writing&#x2014;review and editing. KK: supervision and writing&#x2014;review and editing. RS: supervision.</p>
</sec>
<sec sec-type="COI-statement" id="s7">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s8">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s9">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frsip.2022.861469/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frsip.2022.861469/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet1.PDF" id="SM1" mimetype="application/PDF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Aouayeb</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hamidouche</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Kpalma</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Benazza-Benyahia</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>A Spatiotemporal Deep Learning Solution for Automatic Micro-expressions Recognition from Local Facial Regions</article-title>,&#x201d; in <conf-name>2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP)</conf-name> (<conf-loc>Pittsburgh, PA</conf-loc>), <fpage>1</fpage>&#x2013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/mlsp.2019.8918771</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ben</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.-J.</given-names>
</name>
<name>
<surname>Kpalma</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>W.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Video-based Facial Micro-expression Analysis: A Survey of Datasets, Features and Algorithms</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>, <fpage>1</fpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2021.3067464</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhong</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Three-stream Convolutional Neural Network with Squeeze-And-Excitation Block for Near-Infrared Facial Expression Recognition</article-title>. <source>Electronics</source> <volume>8</volume>, <fpage>385</fpage>. <pub-id pub-id-type="doi">10.3390/electronics8040385</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Choi</surname>
<given-names>D. Y.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>B. C.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Facial Micro-expression Recognition Using Two-Dimensional Landmark Feature Maps</article-title>. <source>IEEE Access</source> <volume>8</volume>, <fpage>121549</fpage>&#x2013;<lpage>121563</lpage>. <pub-id pub-id-type="doi">10.1109/access.2020.3006958</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Davison</surname>
<given-names>A. K.</given-names>
</name>
<name>
<surname>Lansley</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Costen</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Yap</surname>
<given-names>M. H.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>SAMM: A Spontaneous Micro-facial Movement Dataset</article-title>. <source>IEEE Trans. Affective Comput.</source> <volume>9</volume>, <fpage>116</fpage>&#x2013;<lpage>129</lpage>. <pub-id pub-id-type="doi">10.1109/taffc.2016.2573832</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Davison</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Merghani</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yap</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Objective Classes for Micro-facial Expression Recognition</article-title>. <source>J. Imaging</source> <volume>4</volume>, <fpage>119</fpage>. <pub-id pub-id-type="doi">10.3390/jimaging4100119</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Duque</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Alata</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Emonet</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Konik</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Legrand</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Mean Oriented Riesz Features for Micro Expression Classification</source>. <comment>
<italic>ArXiv</italic> abs/2005.06198</comment>. </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ekman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Friesen</surname>
<given-names>W. V.</given-names>
</name>
</person-group> (<year>1969</year>). <article-title>Nonverbal Leakage and Clues to Deception</article-title>. <source>Psychiatry</source> <volume>32</volume>, <fpage>88</fpage>&#x2013;<lpage>106</lpage>. <pub-id pub-id-type="doi">10.1080/00332747.1969.11023575</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ekman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Friesen</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>1978</year>). <source>Facial Action Coding System: Investigator&#x2019;s Guide</source>. <publisher-name>Consulting Psychologists Press</publisher-name>. </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liong</surname>
<given-names>S.-T.</given-names>
</name>
<name>
<surname>Yau</surname>
<given-names>W. C.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Y. C.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>L. K.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>OFF-ApexNet on Micro-expression Recognition System</article-title>. <source>Signal Process. Image Commun.</source> <volume>74</volume>, <fpage>129</fpage>. <pub-id pub-id-type="doi">10.1016/j.image.2019.02.005</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhan</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Pietik&#xe4;inen</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Extended Local Binary Patterns for Efficient and Robust Spontaneous Facial Micro-expression Recognition</article-title>. <source>IEEE Access</source> <volume>7</volume>, <fpage>174517</fpage>&#x2013;<lpage>174530</lpage>. <pub-id pub-id-type="doi">10.1109/access.2019.2942358</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Squeeze-and-excitation Networks</article-title>,&#x201d; in <conf-name>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <fpage>7132</fpage>&#x2013;<lpage>7141</lpage>. <pub-id pub-id-type="doi">10.1109/cvpr.2018.00745</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ji</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>3D Convolutional Neural Networks for Human Action Recognition</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>35</volume>, <fpage>221</fpage>&#x2013;<lpage>231</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2012.59</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kazemi</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Sullivan</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>One Millisecond Face Alignment with an Ensemble of Regression Trees</article-title>,&#x201d; in <conf-name>2014 IEEE Conference on Computer Vision and Pattern Recognition</conf-name>, <fpage>1867</fpage>&#x2013;<lpage>1874</lpage>. <pub-id pub-id-type="doi">10.1109/cvpr.2014.241</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Pfister</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Pietik&#xe4;inen</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2013</year>). &#x201c;<article-title>A Spontaneous Micro-expression Database: Inducement, Collection and Baseline</article-title>,&#x201d; in <conf-name>10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG 2013)</conf-name>, <conf-loc>Shanghai, China</conf-loc>. <pub-id pub-id-type="doi">10.1109/FG.2013.6553717</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>T.-Y.</given-names>
</name>
<name>
<surname>Goyal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Dollar</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Focal Loss for Dense Object Detection</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>42</volume>, <fpage>318</fpage>&#x2013;<lpage>327</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2018.2858826</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liong</surname>
<given-names>S.-T.</given-names>
</name>
<name>
<surname>Gan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>See</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Khor</surname>
<given-names>H.-Q.</given-names>
</name>
</person-group> (<year>2019</year>). <source>A Shallow Triple Stream Three-Dimensional CNN (STSTNet) for Micro-expression Recognition System</source>. <comment>
<italic>arXiv preprint arXiv:1902.03634</italic>
</comment>. </citation>
</ref>
<ref id="B18">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Simonyan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>DARTS: Differentiable Architecture Search</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>. </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>K.-H.</given-names>
</name>
<name>
<surname>Jin</surname>
<given-names>Q.-S.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>H.-C.</given-names>
</name>
<name>
<surname>Gan</surname>
<given-names>Y.-S.</given-names>
</name>
<name>
<surname>Liong</surname>
<given-names>S.-T.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Micro-expression Recognition Using Advanced Genetic Algorithm</article-title>. <source>Signal Process. Image Commun.</source> <volume>93</volume>, <fpage>116153</fpage>. <pub-id pub-id-type="doi">10.1016/j.image.2021.116153</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Gedeon</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>A Neural Micro-expression Recognizer</article-title>,&#x201d; in <conf-name>14th IEEE International Conference on Automatic Face &#x26; Gesture Recognition (FG 2019)</conf-name>. <pub-id pub-id-type="doi">10.1109/fg.2019.8756583</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mirza</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Osindero</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2014</year>). <source>Conditional Generative Adversarial Nets</source>. <comment>ArXiv abs/1411.1784</comment>. </citation>
</ref>
<ref id="B22">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Patel</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Selective Deep Features for Micro-expression Recognition</article-title>,&#x201d; in <conf-name>2016 23rd International Conference on Pattern Recognition (ICPR)</conf-name>, <fpage>2258</fpage>&#x2013;<lpage>2263</lpage>. <pub-id pub-id-type="doi">10.1109/icpr.2016.7899972</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Polikovsky</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kameda</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ohta</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Facial Micro-expressions Recognition Using High Speed Camera and 3D-Gradient Descriptor</article-title>,&#x201d; in <conf-name>3rd International Conference on Imaging for Crime Detection and Prevention (ICDP 2009)</conf-name>. <pub-id pub-id-type="doi">10.1049/ic.2009.0244</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Quang</surname>
<given-names>N. V.</given-names>
</name>
<name>
<surname>Chun</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tokuyama</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>CapsuleNet for Micro-expression Recognition</article-title>,&#x201d; in <conf-name>2019 14th IEEE International Conference on Automatic Face &#x26; Gesture Recognition (FG 2019)</conf-name>, <fpage>1</fpage>&#x2013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1109/fg.2019.8756544</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Reddy</surname>
<given-names>S. P. T.</given-names>
</name>
<name>
<surname>Karri</surname>
<given-names>S. T.</given-names>
</name>
<name>
<surname>Dubey</surname>
<given-names>S. R.</given-names>
</name>
<name>
<surname>Mukherjee</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). <source>Spontaneous Facial Micro-expression Recognition Using 3D Spatiotemporal Convolutional Neural Networks</source>. <comment>
<italic>arXiv preprint arXiv:1904.01390</italic>
</comment>. </citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sabour</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Frosst</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Dynamic Routing between Capsules</source>. <comment>ArXiv abs/1710.09829</comment>. </citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>See</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yap</surname>
<given-names>M. H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.-J.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>MEGC 2019 &#x2013; the Second Facial Micro-expressions Grand Challenge</article-title>,&#x201d; in <conf-name>14th IEEE International Conference on Automatic Face &#x26; Gesture Recognition (FG 2019)</conf-name>. </citation>
</ref>
<ref id="B28">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>See</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Phan</surname>
<given-names>R. C.-W.</given-names>
</name>
<name>
<surname>Oh</surname>
<given-names>Y.-H.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>LBP with Six Intersection Points: Reducing Redundant Information in LBP-TOP for Micro-expression Recognition</article-title>,&#x201d; in <conf-name>Asian Conference on Computer Vision</conf-name> (<publisher-name>Springer</publisher-name>), <fpage>525</fpage>&#x2013;<lpage>537</lpage>. </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xia</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Khor</surname>
<given-names>H.-Q.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Revealing the Invisible with Model and Data Shrinking for Composite-Database Micro-expression Recognition</article-title>. <source>IEEE Trans. Image Process.</source> <volume>29</volume>, <fpage>8590</fpage>&#x2013;<lpage>8605</lpage>. <pub-id pub-id-type="doi">10.1109/tip.2020.3018222</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yan</surname>
<given-names>W. J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y. J.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y. H.</given-names>
</name>
<etal/>
</person-group> (<year>2014</year>). <article-title>CASME II: An Improved Spontaneous Micro-expression Database and the Baseline Evaluation</article-title>. <source>PLoS ONE</source> <volume>9</volume>, <fpage>e86041</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0086041</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2020</year>). <source>ICE-GAN: Identity-Aware and Capsule-Enhanced GAN for Micro-expression Recognition and Synthesis</source>. <comment>
<italic>ArXiv</italic> abs/2005.04370</comment>. </citation>
</ref>
<ref id="B32">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Arandjelovic</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2021</year>). <source>Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-expression Recognition</source>. <comment>CoRR abs/2112.05851</comment>. </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Pietikäinen</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>29</volume>, <fpage>915</fpage>&#x2013;<lpage>928</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2007.1110</pub-id> </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Micro-expression Recognition Based on Pixel Residual Sum and Cropped Gaussian Pyramid</article-title>. <source>Front. Neurorobot.</source> <volume>15</volume>, <fpage>746985</fpage>. <pub-id pub-id-type="doi">10.3389/fnbot.2021.746985</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>An Improved Micro-expression Recognition Method Based on Necessary Morphological Patches</article-title>. <source>Symmetry</source> <volume>11</volume>, <fpage>497</fpage>. <pub-id pub-id-type="doi">10.3390/sym11040497</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Necessary Morphological Patches Extraction for Automatic Micro-expression Recognition</article-title>. <source>Appl. Sci.</source> <volume>8</volume>, <fpage>1811</fpage>. <pub-id pub-id-type="doi">10.3390/app8101811</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Mao</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Xue</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Dual-inception Network for Cross-Database Micro-expression Recognition</article-title>,&#x201d; in <conf-name>2019 14th IEEE International Conference on Automatic Face &amp; Gesture Recognition (FG 2019)</conf-name>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/fg.2019.8756579</pub-id> </citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://dlib.net/face_landmark_detection.py.html">http://dlib.net/face_landmark_detection.py.html</ext-link>
</p>
</fn>
<fn id="fn2">
<label>2</label>
<p><xref ref-type="bibr" rid="B30">Yan et al. (2014)</xref> reported 247 samples, whereas the publicly available dataset contains about 255.</p>
</fn>
</fn-group>
</back>
</article>