<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neuroinform.</journal-id>
<journal-title>Frontiers in Neuroinformatics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neuroinform.</abbrev-journal-title>
<issn pub-type="epub">1662-5196</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fninf.2025.1526259</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>An action decoding framework combined with deep neural network for predicting the semantics of human actions in videos from evoked brain activities</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Yuanyuan</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/2964177/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Tian</surname> <given-names>Manli</given-names></name>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Liu</surname> <given-names>Baolin</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2890430/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>School of Computer and Communication Engineering, University of Science and Technology Beijing</institution>, <addr-line>Beijing</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Ying Xie, Kennesaw State University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Lili Zhang, Hewlett Packard Enterprise, United States</p><p>Linh Le, Kennesaw State University, United States</p><p>Marcos C&#x00E9;sar de Oliveira, State University of Campinas, Brazil</p></fn>
<corresp id="c001">&#x002A;Correspondence: Baolin Liu, <email>liubaolin@ustb.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>19</day>
<month>02</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>19</volume>
<elocation-id>1526259</elocation-id>
<history>
<date date-type="received">
<day>11</day>
<month>11</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>30</day>
<month>01</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2025 Zhang, Tian and Liu.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Zhang, Tian and Liu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>Recently, numerous studies have focused on the semantic decoding of perceived images based on functional magnetic resonance imaging (fMRI) activities. However, it remains unclear whether it is possible to establish relationships between brain activities and semantic features of human actions in video stimuli. Here we construct a framework for decoding action semantics by establishing relationships between brain activities and semantic features of human actions.</p>
</sec>
<sec>
<title>Methods</title>
<p>To effectively use the small amount of available brain activity data, our proposed method employs a pre-trained action recognition network based on the expanding three-dimensional (X3D) deep neural network (DNN) framework. To apply brain activities to the action recognition network, we train regression models that learn the relationship between brain activities and deep-layer features. To improve decoding accuracy, we add a non-local attention module to the X3D model to capture long-range temporal and spatial dependencies, propose a multilayer perceptron (MLP) module with a multi-task loss constraint to build a more accurate regression mapping, and perform data augmentation through linear interpolation to expand the amount of data and reduce the impact of the small sample size.</p>
</sec>
<sec>
<title>Results and discussion</title>
<p>Our findings indicate that the features in the X3D-DNN are biologically relevant, and capture information useful for perception. The proposed method enriches the semantic decoding model. We have also conducted several experiments with data from different subsets of brain regions known to process visual stimuli. The results suggest that semantic information for human actions is widespread across the entire visual cortex.</p>
</sec>
</abstract>
<kwd-group>
<kwd>functional magnetic resonance imaging</kwd>
<kwd>decoding</kwd>
<kwd>action semantics</kwd>
<kwd>three-dimensional convolutional neural network</kwd>
<kwd>multi-subject model</kwd>
</kwd-group>
<contract-num rid="cn001">FRF-TP-24-021A</contract-num>
<contract-num rid="cn001">FRF-MP-19-007</contract-num>
<contract-num rid="cn001">FRF-TP-20-065A1Z</contract-num>
<contract-num rid="cn002">U2133218</contract-num>
<contract-sponsor id="cn001">Fundamental Research Funds for the Central Universities<named-content content-type="fundref-id">10.13039/501100012226</named-content></contract-sponsor>
<contract-sponsor id="cn002">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor>
<counts>
<fig-count count="7"/>
<table-count count="3"/>
<equation-count count="5"/>
<ref-count count="46"/>
<page-count count="12"/>
<word-count count="8726"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>1 Introduction</title>
<p>In recent years, the brain&#x2019;s capacity for semantic decoding of visual stimuli has become a popular area in cognitive neuroscience research. The study of semantic decoding of brain activity not only provides a better understanding of the brain&#x2019;s cognitive mechanisms but also promotes the development of artificial intelligence. Among the many techniques that can be used to measure brain activity, functional magnetic resonance imaging (fMRI) is advantageous given its high spatiotemporal resolution (<xref ref-type="bibr" rid="B10">Engel et al., 1997</xref>).</p>
<p>Numerous studies have developed various methods to estimate the semantic information associated with brain activity based on fMRI data (<xref ref-type="bibr" rid="B2">Akamatsu et al., 2018</xref>; <xref ref-type="bibr" rid="B24">Li et al., 2021</xref>; <xref ref-type="bibr" rid="B36">Stansbury et al., 2013</xref>; <xref ref-type="bibr" rid="B37">Takada et al., 2020</xref>; <xref ref-type="bibr" rid="B42">Vodrahalli et al., 2018</xref>). In early studies, statistical methods such as support vector machine (SVM) classifiers (<xref ref-type="bibr" rid="B7">Cortes and Vapnik, 1995</xref>) and linear discriminant analysis (LDA) (<xref ref-type="bibr" rid="B12">Fisher, 2012</xref>) were used to decode semantic information (<xref ref-type="bibr" rid="B20">Huth et al., 2012</xref>; <xref ref-type="bibr" rid="B19">Huth et al., 2016</xref>; <xref ref-type="bibr" rid="B36">Stansbury et al., 2013</xref>) by directly classifying the categories corresponding to fMRI activities. Recent studies incorporating technological advancements in cognitive neuroscience methods have revealed that deep neural network models (DNNs) can partially explain the brain&#x2019;s responses to visual stimuli (<xref ref-type="bibr" rid="B6">Cichy et al., 2016</xref>; <xref ref-type="bibr" rid="B9">Eickenberg et al., 2016</xref>; <xref ref-type="bibr" rid="B14">G&#x00FC;&#x00E7;l&#x00FC; and van Gerven, 2015</xref>; <xref ref-type="bibr" rid="B45">Yamins et al., 2014</xref>). DNN representations can provide accurate predictions of neural responses in both the ventral (object recognition) and the dorsal (motion processing/recognition) visual pathways. 
Many decoding studies have utilized DNN representations to construct models for decoding semantic information of both observed (<xref ref-type="bibr" rid="B3">Akamatsu et al., 2020</xref>; <xref ref-type="bibr" rid="B17">Horikawa and Kamitani, 2017a</xref>; <xref ref-type="bibr" rid="B26">Matsuo et al., 2018</xref>; <xref ref-type="bibr" rid="B44">Wen et al., 2018</xref>) and imagined (<xref ref-type="bibr" rid="B18">Horikawa and Kamitani, 2017b</xref>) picture stimuli based on the brain activities. They establish a regression mapping from fMRI to DNN representations, and convert the predicted representations into semantic tags through the pre-trained classifier. As the classifier is separated from the regression mapping, the model can be expanded by retraining the classifier with labeled images without changing the semantic representation space. Compared with models that directly classify fMRI data, the model based on DNN representations provides an effective extension of decoding capabilities (<xref ref-type="bibr" rid="B44">Wen et al., 2018</xref>).</p>
<p>However, these semantic decoding studies have mainly explored scene/object semantic decoding of perceived images based on DNN representations of fMRI activities (<xref ref-type="bibr" rid="B3">Akamatsu et al., 2020</xref>; <xref ref-type="bibr" rid="B17">Horikawa and Kamitani, 2017a</xref>; <xref ref-type="bibr" rid="B26">Matsuo et al., 2018</xref>). By contrast, only a few studies have investigated the semantic decoding of human actions in videos based on fMRI data, and it is unclear whether and to what extent a DNN could decode the brain&#x2019;s responses to human actions in video stimuli. Among recent studies, <xref ref-type="bibr" rid="B15">G&#x00FC;&#x00E7;l&#x00FC; and van Gerven (2017)</xref> demonstrated that the spatiotemporal features of natural movies extracted by a 3D Convolutional (C3D) network (<xref ref-type="bibr" rid="B40">Tran et al., 2015</xref>) optimized for action recognition can accurately predict how dorsal stream areas respond to dynamic changes in natural video stimuli. This method extends beyond the capabilities of previous models, which could only learn hand-designed spatiotemporal representations (<xref ref-type="bibr" rid="B28">Nishimoto and Gallant, 2011</xref>; <xref ref-type="bibr" rid="B34">Rust et al., 2005</xref>). However, the DNN used by <xref ref-type="bibr" rid="B15">G&#x00FC;&#x00E7;l&#x00FC; and van Gerven (2017)</xref> still lacks the ability to model long-term dependencies. The expanding three-dimensional (X3D) (<xref ref-type="bibr" rid="B11">Feichtenhofer, 2020</xref>) deep neural network model expands a base architecture not only along the temporal dimension but also along other dimensions such as spatial resolution and frame rate, while remaining extremely light in terms of network width and parameters. 
Recently, the vision transformer and the video vision transformer were proposed for image and video recognition, in which a self-attention mechanism is used to capture the relationships between features globally (<xref ref-type="bibr" rid="B4">Arnab et al., 2021</xref>; <xref ref-type="bibr" rid="B8">Dosovitskiy et al., 2021</xref>). Although this alleviates the neglect of global integration, computation becomes considerably more time-consuming, since transformers contain many more trainable parameters than CNNs with the same number of layers. In some action recognition tasks, convolutional neural networks such as X3D outperformed Transformer models (<xref ref-type="bibr" rid="B23">Lai et al., 2023</xref>).</p>
<p>Inspired by recent studies, we construct a baseline framework for decoding human action semantics in videos by establishing relationships between brain activities and semantic features of human actions extracted by a DNN. We aim to determine whether long-term dependent action features and the use of multi-layer DNN features can help improve the mapping from fMRI data to action features. The framework consists of two parts. The first part captures action features containing spatiotemporal dynamic information through an X3D (<xref ref-type="bibr" rid="B11">Feichtenhofer, 2020</xref>) deep neural network model. In this part, a non-local attention mechanism (<xref ref-type="bibr" rid="B43">Wang et al., 2018</xref>) is added to the X3D deep learning model to extract long-term dependent action features, which helps to overcome the deep learning model&#x2019;s inability to capture long-term dependencies. The second part builds a regression model from fMRI to action features and converts the predicted representations into semantic tags through the pre-trained classifier to decode the action semantics in the videos. In this part, we use a multi-layer feature loss as a constraint term of the MLP to establish an accurate mapping from fMRI to action features. To demonstrate the advantage of the multi-layer feature loss-constrained MLP approach, we compare its decoding performance with ridge regression, K-nearest neighbor (KNN), and plain MLP regression, which are approaches widely used in decoding research (<xref ref-type="bibr" rid="B26">Matsuo et al., 2018</xref>; <xref ref-type="bibr" rid="B31">Papadimitriou et al., 2019</xref>; <xref ref-type="bibr" rid="B33">Qiao et al., 2018</xref>; <xref ref-type="bibr" rid="B44">Wen et al., 2018</xref>). 
In this study, we use the fMRI dataset published by <xref ref-type="bibr" rid="B39">Tarhan and Konkle (2020)</xref>, which consists of fMRI recordings of 13 people watching videos of daily human behavior. We also perform fMRI data augmentation through linear interpolation to expand the amount of data and improve decoding performance. Finally, we conduct several experiments using reliable voxels acquired from the whole cortex and reliable voxels from specific brain regions, to gain further insight into the brain&#x2019;s understanding of action videos.</p>
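The linear-interpolation augmentation mentioned above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' exact implementation; the function and variable names (`interpolate_samples`, `fmri_a`, etc.) are hypothetical, and it assumes that each fMRI pattern and its target action feature are interpolated in lockstep so that the synthetic pairs stay consistent:

```python
import numpy as np

def interpolate_samples(fmri_a, fmri_b, feat_a, feat_b, alphas=(0.25, 0.5, 0.75)):
    """Create synthetic (fMRI, action-feature) training pairs by linearly
    interpolating between two samples of the same action class.
    Each synthetic fMRI pattern is paired with the identically
    interpolated target feature, keeping inputs and targets consistent."""
    new_fmri, new_feat = [], []
    for a in alphas:
        new_fmri.append(a * fmri_a + (1 - a) * fmri_b)
        new_feat.append(a * feat_a + (1 - a) * feat_b)
    return np.stack(new_fmri), np.stack(new_feat)
```

Interpolating only within an action class is a natural choice here, since mixing classes would blur the semantic label attached to each synthetic pair.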
</sec>
<sec id="S2" sec-type="materials|methods">
<title>2 Materials and methods</title>
<sec id="S2.SS1">
<title>2.1 Overview</title>
<p><xref ref-type="fig" rid="F1">Figure 1</xref> summarizes the proposed framework for decoding human action semantics in videos based on fMRI data. The decoding model is divided into two parts: (1) Action features are extracted with the action recognition model X3D. The X3D model takes a raw video clip as input, from which 16 frames are uniformly sampled in the data layer stage. A non-local attention mechanism is added to X3D to capture long-range temporal and spatial dependencies, overcoming the deep learning model&#x2019;s inability to capture long-term dependencies. The first fully connected layer of X3D is the action feature extraction layer, and its output is the action feature corresponding to the video stimulus. (2) An MLP regression model, built as a three-layer fully connected network, maps fMRI data to action features. The predicted action features are then fed into the second fully connected layer of the X3D model, which serves as the semantic classification layer defined by the X3D action recognition model, to extract semantic content. To achieve a more accurate mapping between fMRI and deep learning representations, a multiple-layer feature mean square error (MSE) is incorporated into the MLP model&#x2019;s loss function. The model is trained on multi-subject fMRI data and subsequently tested on an unseen subject to evaluate its generalization capability.</p>
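As a rough sketch of this regression stage, the following NumPy stand-in shows a three-layer fully connected mapping from an fMRI pattern to a 128-d action feature, returning the intermediate activations so that a multi-layer MSE can be formed. The layer widths, ReLU activations, loss weighting, and the exact form of the multi-layer constraint are assumptions for illustration, and `mlp_predict`/`multilayer_mse_loss` are hypothetical names:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def mlp_predict(fmri, params):
    """Three-layer fully connected regression from a reliable-voxel fMRI
    pattern to a 128-d action feature. Intermediate activations are
    returned so extra feature-layer MSE terms can be computed."""
    w1, b1, w2, b2, w3, b3 = params
    h1 = relu(fmri @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    feat = h2 @ w3 + b3          # predicted action feature
    return feat, (h1, h2)

def multilayer_mse_loss(pred_feat, true_feat, hiddens, targets, weight=0.1):
    """Output-layer MSE plus weighted MSE terms on additional feature
    layers -- one plausible form of the multi-layer loss constraint."""
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    loss = mse(pred_feat, true_feat)
    for h, t in zip(hiddens, targets):
        loss += weight * mse(h, t)
    return loss
```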
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Overview of the proposed method. <bold>(a)</bold> The X3D model. The orange arrows represent the learning process of the deep learning model X3D from video input to output action semantics. <bold>(b)</bold> The MLP regression model. The blue arrows represent the process of predicting action features from fMRI, and then from the predicted features to action semantics. Images were taken from <xref ref-type="bibr" rid="B39">Tarhan and Konkle (2020)</xref> (Creative Commons License CC BY 4.0, <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-19-1526259-g001.tif"/>
</fig>
</sec>
<sec id="S2.SS2">
<title>2.2 Dataset and preprocessing</title>
<p>The functional magnetic resonance imaging dataset published by <xref ref-type="bibr" rid="B39">Tarhan and Konkle (2020)</xref> is used in this study; it contains recordings from 13 subjects who watched videos of typical daily human behavior. The stimuli consist of 120 videos (duration: 2.5 s) reflecting 60 types of daily human action (e.g., running, cooking, riding a bike), which are obtained from YouTube, Vine, the Human Movement Database (<xref ref-type="bibr" rid="B22">Kuehne et al., 2011</xref>) and the University of Central Florida&#x2019;s Action Recognition Data Set (<xref ref-type="bibr" rid="B35">Soomro et al., 2012</xref>). Each video stimulus is 512 &#x00D7; 512 pixels in size and is presented on a 41.5 &#x00D7; 41.5 cm screen, subtending a visual angle of approximately 9 &#x00D7; 9 degrees in the participant&#x2019;s field of view.</p>
<p>The 120 videos are divided into two sets, with each set containing a video for each of the 60 actions. During the experiment, each participant is required to complete eight runs. In each run, participants watch all 60 action videos from one of the two sets. The videos are displayed in a random order, and each 2.5 s video is shown twice consecutively in each run. To avoid visually jarring transitions between video presentations, a 500 ms time window is used for fading in and out to a uniform gray background at the start and end of each presentation, respectively. The fixation period is 4 s at the beginning of the run and 10 s at the end. Additionally, four 15 s blocks of fixation are interspersed throughout the run.</p>
<p>Imaging data are collected using a 3T Siemens Prisma functional magnetic resonance imaging scanner. High-resolution T1-weighted anatomical scans are obtained using the 3D MPRAGE protocol [repetition time (TR) = 2,530 ms; echo time (TE) = 1.69 ms; FoV = 256 mm; 1 &#x00D7; 1 &#x00D7; 1 mm voxel resolution; 176 sagittal slices; gap thickness = 0 mm; flip angle (FA) = 7&#x00B0;]. Blood oxygenation level-dependent (BOLD) contrast functional scans are obtained using a gradient echo-planar T2&#x002A; sequence (84 oblique axial slices acquired at a 25&#x00B0; angle off of the anterior commissure-posterior commissure line; TR = 2,000 ms; TE = 30 ms; FoV = 204 mm; 1.5 &#x00D7; 1.5 &#x00D7; 1.5 mm voxel resolution; gap thickness = 0 mm; FA = 80&#x00B0;; multi-band acceleration factor = 3). The collected fMRI data undergo the corresponding preprocessing steps using Brain Voyager QX software, including slice-scan time correction, three-dimensional motion correction, linear trend removal, temporal high-pass filtering (cut-off of 0.008 Hz), spatial smoothing [4 mm full-width at half-maximum (FWHM) kernel], and normalization to Talairach space.</p>
<p>Whole-brain random-effects general linear models (GLMs) are fit for each participant, applied to each video set and to the odd and even runs of each video set. In all cases, square-wave regressors for each 5 s stimulus presentation are convolved with a two-gamma function approximating the idealized hemodynamic response, and the regressors for each condition block are included in the design matrix. In these GLMs, the mean variance inflation factor across design matrix conditions is 1.03 (values greater than five are considered problematic) and the mean efficiency is 0.21. The voxel time series are z-normalized within each run and corrected for temporal autocorrelation during GLM fitting, using a second-order autoregressive model, AR(2). Because reliable coverage differs across participants, cross-subject comparison is challenging; therefore, the decoding model is analyzed on the same voxels as those obtained in the random-effects group GLM. The fMRI responses corresponding to 39 action stimulus videos are selected for subsequent analysis.</p>
</sec>
<sec id="S2.SS3">
<title>2.3 Reliability voxel selection</title>
<p>We adopt the reliable voxel selection method of <xref ref-type="bibr" rid="B39">Tarhan and Konkle (2020)</xref> to process the fMRI data. Reliability-based voxel selection retains voxels that show systematic differences in activation across the different actions, removing less reliable voxels and voxels that respond equally to all actions. Additionally, this method requires voxels to show similar activation levels across repeated presentations of the same actions. Thus, the selected voxels have some tolerance to very low-level features. Split-half reliability is calculated for every voxel by correlating the betas extracted from even and odd runs. Reliability is computed in two ways: within-set reliability is calculated by correlating the odd and even betas of each video set separately and then averaging the resulting r-maps, while cross-set reliability is calculated by correlating the odd and even betas of GLMs computed on the two video sets. Cross-set reliable voxels have relatively low reliability in early visual areas, whereas within-set reliable voxels have better coverage of the early visual cortex. For both types of reliability, the procedure from <xref ref-type="bibr" rid="B38">Tarhan and Konkle (2019)</xref> is used to select reliability-based cutoffs. First, the reliability of each video&#x2019;s multi-voxel response pattern is plotted across a range of candidate cutoffs. Then, the cutoff is chosen at the point where the multi-voxel pattern reliability for all videos starts to stabilize. Using this approach, a cutoff of 0.3 proves reasonable: any voxel with an average reliability of 0.3 or higher is included in the feature modeling analysis, as this maximizes reliability without sacrificing too much coverage. This cutoff holds for both group and single-subject data. 
Finally, the reliable voxels (<xref ref-type="fig" rid="F2">Figure 2</xref>) extend along a broad swath of the ventral and parietal cortices, covering the lateral occipitotemporal cortex (OTC), the ventral OTC and the intra-parietal sulcus (IPS).</p>
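The split-half reliability computation described above can be sketched in a few lines. This is an illustrative NumPy version (the function names are hypothetical), assuming the betas are arranged as (videos &#x00D7; voxels) matrices for the odd and even runs:

```python
import numpy as np

def split_half_reliability(odd_betas, even_betas):
    """Voxel-wise Pearson correlation between betas from odd and even
    runs. Inputs are (n_videos, n_voxels) arrays; returns one r per voxel."""
    o = odd_betas - odd_betas.mean(axis=0)
    e = even_betas - even_betas.mean(axis=0)
    num = (o * e).sum(axis=0)
    den = np.sqrt((o ** 2).sum(axis=0) * (e ** 2).sum(axis=0))
    return num / den

def reliable_voxel_mask(odd_betas, even_betas, cutoff=0.3):
    """Keep voxels whose split-half reliability meets the r >= 0.3 cutoff."""
    return split_half_reliability(odd_betas, even_betas) >= cutoff
```

For the within-set variant, this correlation would be computed per video set and the resulting r-maps averaged before thresholding.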
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Whole-brain map of split-half voxel reliability. Reliable voxels (<italic>r</italic> &#x003E; 0.30) are selected based on <xref ref-type="bibr" rid="B38">Tarhan and Konkle (2019)</xref>. These results are based on group data.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-19-1526259-g002.tif"/>
</fig>
<p>In order to clarify the cognitive mechanism of action understanding in the brain, we select several regions of interest for classification and decoding. According to previous cognitive neuroscience studies, motion perception in the human brain involves not only visual regions but also an Action Observation Network, composed of three core regions: the occipito-temporal, parietal, and premotor regions. The primary motor cortex (Brodmann area 4), supplementary motor cortex/premotor cortex (Brodmann area 6), primary visual cortex (Brodmann area 17), secondary visual cortex (Brodmann area 18) and higher visual cortex (Brodmann area 19) are selected as regions of interest for analysis according to the Brodmann template.</p>
</sec>
<sec id="S2.SS4">
<title>2.4 Extraction of action features</title>
<p>In this study, the three-dimensional convolutional neural network X3D (<xref ref-type="bibr" rid="B11">Feichtenhofer, 2020</xref>) is used to extract action features from human action videos. The model is pre-trained on the large-scale action recognition database Kinetics-400 (<xref ref-type="bibr" rid="B21">Kay et al., 2017</xref>). X3D consists of nine layers. The first is a three-dimensional convolutional layer, followed by four ResNet stages containing 3, 5, 11, and 7 ResNet blocks, respectively. Each ResNet block includes three convolutional layers, with 1 &#x00D7; 1 &#x00D7; 1, 3 &#x00D7; 3 &#x00D7; 3, and 1 &#x00D7; 1 &#x00D7; 1 convolution kernels. The last four layers are a convolutional layer, a global average pooling layer with a 432-dimensional output, and two fully connected layers. The output dimensions of the two fully connected layers are 2,048 and 400, respectively, where 2,048 is the dimension of the extracted action features and 400 is the number of action categories. The model retains the temporal input resolution throughout the entire network hierarchy and does not perform temporal downsampling until the global average pooling layer.</p>
<p>Normally, three-dimensional CNN models involve stacking spatiotemporal convolution operations. However, convolution is a local operation in space and time: a CNN only captures information within a small spatial or temporal neighborhood, and it is difficult to capture dependencies between more distant locations. To compensate for this limitation of X3D, we incorporate a non-local attention mechanism (<xref ref-type="bibr" rid="B1">Affolter et al., 2020</xref>) before the global average pooling layer. The non-local attention mechanism directly captures remote dependencies by calculating the relationship between any two locations, regardless of their distance from one another. In this study, we compare the decoding performance of X3D with and without the non-local attention mechanism to explore whether long-term spatiotemporal dependencies are more beneficial for decoding.</p>
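The non-local operation can be illustrated with the embedded-Gaussian form described by Wang et al. (2018). The sketch below operates on a spatiotemporal feature map flattened to a (THW &#x00D7; C) matrix, with illustrative projection matrices; it is a simplification of the block actually inserted into X3D (which also includes batch normalization and convolutional projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """Embedded-Gaussian non-local operation on a flattened feature map
    x of shape (THW, C). w_theta/w_phi/w_g project to an inner dimension;
    w_out projects back to C. Every output position aggregates information
    from ALL positions, however distant in space or time."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g   # (THW, C_inner)
    attn = softmax(theta @ phi.T)                     # pairwise relations between positions
    y = attn @ g                                      # aggregate over all positions
    return x + y @ w_out                              # residual connection
```

The residual connection means the block can be dropped into a pre-trained network without disturbing its initial behavior: with `w_out` initialized to zero, the block is an identity mapping.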
<p>In addition, due to the small amount of fMRI data, the 2,048-dimensional action features extracted by X3D are not conducive to establishing the regression mapping. Therefore, we modify the structure of the X3D model: we set the output dimension of the first fully connected layer to 512, 256, and 128, respectively, and set the output of the last fully connected layer to 39, as 39 semantic categories overlap between our stimulus set and the large-scale action recognition database Kinetics. We then fine-tune the adjusted X3D model. The output dimension of the penultimate layer is finally set to 128 based on a comparison of decoding results. To retrain the modified model, we uniformly sample videos from the corresponding 39 semantic categories of the Kinetics-400 database. The training set contains approximately 14,323 videos, and the validation set contains 1,916. Finally, through transfer learning based on the trained X3D model, the action features of the stimuli are extracted directly. The main model structure and parameters are shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>The model structure of expanding three-dimensional (X3D).</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Stage</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Filters</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Output sizes (T &#x00D7; H &#x00D7; W)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Input</td>
<td valign="top" align="center">Uniformly sample 16 frames</td>
<td valign="top" align="center">16 &#x00D7;224 &#x00D7; 224</td>
</tr>
<tr>
<td valign="top" align="left">Conv1</td>
<td valign="top" align="center">1 &#x00D7; 3 &#x00D7; 3, 24</td>
<td valign="top" align="center">16 &#x00D7;112 &#x00D7; 112</td>
</tr>
<tr>
<td valign="top" align="left">Res2</td>
<td valign="top" align="center"><inline-graphic xlink:href="fninf-19-1526259-i001.jpg"/></td>
<td valign="top" align="center">16 &#x00D7;56 &#x00D7; 56</td>
</tr>
<tr>
<td valign="top" align="left">Res3</td>
<td valign="top" align="center"><inline-graphic xlink:href="fninf-19-1526259-i002.jpg"/></td>
<td valign="top" align="center">16 &#x00D7;28 &#x00D7; 28</td>
</tr>
<tr>
<td valign="top" align="left">Res4</td>
<td valign="top" align="center"><inline-graphic xlink:href="fninf-19-1526259-i003.jpg"/></td>
<td valign="top" align="center">16 &#x00D7; 14 &#x00D7; 14</td>
</tr>
<tr>
<td valign="top" align="left">Res5</td>
<td valign="top" align="center"><inline-graphic xlink:href="fninf-19-1526259-i004.jpg"/></td>
<td valign="top" align="center">16 &#x00D7; 7 &#x00D7; 7</td>
</tr>
<tr>
<td valign="top" align="left">Conv5</td>
<td valign="top" align="center">1 &#x00D7;3 &#x00D7; 3, 432</td>
<td valign="top" align="center">16 &#x00D7; 7 &#x00D7; 7</td>
</tr>
<tr>
<td valign="top" align="left">Pool5</td>
<td valign="top" align="center">16 &#x00D7; 7 &#x00D7; 7</td>
<td valign="top" align="center">1 &#x00D7; 1 &#x00D7; 1</td>
</tr>
<tr>
<td valign="top" align="left">Fc1</td>
<td valign="top" align="center">1 &#x00D7; 1 &#x00D7; 1, 2,048</td>
<td valign="top" align="center">1 &#x00D7; 1 &#x00D7; 1</td>
</tr>
<tr>
<td valign="top" align="left">Fc2</td>
<td valign="top" align="center">1 &#x00D7; 1 &#x00D7; 1,39</td>
<td valign="top" align="center">1 &#x00D7;1 &#x00D7; 1</td>
</tr>
</tbody>
</table></table-wrap>
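The head modification described above can be sketched as follows. This NumPy stand-in only illustrates the dimensionalities (432-d pooled features &#x2192; 128-d action-feature layer &#x2192; 39-way class scores); the ReLU placement and all weight values are assumptions rather than the exact fine-tuned layers, and `modified_head` is a hypothetical name:

```python
import numpy as np

def modified_head(pooled, w1, b1, w2, b2):
    """Modified X3D head: 432-d pooled features -> 128-d action-feature
    layer (Fc1) -> 39-way class scores (Fc2). Expected weight shapes:
    w1 (432, 128), w2 (128, 39). The 128-d activations are the action
    features used as regression targets; the logits feed the classifier."""
    feat = np.maximum(pooled @ w1 + b1, 0)   # action feature layer (ReLU assumed)
    logits = feat @ w2 + b2                  # scores for the 39 overlapping categories
    return feat, logits
```

Keeping the feature layer (Fc1) separate from the classification layer (Fc2) is what allows predicted features from the fMRI regression model to be converted into semantic tags by the pre-trained classifier alone.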
<p>In the model fine-tuning process, 16 frames are uniformly sampled from each video as input, and each frame is 224 &#x00D7; 224 pixels. In the last fully connected layer, the input is the action feature representation, <italic>y</italic>, from the penultimate layer of X3D, and the output is the normalized probability, <italic>q</italic>, with which the action video is classified into each category. The model is trained using stochastic gradient descent (SGD) to minimize the cross-entropy loss between the predicted probability <italic>q</italic> and the true distribution <italic>p</italic>. Cross-entropy loss describes the distance between the actual output distribution and the expected output distribution; the smaller the cross-entropy loss, the closer the two probability distributions are. The predicted probability is expressed as:</p>
<disp-formula id="S2.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mtable displaystyle="true">
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>q</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mfrac>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mo largeop="true" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>j</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</disp-formula>
<p>where <italic>N</italic> represents the number of all categories, and <italic>W</italic> and <italic>b</italic> represent the weight and bias. To fine-tune the X3D model, the minimized cross-entropy loss <italic>H</italic> is expressed as follows:</p>
<disp-formula id="S2.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>,</mml:mo>
<mml:mtext>&#x00A0;</mml:mtext>
<mml:mi>q</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mtext>&#x00A0;</mml:mtext>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>&#x2211;</mml:mi>
<mml:mi>x</mml:mi>
</mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>When fine-tuning the model, the batch size is set to 8 and the learning rate to 0.005. After training for 15 epochs, the model with the best performance on the validation set is selected as the final model and is then directly transferred to perform feature extraction on the stimuli.</p>
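Equations (1) and (2) can be sketched numerically as follows. This is a minimal NumPy illustration, not the fine-tuned X3D classifier: the weights and feature vector are random placeholders, with dimensions taken from Table 1 (432-D penultimate feature, 39 categories).

```python
import numpy as np

def softmax_probs(y, W, b):
    """Eq. (1): class probabilities q from the penultimate-layer feature y."""
    logits = W @ y + b                      # one logit per category
    e = np.exp(logits - logits.max())       # shift for numerical stability
    return e / e.sum()

def cross_entropy(p, q):
    """Eq. (2): H(p, q) = -sum_x p(x) log q(x)."""
    return -np.sum(p * np.log(q + 1e-12))

rng = np.random.default_rng(0)
y = rng.normal(size=432)                    # penultimate-layer feature (placeholder)
W = rng.normal(size=(39, 432)) * 0.01       # 39 action categories (placeholder weights)
b = np.zeros(39)

q = softmax_probs(y, W, b)                  # normalized probabilities, sums to 1
p = np.zeros(39); p[5] = 1.0                # one-hot true label
loss = cross_entropy(p, q)                  # smaller means q is closer to p
```

Minimizing this loss over the training set is what drives the predicted distribution `q` toward the one-hot target `p` during fine-tuning.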
</sec>
<sec id="S2.SS5">
<title>2.5 Model for decoding action semantics based on fMRI</title>
<p>The decoding process consists of two steps: (1) establishing a regression model from the fMRI data to the action features; (2) feeding the predicted action features into the semantic classification model, predefined by the X3D action recognition model, to obtain the semantic content. In addition, we directly adopt a data-driven approach to achieve brain decoding for unseen subjects: the regression model is trained on n-1 subjects and verified on the data of the remaining subject. The final decoding result is obtained by leave-one-subject-out cross-validation. Decoding unseen subjects is remarkably challenging, since fMRI data differ considerably across subjects, owing among other reasons to the lack of alignment and the variable numbers of voxels between subjects (<xref ref-type="bibr" rid="B1">Affolter et al., 2020</xref>).</p>
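The leave-one-subject-out protocol can be sketched as below. `fit_and_score` is a hypothetical stand-in for training the regression model on the pooled n-1 subjects and returning decoding accuracy on the held-out subject; the subject labels are illustrative.

```python
import numpy as np

def leave_one_subject_out(subject_data, fit_and_score):
    """Train on n-1 subjects, test on the held-out one; average over all folds."""
    scores = []
    subjects = list(subject_data)
    for held_out in subjects:
        train = [subject_data[s] for s in subjects if s != held_out]
        scores.append(fit_and_score(train, subject_data[held_out]))
    return float(np.mean(scores))

# toy usage: the "score" here just counts the training subjects in each fold
data = {f"sub-{i:02d}": None for i in range(1, 6)}
avg = leave_one_subject_out(data, lambda train, test: len(train))  # 4 in every fold
```

Every fold holds out one subject completely, so the reported accuracy always reflects generalization to a subject the model has never seen.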
<p>In this study, we first use principal component analysis (PCA) to extract 1,000 principal components from the 43,949-dimensional gray matter fMRI data, after which a regression mapping model is established between the reduced fMRI data and the action features. Because the amount of fMRI data is small, we introduce X3D&#x2019;s multi-layer features as an auxiliary task to assist the learning of the regression model, so that the model gradually approaches the X3D action semantic space. To achieve this goal, we propose a multi-task, loss-constrained MLP consisting of three fully connected layers with 1,000, 432, and 128 dimensions, respectively, and additionally add the X3D&#x2019;s penultimate-layer feature constraint as a second loss term of the regression model. Except for the last layer, we append batch normalization and dropout at a rate of 0.4 to each layer to prevent overfitting. The mean square error (MSE) is used as the final measurement standard, and the loss function is minimized to update the parameters of the three-layer fully connected network:</p>
<disp-formula id="S2.E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:msub>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mtext>&#x00A0;</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>n</mml:mi>
</mml:mfrac>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mstyle displaystyle='true'>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent='true'>
<mml:mi>y</mml:mi>
<mml:mo>&#x005E;</mml:mo>
</mml:mover>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
</mml:mrow>
</mml:mstyle>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mstyle displaystyle='true'>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mover accent='true'>
<mml:mi>y</mml:mi>
<mml:mo>&#x005E;</mml:mo>
</mml:mover>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>n</italic> represents the number of all samples, <italic>y</italic><sub>0</sub> and <italic>y</italic><sub>1</sub> represent the true value of the output of the last one layer and the penultimate layer, respectively, and <inline-formula><mml:math id="INEQ5"><mml:msub><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mn>0</mml:mn></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="INEQ6"><mml:msub><mml:mover accent="true"><mml:mi>y</mml:mi><mml:mo stretchy="false">^</mml:mo></mml:mover><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> represent the predicted values. <italic>w</italic><sub>0</sub> and <italic>w</italic><sub>1</sub> represent the weights of the first and second loss items, respectively, and the final values are determined to be 2.2 and 1, respectively according to the accuracy of semantic decoding on the verification set. In the process of training the three-layer fully connected layer, we use the SGD optimizer, the batchsize is set to 16, and the learning rate is set to 0.0001. After 1,000 epochs, training stops.</p>
<p>To evaluate classification accuracy, we use Top-1 and Top-5 prediction accuracy. Specifically, for a given action video, we rank the action semantic categories in descending order of the probability estimated from fMRI. If the true category ranks first, the prediction is considered Top-1 accurate; if the true category is among the top 5 ranked categories, it is considered Top-5 accurate. In addition, we use the pairwise classification criterion (<xref ref-type="bibr" rid="B1">Affolter et al., 2020</xref>; <xref ref-type="bibr" rid="B31">Papadimitriou et al., 2019</xref>) to assess whether the regression mapping model establishes an accurate mapping from fMRI to action features. For each video, we compute the correlation between the predicted vector and the actual vector. If a predicted vector is more similar to its corresponding action features than to the alternatives, the decoding is deemed correct. The random baseline is 50%.</p>
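The two evaluation criteria can be sketched as follows. This is a minimal NumPy version; the pairwise function implements one common variant of the criterion (prediction i must correlate more with its own target than with each alternative), which may differ in detail from the cited papers' exact formulation.

```python
import numpy as np

def topk_correct(probs, true_idx, k):
    """True if the correct category is among the k highest-probability classes."""
    return true_idx in np.argsort(probs)[::-1][:k]

def pairwise_accuracy(pred, actual):
    """Fraction of ordered pairs (i, j), i != j, where prediction i correlates
    more with its own target than with target j; chance level is 50%."""
    n = len(pred)
    corr = np.corrcoef(np.vstack([pred, actual]))[:n, n:]  # corr[i, j] = r(pred_i, actual_j)
    wins, total = 0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                wins += corr[i, i] > corr[i, j]
                total += 1
    return wins / total
```

For a perfect regression model every prediction matches its own target best, giving a pairwise accuracy of 1.0; random predictions hover around 0.5.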
</sec>
<sec id="S2.SS6">
<title>2.6 Data augmentation</title>
<p>Based on the idea of the Mixup method (<xref ref-type="bibr" rid="B46">Zhang et al., 2018</xref>), we propose a data augmentation approach that linearly interpolates different subjects&#x2019; data corresponding to the same category to generate new fMRI data and target vectors. Mixup is a simple, data-independent augmentation method that constructs a new sample by linearly interpolating two random samples and their target vectors in the training set. Following this idea, we first apply PCA to the data of n-1 subjects to extract principal components from the gray matter fMRI data, and then linearly weight the principal component features of different subjects&#x2019; fMRI data corresponding to the same category to generate new subject data. Both the augmented and the original data are evaluated on the test sets.</p>
<disp-formula id="S2.E4">
<label>(4)</label>
<mml:math id="M4">
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>d</mml:mi>
</mml:mpadded>
<mml:mo rspace="10.8pt">=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x03BB;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:msub>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi mathvariant="normal">&#x03BB;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S2.E5">
<label>(5)</label>
<mml:math id="M5">
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>t</mml:mi>
</mml:mpadded>
<mml:mo rspace="10.8pt">=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x03BB;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:msub>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mi mathvariant="normal">&#x03BB;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Here, <italic>d</italic> represents the newly generated fMRI data, and <italic>d<sub>s<sub>i</sub></sub></italic> and <italic>d<sub>s<sub>j</sub></sub></italic> correspond to two samples randomly selected from a given category. <italic>t</italic> represents the newly generated target vector, and <italic>t<sub>s<sub>i</sub></sub></italic> and <italic>t<sub>s<sub>j</sub></sub></italic> are the target vectors of <italic>d<sub>s<sub>i</sub></sub></italic> and <italic>d<sub>s<sub>j</sub></sub></italic>, respectively. &#x03BB; represents the weight, a parameter that follows the Beta distribution, &#x03BB;&#x223C;<italic>Beta</italic>(&#x03B1;,&#x03B1;). Through training, the final value of &#x03BB; is selected to be 0.2.</p>
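Equations (4) and (5) amount to a single linear interpolation applied to both the data and the targets. The sketch below uses the fixed &#x03BB; = 0.2 reported above and placeholder dimensions (1,000 PCA components, 432-D action features); the real pipeline would pair samples from two different subjects within the same category.

```python
import numpy as np

def mixup_same_class(d_si, d_sj, t_si, t_sj, lam=0.2):
    """Eqs. (4)-(5): interpolate two same-category samples and their targets."""
    d = lam * d_si + (1 - lam) * d_sj   # new fMRI sample
    t = lam * t_si + (1 - lam) * t_sj   # new target vector
    return d, t

rng = np.random.default_rng(0)
d_si, d_sj = rng.normal(size=1000), rng.normal(size=1000)   # PCA-reduced fMRI (placeholder)
t_si, t_sj = rng.normal(size=432), rng.normal(size=432)     # action-feature targets (placeholder)
d_new, t_new = mixup_same_class(d_si, d_sj, t_si, t_sj)
```

Using the same &#x03BB; for data and target keeps each synthetic pair consistent, so the regression model sees new points drawn along the line between two real subjects' representations of the same action.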
</sec>
</sec>
<sec id="S3" sec-type="results">
<title>3 Results</title>
<sec id="S3.SS1">
<title>3.1 Cross-subject decoding</title>
<p>We first construct a baseline decoding framework without any additional modules: action features are extracted from the videos with X3D, an MLP model is built to predict the action features from fMRI signals, and finally the built-in transformation from the predicted features to the last (output) layer of X3D is used to estimate the classification probability. The final assessment is performed with the leave-one-subject-out approach. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the Top-5 predicted categories of some samples. The decoded categories are sorted in descending order of predicted probability, and correct categories are highlighted in red.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Top-5 estimated semantic categories of human action videos. The vertical axis lists the top-5 categories determined from fMRI activity, shown in order of descending probability from top to bottom. The horizontal axis shows the predicted probability of the estimated categories, and correct categories are highlighted in red. Images were taken from <xref ref-type="bibr" rid="B39">Tarhan and Konkle (2020)</xref> (Creative Commons License CC BY 4.0, <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-19-1526259-g003.tif"/>
</fig>
<p>As shown in <xref ref-type="table" rid="T2">Table 2</xref>, the baseline decoding accuracies of Top-1 and Top-5 are 11.00% (random level 2.56%) and 33.51% (random level 12.82%), respectively, both significantly exceeding the random level. These results show that the human brain has a rich representation space for action semantics. At the same time, consistent with the literature (<xref ref-type="bibr" rid="B15">G&#x00FC;&#x00E7;l&#x00FC; and van Gerven, 2017</xref>), brain activities are related to the representations extracted by a 3D deep learning model for action recognition. After adding the three different modules, the average Top-1 accuracy reaches 16.56% and the Top-5 accuracy reaches 43.13%. The reason for this improvement is that the decoding accuracy of each subject increases overall, and the generalization of the model to unseen subjects is enhanced by the three added modules (<xref ref-type="fig" rid="F4">Figure 4</xref>).</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Ablation study. Top-1 and Top-5 accuracy after incrementally adding the nonlocal attention mechanism, multi-task multilayer perceptron (MLP) and data augmentation modules to our baseline model.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Model</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Top-1 accuracy</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Top-5 accuracy</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Baseline</td>
<td valign="top" align="center">11.00%</td>
<td valign="top" align="center">33.51%</td>
</tr>
<tr>
<td valign="top" align="left">Baseline+Nonlocal</td>
<td valign="top" align="center">12.86%</td>
<td valign="top" align="center">38.64%</td>
</tr>
<tr>
<td valign="top" align="left">Baseline+Nonlocal+multi-task MLP</td>
<td valign="top" align="center">15.56%</td>
<td valign="top" align="center">41.81%</td>
</tr>
<tr>
<td valign="top" align="left">Baseline+Nonlocal+multi-task MLP+Data Augmentation</td>
<td valign="top" align="center">16.56%</td>
<td valign="top" align="center">43.13%</td>
</tr>
</tbody>
</table></table-wrap>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>The performance comparisons of Top-1 and Top-5 after gradually adding different modules to the baseline model. Each point represents a subject.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-19-1526259-g004.tif"/>
</fig>
</sec>
<sec id="S3.SS2">
<title>3.2 Comparing algorithmic choices</title>
<p>In this section, we compare the contributions of the nonlocal attention mechanism, the multi-task loss-constrained regression method, and data augmentation to the decoding model.</p>
<sec id="S3.SS2.SSS1">
<title>3.2.1 Nonlocal-X3D vs. X3D</title>
<p>To test whether long-range dependent action features are more helpful for decoding, we add a nonlocal attention mechanism to the X3D model and compare the decoding accuracy obtained with and without it. In addition, we compare the decoding effect of different deep learning models used to extract action features: the Inflated 3D Convolutional (I3D) network (<xref ref-type="bibr" rid="B5">Carreira and Zisserman, 2017</xref>) and the 3D Convolutional (C3D) network (<xref ref-type="bibr" rid="B40">Tran et al., 2015</xref>). <xref ref-type="bibr" rid="B15">G&#x00FC;&#x00E7;l&#x00FC; and van Gerven (2017)</xref> used the C3D model to extract DNN representations of movies and established a mapping between dorsal stream regions and these representations to identify different movie stimuli. I3D and X3D are more recent deep learning models for action recognition, with X3D achieving higher action recognition accuracy; however, C3D and I3D still lack the ability to model long-range dependencies.</p>
<p>We use the MLP model to construct regression mappings from fMRI to action features and obtain the corresponding semantic decoding results. As shown in <xref ref-type="table" rid="T3">Table 3</xref>, the Top-1 accuracy of the X3D model with a nonlocal attention mechanism is 12.86%, and the Top-5 accuracy is 38.64%; compared with the model without the nonlocal attention mechanism, these are increases of 1.86 and 5.13%, respectively. In addition, the Top-1 decoding accuracies of the C3D and I3D models are 8.51 and 6.09%, respectively, and the Top-5 accuracies are 30.66 and 22.69%, respectively. These results show that action features extracted by the X3D model yield higher decoding accuracy.</p>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>The impact of different models for extracting action features on the accuracy results of Top-1 and Top-5.</p></caption>
<table cellspacing="5" cellpadding="5" frame="box" rules="all">
<thead>
<tr>
<td valign="top" align="left" style="color:#ffffff;background-color: #7f8080;">Model for extracting action features</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Top-1 accuracy</td>
<td valign="top" align="center" style="color:#ffffff;background-color: #7f8080;">Top-5 accuracy</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Inflated 3D Convolutional network (I3D) (<xref ref-type="bibr" rid="B5">Carreira and Zisserman, 2017</xref>)</td>
<td valign="top" align="center">6.09%</td>
<td valign="top" align="center">22.69%</td>
</tr>
<tr>
<td valign="top" align="left">3D Convolutional network (C3D) (<xref ref-type="bibr" rid="B40">Tran et al., 2015</xref>)</td>
<td valign="top" align="center">8.51%</td>
<td valign="top" align="center">30.66%</td>
</tr>
<tr>
<td valign="top" align="left">Expanding three-dimensional network (X3D) (<xref ref-type="bibr" rid="B11">Feichtenhofer, 2020</xref>)</td>
<td valign="top" align="center">11.00%</td>
<td valign="top" align="center">33.51%</td>
</tr>
<tr>
<td valign="top" align="left">X3D+Nonlocal (ours)</td>
<td valign="top" align="center">12.86%</td>
<td valign="top" align="center">38.64%</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
<sec id="S3.SS2.SSS2">
<title>3.2.2 Comparison of regression mapping models</title>
<p>To place our model in the context of existing work, we compare it with three competing approaches used by previous decoding studies to construct fMRI-to-deep-learning-feature mappings (<xref ref-type="bibr" rid="B26">Matsuo et al., 2018</xref>; <xref ref-type="bibr" rid="B44">Wen et al., 2018</xref>): ridge regression, KNN, and MLP regression. The three approaches are implemented with the scikit-learn platform and applied to the action features extracted by X3D with a nonlocal attention mechanism, after which we compare the decoded results. In the experimental settings, the ridge regression regularization parameter alpha is set to 0.2, the n_neighbors parameter of KNN is set to 7, and the distance power parameter is set to 2. The nonlinear layers of the MLP adopt the ReLU activation function, and to prevent overfitting during regression mapping learning, the dropout rate is set to 0.4, the learning rate to 0.0001, and the batch size to 16. We also tried building deeper networks, but they overfit severely.</p>
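The scikit-learn baselines above can be sketched as follows. The synthetic arrays below merely stand in for PCA-reduced fMRI data and action features; only the hyperparameters (alpha = 0.2, n_neighbors = 7) come from the text.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 50))            # stand-in for PCA-reduced fMRI data
Y = X @ rng.normal(size=(50, 8))         # stand-in for multi-output action features

ridge = Ridge(alpha=0.2).fit(X, Y)       # regularization parameter from the text
knn = KNeighborsRegressor(n_neighbors=7).fit(X, Y)

Y_ridge = ridge.predict(X)               # both regressors handle multi-output targets
Y_knn = knn.predict(X)
```

Both estimators natively support multi-output regression, so a single fit maps each fMRI vector to the full action-feature vector, mirroring how the competing baselines were evaluated.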
<p>As shown in <xref ref-type="fig" rid="F5">Figure 5B</xref>, our model achieves a 4.30% Top-1 and 5.84% Top-5 improvement over ridge regression; a 10.58% Top-1 and 20.58% Top-5 improvement over KNN; and a 2.77% Top-1 and 3.74% Top-5 improvement over MLP. Moreover, with multi-task MLP regression mapping, the decoding accuracy of each subject is generally improved (<xref ref-type="fig" rid="F6">Figure 6</xref>). The pairwise classification accuracies corresponding to KNN, ridge regression, MLP, and multi-task MLP are 62.45, 78.74, 81.91, and 83.46%, respectively, showing that the multi-task MLP can more accurately predict action features from fMRI.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Comparison of the results of adding different modules. <bold>(A)</bold> Comparison of the accuracy results of Top-1 and Top-5 with and without data enhancement. <bold>(B)</bold> The impact of different regression models on the accuracy results of Top-1 and Top-5.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-19-1526259-g005.tif"/>
</fig>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>The performance comparisons of Top-1 and Top-5 using different regression models. Each dot represents the decoding performance of one subject.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-19-1526259-g006.tif"/>
</fig>
</sec>
<sec id="S3.SS2.SSS3">
<title>3.2.3 Data augmentation effect</title>
<p>Our data augmentation is based on the Mixup method: data from different subjects corresponding to the same category are linearly combined. This method not only increases the size of the dataset but also improves the generalization of the data. <xref ref-type="fig" rid="F5">Figure 5A</xref> shows that data augmentation improves the decoding results. Compared with the non-augmented dataset (Top-1: 15.56%, Top-5: 41.81%), the augmented dataset increases Top-1 and Top-5 accuracy by 1.00 and 1.32%, respectively.</p>
</sec>
</sec>
<sec id="S3.SS3">
<title>3.3 Cross-subject decoding based on reliable voxels</title>
<p>To analyze the brain&#x2019;s cognitive mechanism for action videos, we decode using voxels that can reliably distinguish different actions. We use the X3D model combined with the nonlocal attention mechanism to extract action features, and then predict the action features from reliable voxels to decode action semantics. The Top-1 and Top-5 accuracies of action semantic decoding based on within-set reliable voxels are 14.89 and 39.78%, respectively, and those based on cross-set reliable voxels are 14.10 and 40.42%, respectively. The pairwise classification results based on within-set and cross-set reliable voxels are 83.03 and 82.34%, respectively. Finally, the Top-1 and Top-5 accuracies based on whole-brain reliable voxels are 16.27 and 43.52%, respectively.</p>
<p>The experimental results in <xref ref-type="fig" rid="F7">Figure 7</xref> show that the decoding accuracy of each ROI is significantly higher than the random level. In particular, Brodmann areas 18 and 19 (higher visual areas) show better decoding accuracy. Moreover, the whole-brain reliable-voxel-based semantic decoding method achieves higher accuracy than any single ROI, which indicates that this method does not rely heavily on domain-specific knowledge.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption><p>The Top-1 and Top-5 accuracy of action semantic decoding based on different brain regions.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fninf-19-1526259-g007.tif"/>
</fig>
</sec>
</sec>
<sec id="S4" sec-type="discussion">
<title>4 Discussion</title>
<p>Here, we develop a framework for decoding human action semantics from fMRI data by establishing a correspondence between action features extracted from video data and fMRI data. The Top-1 and Top-5 decoding accuracies both significantly exceed chance levels, which shows the feasibility of extracting action-related semantic information from videos.</p>
<p>Compared to the models used in most research on action perception, our model can perform semantic decoding across participants, making it more general and transferable than traditional methods such as multivariate pattern analysis (MVPA). When extracting action features, our X3D-based model more accurately captures motion-selective receptive fields for action semantic decoding by extracting spatiotemporal video features rather than spatial features alone (<xref ref-type="bibr" rid="B29">Nishimoto et al., 2011</xref>; <xref ref-type="bibr" rid="B34">Rust et al., 2005</xref>). This differs from previous video decoding frameworks, which extracted spatial features of video frames with CNNs (<xref ref-type="bibr" rid="B44">Wen et al., 2018</xref>) and focused on object semantic decoding using features extracted by DNNs optimized for object recognition (<xref ref-type="bibr" rid="B18">Horikawa and Kamitani, 2017b</xref>; <xref ref-type="bibr" rid="B44">Wen et al., 2018</xref>). In addition, the Recurrent Neural Network (RNN) (<xref ref-type="bibr" rid="B27">Medsker and Jain, 2001</xref>) and the Long Short-Term Memory (LSTM) model (<xref ref-type="bibr" rid="B16">Hochreiter and Schmidhuber, 1997</xref>) can also process time-series data, and LSTMs capture longer-range dependencies than plain RNNs, so they could serve as alternatives to 3D CNNs for extracting action features. We further add a nonlocal attention mechanism to the DNN model to show that the extracted long-range dependent action features are more in line with the cognitive mechanisms of the human brain. Furthermore, a recent study proposed the Human-Centric Transformer (HCTransformer), which develops a decoupled human-centric learning paradigm to explicitly concentrate on human-centric action cues in domain-variant video feature learning (<xref ref-type="bibr" rid="B25">Lin et al., 2025</xref>). In future research, we will further utilize action features that are more aligned with human cognitive mechanisms to assist in decoding behavioral semantics from brain activity. When attempting to decode action semantics, the most important requirement is an accurate mapping model between fMRI voxels and action features, which allows more accurate decoding of action-related semantic information. Compared with KNN, ridge regression (<xref ref-type="bibr" rid="B44">Wen et al., 2018</xref>), and traditional MLP methods (<xref ref-type="bibr" rid="B26">Matsuo et al., 2018</xref>; <xref ref-type="bibr" rid="B31">Papadimitriou et al., 2019</xref>), MLP models based on multi-layer feature constraints can more accurately establish the mapping between fMRI and action features. Multi-layer feature constraints assist the learning of the fMRI-to-action-feature mapping, making its distribution closer to that of the action features.</p>
<p>In most cases, only a small amount of fMRI data can be acquired from a single subject. Building models from multiple subjects&#x2019; data and transferring them to test-subject data is key to improving the utility of cognitive neuroscience (<xref ref-type="bibr" rid="B13">Gabrieli et al., 2015</xref>). Our results indicate that a multi-subject decoding model based on a whole-brain common representation space can predict unseen individual subjects, which is of great significance. However, the common representation spaces of different brain regions make different contributions to semantic decoding. If decoding accuracy can be used to locate more precisely which brain areas contribute most to the multi-subject model and which regions extract abstract information related to action semantics, it will be possible to gain a deeper understanding of brain mechanisms.</p>
<p>To analyze the brain&#x2019;s cognitive mechanism for action videos, we conduct semantic decoding experiments using within-set reliable voxels, obtained by correlating the odd and even betas of each set, and cross-set reliable voxels, obtained by correlating the odd and even betas of GLMs computed on the two video sets. The Top-1, Top-5, and pairwise classification accuracies based on both within-set and cross-set reliable voxels are all significantly higher than random levels. The Top-1 and pairwise classification accuracies of within-set reliable-voxel decoding are slightly higher than those of cross-set reliable-voxel decoding, and the within-set reliable voxels have relatively higher coverage of the early visual cortex than the cross-set reliable voxels. This suggests that early visual regions are correlated with action semantic features. It has been shown that the layer depth of the optimal encoding layer of a deep learning model is positively correlated with position along V1, V2, and V3 in the early visual areas and MT in the dorsal stream (<xref ref-type="bibr" rid="B15">G&#x00FC;&#x00E7;l&#x00FC; and van Gerven, 2017</xref>). The decoding results based on different brain regions show that the visual cortex can effectively decode action semantic information, and the decoding accuracy of the higher visual cortex is higher than that of the primary visual cortex. Similar to the results of <xref ref-type="bibr" rid="B41">Urgen et al. (2019)</xref>, visual regions show good similarity to visual computational models, and higher visual areas are associated with higher-level semantic features than the primary visual cortex.</p>
<p>The lower visual cortex is mainly responsible for receiving and initially processing visual information, such as detecting lines, contours, and colors, while the higher visual cortex is responsible for more complex visual processing, such as object recognition and spatial cognition. Accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, and viewpoint variations, and primary visual features may be less capable of reflecting differences between action categories. Additionally, our approach differs in the granularity at which voxel-level models are fit. <xref ref-type="bibr" rid="B20">Huth et al. (2012)</xref> constructed a semantic space containing action categories, suggesting that a voxel&#x2019;s response could be fit by putting weights on over 1,000 predictors, including verbs like &#x201C;cooking,&#x201D; &#x201C;talking,&#x201D; and &#x201C;crawl&#x201D;; such category labels need to be manually annotated. Our approach uses features learned from natural images, which indicates that the features in the X3D-DNN are biologically relevant and capture information useful for perception. The features we used may contain more abstract information, and this level of representation may therefore be more appropriate for characterizing the response tuning of mid-to-high-level visual cortex. Future work analyzing the features extracted from each layer of the model can further explore their correspondence with brain regions.</p>
<p>In addition to improving the deep learning model, data augmentation can be used to improve decoding accuracy. Our data augmentation constructs new samples by interpolating the fMRI vectors of two random subjects viewing the same class, together with their corresponding target vectors. Intuitively, this extends the distribution of the training set by providing intermediate data samples for different target vectors, thereby making the network more robust at test time. Although data augmentation through linear interpolation improves decoding accuracy, the improvement is modest. Recently, deep learning models such as deep recurrent variational auto-encoders and generative adversarial networks (GANs) have been used to generate EEG or fMRI data with remarkable results (<xref ref-type="bibr" rid="B30">Panwar et al., 2020</xref>; <xref ref-type="bibr" rid="B32">Qiang et al., 2021</xref>). However, training such complex models on small samples remains challenging. Building on our simple interpolation-based method for increasing the sample size, future research will explore more complex and effective data augmentation models.</p>
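The interpolation step above can be sketched as a mixup-style combination of two same-class samples; the function name and the uniform sampling of the mixing coefficient are assumptions for illustration, not the exact scheme used in this study:

```python
import numpy as np

def interpolate_pair(fmri_a, fmri_b, target_a, target_b, lam=None, rng=None):
    """Create one synthetic training sample by linearly interpolating
    the fMRI vectors of two subjects viewing the same action class,
    together with their target (feature) vectors.

    lam: mixing coefficient in [0, 1]; drawn uniformly if not given
    (an assumed choice for this sketch).
    """
    rng = rng or np.random.default_rng()
    if lam is None:
        lam = rng.uniform(0.0, 1.0)
    x_new = lam * fmri_a + (1 - lam) * fmri_b
    y_new = lam * target_a + (1 - lam) * target_b
    return x_new, y_new
```

Because both the fMRI vector and its target are mixed with the same coefficient, the synthetic sample stays on the line segment between two valid input/target pairs of the same class.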
</sec>
<sec id="S5" sec-type="conclusion">
<title>5 Conclusion</title>
<p>In this paper, we explore the possibility of action semantic decoding from fMRI data. We construct a more extensible model based on action representations extracted by a three-dimensional DNN. Unlike previous models based on deep learning representations, which extract only spatial image features, our model uses a three-dimensional DNN to extract spatiotemporal dynamic features and relate them to fMRI. The model first extracts action features with the three-dimensional action recognition model X3D, and an MLP is then built to map fMRI signals to these action features, so as to decode the action semantics corresponding to brain activities. Because sufficient single-subject data are difficult to obtain, the model is trained on multi-subject data and tested on unseen subjects. The final results significantly exceed chance level. In addition, the decoding results are further improved by adding a non-local attention mechanism, an MLP model with a multi-task loss constraint, and data augmentation. Moreover, by examining the results of models trained with data from different cortical regions, our results suggest that semantic information about human actions is widespread across the entire visual cortex.</p>
</sec>
</body>
<back>
<sec id="S6" sec-type="data-availability">
<title>Data availability statement</title>
<p>The original contributions presented in this study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="S7" sec-type="ethics-statement">
<title>Ethics statement</title>
<p>Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.</p>
</sec>
<sec id="S8" sec-type="author-contributions">
<title>Author contributions</title>
<p>YZ: Writing &#x2013; review and editing, Writing &#x2013; original draft. MT: Writing &#x2013; original draft, Writing &#x2013; review and editing. BL: Conceptualization, Resources, Writing &#x2013; original draft, Writing &#x2013; review and editing.</p>
</sec>
<sec id="S9" sec-type="funding-information">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by the National Key Research and Development Program of China (No. 2024YFC3308300), the National Natural Science Foundation of China (No. U2133218), and the Fundamental Research Funds for the Central Universities of China (Nos. FRF-MP-19-007, FRF-TP-20-065A1Z, and FRF-TP-24-021A).</p>
</sec>
<sec id="S10" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="S11">
<title>Generative AI statement</title>
<p>The authors declare that no Generative AI was used in the creation of this manuscript.</p>
</sec>
<sec id="S12" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Affolter</surname> <given-names>N.</given-names></name> <name><surname>Egressy</surname> <given-names>B.</given-names></name> <name><surname>Pascual</surname> <given-names>D.</given-names></name> <name><surname>Wattenhofer</surname> <given-names>R.</given-names></name></person-group> (<year>2020</year>). &#x201C;<article-title>Brain2Word: Improving brain decoding methods and evaluation</article-title>,&#x201D; in <source><italic>Proceedings of the Medical Imaging Meets NeurIPS Workshop-34th Conference on Neural Information Processing Systems</italic></source>, (<publisher-name>NeurIPS</publisher-name>).</citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akamatsu</surname> <given-names>Y.</given-names></name> <name><surname>Harakawa</surname> <given-names>R.</given-names></name> <name><surname>Ogawa</surname> <given-names>T.</given-names></name> <name><surname>Haseyama</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). &#x201C;<article-title>Estimation of viewed image categories via CCA using human brain activity</article-title>,&#x201D; in <source><italic>Proceedings of the 2018 IEEE 7th Global Conference on Consumer Electronics (GCCE)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>202</fpage>&#x2013;<lpage>203</lpage>.</citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akamatsu</surname> <given-names>Y.</given-names></name> <name><surname>Harakawa</surname> <given-names>R.</given-names></name> <name><surname>Ogawa</surname> <given-names>T.</given-names></name> <name><surname>Haseyama</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). &#x201C;<article-title>Multi-view bayesian generative model for multi-subject fmri data on brain decoding of viewed image categories</article-title>,&#x201D; in <source><italic>Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1215</fpage>&#x2013;<lpage>1219</lpage>.</citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arnab</surname> <given-names>A.</given-names></name> <name><surname>Dehghani</surname> <given-names>M.</given-names></name> <name><surname>Heigold</surname> <given-names>G.</given-names></name> <name><surname>Sun</surname> <given-names>C.</given-names></name> <name><surname>Luki&#x00E6;</surname> <given-names>M.</given-names></name> <name><surname>Schmid</surname> <given-names>C.</given-names></name></person-group> (<year>2021</year>). &#x201C;<article-title>Vivit: A video vision transformer</article-title>,&#x201D; in <source><italic>Proceedings of the IEEE/CVF International Conference on Computer Vision</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6836</fpage>&#x2013;<lpage>6846</lpage>.</citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carreira</surname> <given-names>J.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). &#x201C;<article-title>Quo vadis, action recognition? A new model and the kinetics dataset</article-title>,&#x201D; in <source><italic>Proceedings of the International Conference on Computer Vision and Pattern Recognition</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>).</citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cichy</surname> <given-names>R. M.</given-names></name> <name><surname>Khosla</surname> <given-names>A.</given-names></name> <name><surname>Pantazis</surname> <given-names>D.</given-names></name> <name><surname>Torralba</surname> <given-names>A.</given-names></name> <name><surname>Oliva</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence.</article-title> <source><italic>Sci. Rep.</italic></source> <volume>6</volume> <fpage>1</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1038/srep27755</pub-id> <pub-id pub-id-type="pmid">27282108</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cortes</surname> <given-names>C.</given-names></name> <name><surname>Vapnik</surname> <given-names>V.</given-names></name></person-group> (<year>1995</year>). <article-title>Support-vector networks.</article-title> <source><italic>Mach. Learn.</italic></source> <volume>20</volume> <fpage>273</fpage>&#x2013;<lpage>297</lpage>.</citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dosovitskiy</surname> <given-names>A.</given-names></name> <name><surname>Beyer</surname> <given-names>L.</given-names></name> <name><surname>Kolesnikov</surname> <given-names>A.</given-names></name> <name><surname>Weissenborn</surname> <given-names>D.</given-names></name> <name><surname>Zhai</surname> <given-names>X.</given-names></name> <name><surname>Unterthiner</surname> <given-names>T.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>An image is worth 16x16 words: Transformers for image recognition at scale.</article-title> <source><italic>arXiv [Preprint]</italic></source> <pub-id pub-id-type="doi">10.48550/arXiv.2010.11929</pub-id></citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eickenberg</surname> <given-names>M.</given-names></name> <name><surname>Gramfort</surname> <given-names>A.</given-names></name> <name><surname>Varoquaux</surname> <given-names>G.</given-names></name> <name><surname>Thirion</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). <article-title>Seeing it all: Convolutional network layers map the function of the human visual system.</article-title> <source><italic>NeuroImage</italic></source> <volume>152</volume> <fpage>184</fpage>&#x2013;<lpage>194</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2016.10.001</pub-id> <pub-id pub-id-type="pmid">27777172</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Engel</surname> <given-names>S. A.</given-names></name> <name><surname>Glover</surname> <given-names>G. H.</given-names></name> <name><surname>Wandell</surname> <given-names>B. A.</given-names></name></person-group> (<year>1997</year>). <article-title>Retinotopic organization in human visual cortex and the spatial precision of functional MRI.</article-title> <source><italic>Cereb. Cortex</italic></source> <volume>7</volume> <fpage>181</fpage>&#x2013;<lpage>192</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/7.2.181</pub-id> <pub-id pub-id-type="pmid">9087826</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Feichtenhofer</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). &#x201C;<article-title>X3D: Expanding architectures for efficient video recognition</article-title>,&#x201D; in <source><italic>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>203</fpage>&#x2013;<lpage>213</lpage>.</citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fisher</surname> <given-names>R. A.</given-names></name></person-group> (<year>2012</year>). <article-title>The use of multiple measurements in taxonomic problems.</article-title> <source><italic>Ann. Hum. Genet.</italic></source> <volume>7</volume> <fpage>179</fpage>&#x2013;<lpage>188</lpage>.</citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gabrieli</surname> <given-names>J. D.</given-names></name> <name><surname>Ghosh</surname> <given-names>S. S.</given-names></name> <name><surname>Whitfield-Gabrieli</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Prediction as a humanitarian and pragmatic contribution from human cognitive neuroscience.</article-title> <source><italic>Neuron</italic></source> <volume>85</volume> <fpage>11</fpage>&#x2013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2014.10.047</pub-id> <pub-id pub-id-type="pmid">25569345</pub-id></citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x00FC;&#x00E7;l&#x00FC;</surname> <given-names>U.</given-names></name> <name><surname>van Gerven</surname> <given-names>M. A.</given-names></name></person-group> (<year>2015</year>). <article-title>Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream.</article-title> <source><italic>J. Neurosci.</italic></source> <volume>35</volume> <fpage>10005</fpage>&#x2013;<lpage>10014</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.5023-14.2015</pub-id> <pub-id pub-id-type="pmid">26157000</pub-id></citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x00FC;&#x00E7;l&#x00FC;</surname> <given-names>U.</given-names></name> <name><surname>van Gerven</surname> <given-names>M. A.</given-names></name></person-group> (<year>2017</year>). <article-title>Increasingly complex representations of natural movies across the dorsal stream are shared between subjects.</article-title> <source><italic>NeuroImage</italic></source> <volume>145</volume> <fpage>329</fpage>&#x2013;<lpage>336</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2015.12.036</pub-id> <pub-id pub-id-type="pmid">26724778</pub-id></citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hochreiter</surname> <given-names>S.</given-names></name> <name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>1997</year>). <article-title>Long short-term memory.</article-title> <source><italic>Neural Comput.</italic></source> <volume>9</volume> <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>.</citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Horikawa</surname> <given-names>T.</given-names></name> <name><surname>Kamitani</surname> <given-names>Y.</given-names></name></person-group> (<year>2017a</year>). <article-title>Generic decoding of seen and imagined objects using hierarchical visual features.</article-title> <source><italic>Nat. Commun.</italic></source> <volume>8</volume>:<issue>15037</issue>. <pub-id pub-id-type="doi">10.1038/ncomms15037</pub-id> <pub-id pub-id-type="pmid">28530228</pub-id></citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Horikawa</surname> <given-names>T.</given-names></name> <name><surname>Kamitani</surname> <given-names>Y.</given-names></name></person-group> (<year>2017b</year>). <article-title>Hierarchical neural representation of dreamed objects revealed by brain decoding with deep neural network features.</article-title> <source><italic>Front. Comput. Neurosci.</italic></source> <volume>11</volume>:<issue>4</issue>. <pub-id pub-id-type="doi">10.3389/fncom.2017.00004</pub-id> <pub-id pub-id-type="pmid">28197089</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huth</surname> <given-names>A. G.</given-names></name> <name><surname>Lee</surname> <given-names>T.</given-names></name> <name><surname>Nishimoto</surname> <given-names>S.</given-names></name> <name><surname>Bilenko</surname> <given-names>N. Y.</given-names></name> <name><surname>Vu</surname> <given-names>A. T.</given-names></name> <name><surname>Gallant</surname> <given-names>J. L.</given-names></name></person-group> (<year>2016</year>). <article-title>Decoding the semantic content of natural movies from human brain activity.</article-title> <source><italic>Front. Syst. Neurosci.</italic></source> <volume>10</volume>:<issue>81</issue>. <pub-id pub-id-type="doi">10.3389/fnsys.2016.00081</pub-id> <pub-id pub-id-type="pmid">27781035</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huth</surname> <given-names>A. G.</given-names></name> <name><surname>Nishimoto</surname> <given-names>S.</given-names></name> <name><surname>Vu</surname> <given-names>A. T.</given-names></name> <name><surname>Gallant</surname> <given-names>J. L.</given-names></name></person-group> (<year>2012</year>). <article-title>A continuous semantic space describes the representation of thousands of object and action categories across the human brain.</article-title> <source><italic>Neuron</italic></source> <volume>76</volume> <fpage>1210</fpage>&#x2013;<lpage>1224</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2012.10.014</pub-id> <pub-id pub-id-type="pmid">23259955</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kay</surname> <given-names>W.</given-names></name> <name><surname>Carreira</surname> <given-names>J.</given-names></name> <name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Hillier</surname> <given-names>C.</given-names></name> <name><surname>Vijayanarasimhan</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2017</year>). <article-title>The kinetics human action video dataset.</article-title> <source><italic>arXiv [Preprint]</italic></source> <pub-id pub-id-type="doi">10.48550/arXiv.1705.06950</pub-id></citation></ref>
<ref id="B22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kuehne</surname> <given-names>H.</given-names></name> <name><surname>Jhuang</surname> <given-names>H.</given-names></name> <name><surname>Garrote</surname> <given-names>E.</given-names></name> <name><surname>Poggio</surname> <given-names>T.</given-names></name> <name><surname>Serre</surname> <given-names>T.</given-names></name></person-group> (<year>2011</year>). &#x201C;<article-title>HMDB: A large video database for human motion recognition</article-title>,&#x201D; in <source><italic>Proceedings of the IEEE International Conference Computer Vision</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2556</fpage>&#x2013;<lpage>2563</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2019.2952088</pub-id> <pub-id pub-id-type="pmid">31725381</pub-id></citation></ref>
<ref id="B23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lai</surname> <given-names>D. K. H.</given-names></name> <name><surname>Cheng</surname> <given-names>E. S.</given-names></name> <name><surname>So</surname> <given-names>B. P.</given-names></name> <name><surname>Mao</surname> <given-names>Y. J.</given-names></name> <name><surname>Cheung</surname> <given-names>S. M.</given-names></name> <name><surname>Cheung</surname> <given-names>D. S.</given-names></name><etal/></person-group> (<year>2023</year>). <article-title>Transformer models and convolutional networks with different activation functions for swallow classification using depth video data.</article-title> <source><italic>Mathematics</italic></source> <volume>11</volume>:<issue>3081</issue>.</citation></ref>
<ref id="B24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>D.</given-names></name> <name><surname>Du</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>He</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>Multi-subject data augmentation for target subject semantic decoding with deep multi-view adversarial learning.</article-title> <source><italic>Inf. Sci.</italic></source> <volume>547</volume> <fpage>1025</fpage>&#x2013;<lpage>1044</lpage>.</citation></ref>
<ref id="B25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>K. Y.</given-names></name> <name><surname>Zhou</surname> <given-names>J.</given-names></name> <name><surname>Zheng</surname> <given-names>W. S.</given-names></name></person-group> (<year>2025</year>). <article-title>Human-centric transformer for domain adaptive action recognition.</article-title> <source><italic>IEEE Trans. Pattern Anal. Mach. Intell.</italic></source> <volume>47</volume> <fpage>679</fpage>&#x2013;<lpage>696</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2024.3429387</pub-id> <pub-id pub-id-type="pmid">39012755</pub-id></citation></ref>
<ref id="B26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Matsuo</surname> <given-names>E.</given-names></name> <name><surname>Kobayashi</surname> <given-names>I.</given-names></name> <name><surname>Nishimoto</surname> <given-names>S.</given-names></name> <name><surname>Nishida</surname> <given-names>S.</given-names></name> <name><surname>Asoh</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). &#x201C;<article-title>Describing semantic representations of brain activity evoked by visual stimuli</article-title>,&#x201D; in <source><italic>Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>576</fpage>&#x2013;<lpage>583</lpage>.</citation></ref>
<ref id="B27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Medsker</surname> <given-names>L. R.</given-names></name> <name><surname>Jain</surname> <given-names>L. C.</given-names></name></person-group> (<year>2001</year>). <article-title>Recurrent neural networks.</article-title> <source><italic>Design Appl.</italic></source> <volume>5</volume> <fpage>64</fpage>&#x2013;<lpage>67</lpage>.</citation></ref>
<ref id="B28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nishimoto</surname> <given-names>S.</given-names></name> <name><surname>Gallant</surname> <given-names>J. L.</given-names></name></person-group> (<year>2011</year>). <article-title>A three-dimensional spatiotemporal receptive field model explains responses of area MT neurons to naturalistic movies.</article-title> <source><italic>J. Neurosci.</italic></source> <volume>31</volume> <fpage>14551</fpage>&#x2013;<lpage>14564</lpage>. <pub-id pub-id-type="doi">10.1523/JNEUROSCI.6801-10.2011</pub-id> <pub-id pub-id-type="pmid">21994372</pub-id></citation></ref>
<ref id="B29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nishimoto</surname> <given-names>S.</given-names></name> <name><surname>Vu</surname> <given-names>A. T.</given-names></name> <name><surname>Naselaris</surname> <given-names>T.</given-names></name> <name><surname>Benjamini</surname> <given-names>Y.</given-names></name> <name><surname>Yu</surname> <given-names>B.</given-names></name> <name><surname>Gallant</surname> <given-names>J. L.</given-names></name></person-group> (<year>2011</year>). <article-title>Reconstructing visual experiences from brain activity evoked by natural movies.</article-title> <source><italic>Curr. Biol.</italic></source> <volume>21</volume> <fpage>1641</fpage>&#x2013;<lpage>1646</lpage>. <pub-id pub-id-type="doi">10.1016/j.cub.2011.08.031</pub-id> <pub-id pub-id-type="pmid">21945275</pub-id></citation></ref>
<ref id="B30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Panwar</surname> <given-names>S.</given-names></name> <name><surname>Rad</surname> <given-names>P.</given-names></name> <name><surname>Jung</surname> <given-names>T. P.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>Modeling EEG data distribution with a wasserstein generative adversarial network to predict rsvp events.</article-title> <source><italic>IEEE Trans. Neural Syst. Rehabil. Eng.</italic></source> <volume>28</volume> <fpage>1720</fpage>&#x2013;<lpage>1730</lpage>. <pub-id pub-id-type="doi">10.1109/TNSRE.2020.3006180</pub-id> <pub-id pub-id-type="pmid">32746311</pub-id></citation></ref>
<ref id="B31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Papadimitriou</surname> <given-names>A.</given-names></name> <name><surname>Passalis</surname> <given-names>N.</given-names></name> <name><surname>Tefas</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Visual representation decoding from human brain activity using machine learning: A baseline study.</article-title> <source><italic>Patt. Recognit. Lett.</italic></source> <volume>128</volume> <fpage>38</fpage>&#x2013;<lpage>44</lpage>.</citation></ref>
<ref id="B32"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qiang</surname> <given-names>N.</given-names></name> <name><surname>Dong</surname> <given-names>Q.</given-names></name> <name><surname>Liang</surname> <given-names>H.</given-names></name> <name><surname>Ge</surname> <given-names>B.</given-names></name> <name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>Y.</given-names></name><etal/></person-group> (<year>2021</year>). <article-title>Modeling and augmenting of fMRI data using deep recurrent variational auto-encoder.</article-title> <source><italic>J. Neural Eng</italic>.</source> <volume>18</volume>:<fpage>0460b6</fpage>. <pub-id pub-id-type="doi">10.1088/1741-2552/ac1179</pub-id> <pub-id pub-id-type="pmid">34229310</pub-id></citation></ref>
<ref id="B33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qiao</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Zeng</surname> <given-names>L.</given-names></name> <name><surname>Tong</surname> <given-names>L.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Accurate reconstruction of image stimuli from human functional magnetic resonance imaging based on the decoding model with capsule network architecture.</article-title> <source><italic>Front. Neuroinformatics</italic></source> <volume>12</volume>:<issue>62</issue>. <pub-id pub-id-type="doi">10.3389/fninf.2018.00062</pub-id> <pub-id pub-id-type="pmid">30294269</pub-id></citation></ref>
<ref id="B34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rust</surname> <given-names>N. C.</given-names></name> <name><surname>Schwartz</surname> <given-names>O.</given-names></name> <name><surname>Movshon</surname> <given-names>J. A.</given-names></name> <name><surname>Simoncelli</surname> <given-names>E. P.</given-names></name></person-group> (<year>2005</year>). <article-title>Spatiotemporal elements of macaque v1 receptive fields.</article-title> <source><italic>Neuron</italic></source> <volume>46</volume> <fpage>945</fpage>&#x2013;<lpage>956</lpage>.</citation></ref>
<ref id="B35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Soomro</surname> <given-names>K.</given-names></name> <name><surname>Zamir</surname> <given-names>A. R.</given-names></name> <name><surname>Shah</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild.</article-title> <source><italic>arXiv [Preprint]</italic></source> <pub-id pub-id-type="doi">10.48550/arXiv.1212.0402</pub-id></citation></ref>
<ref id="B36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stansbury</surname> <given-names>D. E.</given-names></name> <name><surname>Naselaris</surname> <given-names>T.</given-names></name> <name><surname>Gallant</surname> <given-names>J. L.</given-names></name></person-group> (<year>2013</year>). <article-title>Natural scene statistics account for the representation of scene categories in human visual cortex.</article-title> <source><italic>Neuron</italic></source> <volume>79</volume> <fpage>1025</fpage>&#x2013;<lpage>1034</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuron.2013.06.034</pub-id> <pub-id pub-id-type="pmid">23932491</pub-id></citation></ref>
<ref id="B37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Takada</surname> <given-names>S.</given-names></name> <name><surname>Togo</surname> <given-names>R.</given-names></name> <name><surname>Ogawa</surname> <given-names>T.</given-names></name> <name><surname>Haseyama</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). &#x201C;<article-title>Generation of viewed image captions from human brain activity via unsupervised text latent space</article-title>,&#x201D; in <source><italic>Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2521</fpage>&#x2013;<lpage>2525</lpage>.</citation></ref>
<ref id="B38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tarhan</surname> <given-names>L.</given-names></name> <name><surname>Konkle</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>Reliability-based voxel selection.</article-title> <source><italic>NeuroImage</italic></source> <volume>207</volume>:<issue>116350</issue>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2019.116350</pub-id> <pub-id pub-id-type="pmid">31733373</pub-id></citation></ref>
<ref id="B39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tarhan</surname> <given-names>L.</given-names></name> <name><surname>Konkle</surname> <given-names>T.</given-names></name></person-group> (<year>2020</year>). <article-title>Sociality and interaction envelope organize visual action representations.</article-title> <source><italic>Nat. Commun.</italic></source> <volume>11</volume>:<elocation-id>3002</elocation-id>. <pub-id pub-id-type="doi">10.1038/s41467-020-16846-w</pub-id> <pub-id pub-id-type="pmid">32532982</pub-id></citation></ref>
<ref id="B40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tran</surname> <given-names>D.</given-names></name> <name><surname>Bourdev</surname> <given-names>L.</given-names></name> <name><surname>Fergus</surname> <given-names>R.</given-names></name> <name><surname>Torresani</surname> <given-names>L.</given-names></name> <name><surname>Paluri</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). &#x201C;<article-title>Learning spatiotemporal features with 3D convolutional networks</article-title>,&#x201D; in <source><italic>Proceedings of the IEEE International Conference on Computer Vision (ICCV)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4489</fpage>&#x2013;<lpage>4497</lpage>.</citation></ref>
<ref id="B41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Urgen</surname> <given-names>B. A.</given-names></name> <name><surname>Pehlivan</surname> <given-names>S.</given-names></name> <name><surname>Saygin</surname> <given-names>A. P.</given-names></name></person-group> (<year>2019</year>). <article-title>Distinct representations in occipito-temporal, parietal, and premotor cortex during action perception revealed by fMRI and computational modeling.</article-title> <source><italic>Neuropsychologia</italic></source> <volume>127</volume> <fpage>35</fpage>&#x2013;<lpage>47</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuropsychologia.2019.02.006</pub-id> <pub-id pub-id-type="pmid">30772426</pub-id></citation></ref>
<ref id="B42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vodrahalli</surname> <given-names>K.</given-names></name> <name><surname>Ko</surname> <given-names>J.</given-names></name> <name><surname>Chiou</surname> <given-names>A.</given-names></name> <name><surname>Novoa</surname> <given-names>R.</given-names></name> <name><surname>Abid</surname> <given-names>A.</given-names></name> <name><surname>Phung</surname> <given-names>M.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>Mapping between fMRI responses to movies and their natural language annotations.</article-title> <source><italic>NeuroImage</italic></source> <volume>180</volume> <fpage>223</fpage>&#x2013;<lpage>231</lpage>. <pub-id pub-id-type="doi">10.1016/j.neuroimage.2017.06.042</pub-id> <pub-id pub-id-type="pmid">28648889</pub-id></citation></ref>
<ref id="B43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Gupta</surname> <given-names>A.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). &#x201C;<article-title>Non-local neural networks</article-title>,&#x201D; in <source><italic>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</italic></source>, (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7794</fpage>&#x2013;<lpage>7803</lpage>.</citation></ref>
<ref id="B44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wen</surname> <given-names>H.</given-names></name> <name><surname>Shi</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Lu</surname> <given-names>K. H.</given-names></name> <name><surname>Cao</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name></person-group> (<year>2018</year>). <article-title>Neural encoding and decoding with deep learning for dynamic natural vision.</article-title> <source><italic>Cereb. Cortex</italic></source> <volume>28</volume> <fpage>4136</fpage>&#x2013;<lpage>4160</lpage>. <pub-id pub-id-type="doi">10.1093/cercor/bhx268</pub-id></citation></ref>
<ref id="B45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamins</surname> <given-names>D. L.</given-names></name> <name><surname>Hong</surname> <given-names>H.</given-names></name> <name><surname>Cadieu</surname> <given-names>C. F.</given-names></name> <name><surname>Solomon</surname> <given-names>E. A.</given-names></name> <name><surname>Seibert</surname> <given-names>D.</given-names></name> <name><surname>DiCarlo</surname> <given-names>J. J.</given-names></name></person-group> (<year>2014</year>). <article-title>Performance-optimized hierarchical models predict neural responses in higher visual cortex.</article-title> <source><italic>Proc. Natl. Acad. Sci.</italic></source> <volume>111</volume> <fpage>8619</fpage>&#x2013;<lpage>8624</lpage>. <pub-id pub-id-type="doi">10.1073/pnas.1403112111</pub-id> <pub-id pub-id-type="pmid">24812127</pub-id></citation></ref>
<ref id="B46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Cisse</surname> <given-names>M.</given-names></name> <name><surname>Dauphin</surname> <given-names>Y. N.</given-names></name> <name><surname>Lopez-Paz</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>mixup: Beyond empirical risk minimization.</article-title> <source><italic>arXiv [Preprint]</italic></source> <pub-id pub-id-type="doi">10.48550/arXiv.1710.09412</pub-id></citation></ref>
</ref-list>
</back>
</article>