<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2025.1527908</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Transformer-based short-term traffic forecasting model considering traffic spatiotemporal correlation</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Chang</surname> <given-names>Ande</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Ji</surname> <given-names>Yuting</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Bie</surname> <given-names>Yiming</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="https://loop.frontiersin.org/people/1637262/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/funding-acquisition/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>College of Forensic Sciences, Criminal Investigation Police University of China</institution>, <addr-line>Shenyang</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Transportation, Jilin University</institution>, <addr-line>Changchun</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by" id="fn0001">
<p>Edited by: Liangliang Li, Beijing Institute of Technology, China</p>
</fn>
<fn fn-type="edited-by" id="fn0002">
<p>Reviewed by: Hasnain Iftikhar, Quaid-i-Azam University, Pakistan</p>
<p>Liu Yang, Tsinghua University, China</p>
</fn>
<corresp id="c001">&#x002A;Correspondence: Yiming Bie, <email>yimingbie@126.com</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>23</day>
<month>01</month>
<year>2025</year>
</pub-date>
<pub-date pub-type="collection">
<year>2025</year>
</pub-date>
<volume>19</volume>
<elocation-id>1527908</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>11</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>01</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2025 Chang, Ji and Bie.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Chang, Ji and Bie</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Traffic forecasting is crucial for a variety of applications, including route optimization, signal management, and travel time estimation. However, many existing prediction models struggle to accurately capture the spatiotemporal patterns in traffic data due to its inherent nonlinearity, high dimensionality, and complex dependencies. To address these challenges, a short-term traffic forecasting model, Trafficformer, is proposed based on the Transformer framework. The model first uses a multilayer perceptron to extract features from historical traffic data, then enhances spatial interactions through Transformer-based encoding. By incorporating road network topology, a spatial mask filters out noise and irrelevant interactions, improving prediction accuracy. Finally, traffic speed is predicted using another multilayer perceptron. In the experiments, Trafficformer is evaluated on the Seattle Loop Detector dataset and compared with six baseline methods, using Mean Absolute Error, Mean Absolute Percentage Error, and Root Mean Square Error as metrics. The results show that Trafficformer not only achieves higher prediction accuracy but also effectively identifies key sections, indicating its potential for intelligent traffic control optimization and refined traffic resource allocation.</p>
</abstract>
<kwd-group>
<kwd>intelligent transportation system</kwd>
<kwd>short-term traffic forecasting</kwd>
<kwd>Transformer</kwd>
<kwd>traffic spatiotemporal correlation</kwd>
<kwd>deep learning</kwd>
</kwd-group>
<contract-num rid="cn1">52220105001</contract-num>
<contract-num rid="cn1">52131203</contract-num>
<contract-num rid="cn1">72471102</contract-num>
<contract-num rid="cn2">20230508048RC</contract-num>
<contract-sponsor id="cn1">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor>
<contract-sponsor id="cn2">Science and Technology Department of Jilin Province</contract-sponsor>
<counts>
<fig-count count="8"/>
<table-count count="4"/>
<equation-count count="23"/>
<ref-count count="71"/>
<page-count count="16"/>
<word-count count="11468"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="sec1">
<label>1</label>
<title>Introduction</title>
<p>Traffic forecasting is a fundamental component of intelligent transportation systems (ITS). The primary goal of traffic forecasting is to identify key factors influencing traffic variation based on historical observations, develop prediction models, and forecast future traffic conditions (<xref ref-type="bibr" rid="ref65">Yu, 2021</xref>; <xref ref-type="bibr" rid="ref46">Rong et al., 2022</xref>). Traffic forecasting is typically categorized into short-term and long-term predictions, depending on the forecast horizon. This study focuses on short-term predictions, which generally aim to forecast traffic conditions within the next hour. Short-term forecasting is particularly significant in the real-world context of ITS for several reasons (<xref ref-type="bibr" rid="ref20">Ji et al., 2023</xref>; <xref ref-type="bibr" rid="ref24">Li et al., 2025</xref>). First, accurate short-term forecasts directly benefit travelers by providing more precise travel time estimates, which help individuals make informed decisions about their departure times and route choices. This can lead to more efficient traffic distribution and reduced overall travel time (<xref ref-type="bibr" rid="ref2">Bie et al., 2024</xref>; <xref ref-type="bibr" rid="ref35">Luo et al., 2024</xref>). Furthermore, for transportation operators, effective short-term forecasting enables the implementation of real-time management strategies, such as dynamic route guidance. This helps mitigate congestion before it reaches critical levels and reduces the risk of accidents (<xref ref-type="bibr" rid="ref52">Sun et al., 2018a</xref>,<xref ref-type="bibr" rid="ref53">b</xref>). However, short-term traffic forecasting also faces specific challenges, particularly due to the stochastic nature of traffic flow and the influence of external factors such as weather, accidents, and special events.</p>
<p>In pursuit of higher forecasting accuracy, many methods have been explored. These methods typically take historical traffic data as input, alone or combined with other data sources, and mine the characteristics of the traffic flow data to predict features such as traffic speed or traffic volume. They fall into two main categories: methods based on linear statistical theory and methods based on nonlinear theory. Methods based on linear statistical theory, such as historical mean prediction, time series prediction (<xref ref-type="bibr" rid="ref36">Ma et al., 2021</xref>; <xref ref-type="bibr" rid="ref16">Han, 2024</xref>), and Kalman filtering prediction (<xref ref-type="bibr" rid="ref41">Okutani and Stephanedes, 1984</xref>; <xref ref-type="bibr" rid="ref68">Zhang et al., 2023</xref>), are characterized by their simplicity, ease of implementation, and low computational cost for a single prediction. However, they usually fail to address the uncertainty and nonlinearity of traffic flow and therefore predict poorly in complex environments. Methods based on nonlinear theory mainly include wavelet analysis (<xref ref-type="bibr" rid="ref58">Wang and Shi, 2013</xref>; <xref ref-type="bibr" rid="ref8">Dong et al., 2021</xref>), chaos theory (<xref ref-type="bibr" rid="ref50">Shi et al., 2020</xref>), neural networks, and support vector regression (<xref ref-type="bibr" rid="ref43">Omar et al., 2024</xref>). Among these, wavelet analysis and chaos theory models can extract nonlinear characteristics and achieve relatively high accuracy, but due to their high complexity, research on traffic forecasting based on these methods is relatively limited (<xref ref-type="bibr" rid="ref67">Zhang et al., 2018</xref>). 
Neural network models and models based on support vector regression have rich parameters and strong fitting ability for complex nonlinear relationships, making them the mainstream prediction methods currently employed (<xref ref-type="bibr" rid="ref60">Wang et al., 2023</xref>; <xref ref-type="bibr" rid="ref62">Wang J. et al., 2024</xref>).</p>
<p>Early neural network (NN) models were essentially shallow networks that could not comprehensively extract the fundamental features of traffic data. Therefore, NN models with multiple hidden layers (MHL), such as the Multilayer Perceptron (MLP), have gradually been applied in traffic forecasting (<xref ref-type="bibr" rid="ref42">Oliveira et al., 2021</xref>). As model complexity increases, the network&#x2019;s ability to extract traffic features improves, but the model also requires more training samples and the time required for each training run increases. Due to computational limitations, early machine learning algorithms did not demonstrate significant advantages in traffic forecasting problems. In 2006, Hinton et al. published a seminal Deep Learning (DL) paper highlighting two key insights: deep neural networks with MHL excel at feature learning, providing a more fundamental data representation, and &#x201C;layer-wise pre-training&#x201D; effectively mitigates the challenges of training deep networks. The publication of this article sparked a wave of research in DL (<xref ref-type="bibr" rid="ref40">Nigam and Srivastava, 2023</xref>).</p>
<p>Recurrent Neural Networks (RNN) (<xref ref-type="bibr" rid="ref45">Pascanu, 2013</xref>), along with variants like Long Short-Term Memory (LSTM) (<xref ref-type="bibr" rid="ref49">Schmidhuber and Hochreiter, 1997</xref>) and Gated Recurrent Unit (GRU) (<xref ref-type="bibr" rid="ref64">Yang et al., 2022</xref>), are effective at handling sequential data and conducting complex transformations. These capabilities enable them to capture temporal dependencies in traffic flow, making them well suited to time series forecasting (<xref ref-type="bibr" rid="ref17">He et al., 2022</xref>). In addition, with the widespread use of surveillance equipment, convolutional neural network (CNN) models, which rely on image data, have been introduced into traffic forecasting (<xref ref-type="bibr" rid="ref44">Parishwad et al., 2023</xref>). With their multilayer convolutional structure, CNN models can effectively capture the spatial correlation characteristics of traffic flow (<xref ref-type="bibr" rid="ref39">Narmadha and Vijayakumar, 2023</xref>). Graph neural network (GNN) models (<xref ref-type="bibr" rid="ref48">Scarselli et al., 2008</xref>), which operate on graph-structured data, have also been applied to traffic forecasting. GNNs are good at modeling the relationships between different nodes in a traffic network, especially in capturing topological structures and interactions. They are suitable for scenarios where the spatial relationship between roads and intersections plays a vital role. Subsequently, Transformer-based models (<xref ref-type="bibr" rid="ref55">Vaswani et al., 2017</xref>) have gradually shown great potential in traffic forecasting problems. 
Compared with other traffic forecasting methods, the Transformer can attend to different positions of the input sequence simultaneously through its multi-head attention mechanism, thereby capturing long-distance dependencies and complex features in traffic data more comprehensively. In addition, its architecture allows parallel computation, greatly improving training efficiency. Compared with some CNN- or GNN-based methods, it has a clear speed advantage when processing large-scale traffic datasets and can adapt to dynamic changes in traffic conditions more quickly, providing a more efficient solution for real-time traffic forecasting (<xref ref-type="bibr" rid="ref9">Eleonora and Pinar, 2023</xref>; <xref ref-type="bibr" rid="ref5">Chen et al., 2024</xref>; <xref ref-type="bibr" rid="ref71">Zoican et al., 2024</xref>; <xref ref-type="bibr" rid="ref13">Guo B. et al., 2024</xref>; <xref ref-type="bibr" rid="ref14">Guo X. et al., 2024</xref>).</p>
<p>However, existing methods still have limitations. For example, traditional graph-based models may suffer from high computational complexity due to complex graph convolution operations and a strict dependence on road topology. Similarly, the Transformer&#x2019;s self-attention typically uses information from all nodes to compute attention weights, whereas a traffic network, composed of roads and intersections, has complex spatial relationships that cannot be captured by a simple linear sequence. As a result, unconstrained self-attention introduces unnecessary interactions and noise, limiting its ability to fully capture the network&#x2019;s spatial characteristics. Given the complexity of traffic flow and the limitations of existing methods, this study uses the historical traffic flow sequences and the road topology information of traffic nodes as the core input data. A DL framework based on the Transformer encoding module is constructed to predict future traffic speed at traffic nodes. Specifically, spatial masks based on spatial topology and travel time are designed. In this way, spatial information is introduced into the model, enhancing its ability to capture spatial relationships in complex urban traffic scenarios and improving traffic flow prediction accuracy. In addition, a streamlined MLP replaces the original, more complex decoding structure of the Transformer. This reduces the computational complexity and the number of network layers without compromising prediction accuracy. The main contributions of this work include:</p>
<list list-type="order">
<list-item>
<p>Using the road network topology to generate spatial masks, so that the model prioritizes spatially connected traffic nodes during feature interaction, which reduces unnecessary interactions and noise.</p>
</list-item>
<list-item>
<p>Introducing a Transformer-based traffic forecasting model that can effectively handle long-term dependencies in spatiotemporal traffic information and offers greater interpretability.</p>
</list-item>
<list-item>
<p>Conducting multiple sets of comparative experiments and ablation studies using a large-scale real road network dataset to assess the model&#x2019;s performance, accuracy, and its internal components.</p>
</list-item>
</list>
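<p>To illustrate the first contribution, the following minimal sketch shows how a topology-derived mask can restrict self-attention to spatially connected nodes. It is not the Trafficformer implementation: the function name <monospace>masked_self_attention</monospace>, the four-node corridor, the feature values, and the simplifications (a single head, no learned query, key, or value projections, and a purely boolean mask that ignores travel time) are all assumptions made for illustration.</p>
<preformat>
```python
import math


def masked_self_attention(features, mask):
    """Single-head self-attention over nodes, restricted by a boolean mask.

    features: list of I feature vectors, one per traffic node.
    mask: I x I list of booleans; mask[i][j] is True if node i may attend to node j.
    Assumes every row of the mask has at least one True entry (e.g. a self-connection).
    """
    dim = len(features[0])
    outputs, all_weights = [], []
    for i, query in enumerate(features):
        # Scaled dot-product scores; masked-out pairs are set to -inf.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            if mask[i][j] else -math.inf
            for j, key in enumerate(features)
        ]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the (unprojected) node features.
        outputs.append(
            [sum(w * value[c] for w, value in zip(weights, features)) for c in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 4-node corridor: each node sees itself and its direct neighbours.
mask = [[abs(i - j) in (0, 1) for j in range(4)] for i in range(4)]
features = [[0.1, 0.5], [0.2, 0.4], [0.9, 0.1], [0.8, 0.3]]
outputs, weights = masked_self_attention(features, mask)
```
</preformat>
<p>In this sketch, the attention weight between any two nodes that are not adjacent in the hypothetical corridor is exactly zero, so those nodes contribute nothing to each other&#x2019;s updated features, while each row of weights still sums to one over the permitted nodes.</p>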
<p>The remainder of the paper is structured as follows. &#x201C;Literature review&#x201D; covers DL-based traffic forecasting methods. &#x201C;Methodology&#x201D; introduces the DL framework established in this study. &#x201C;Experiments&#x201D; validates the proposed approach with real-world datasets. The research conclusions and prospects are presented in &#x201C;Conclusions.&#x201D;</p>
</sec>
<sec id="sec2">
<label>2</label>
<title>Literature review</title>
<p>As a core part of ITS, traffic flow prediction aims to anticipate traffic conditions, such as traffic speed and traffic volume, enabling authorities to take preemptive measures and travelers to plan better. However, traffic flow is complex and affected by various factors, and traditional prediction methods struggle to capture its dynamic nature. With the growth of computing power, machine learning, especially DL, has emerged as a leading solution (<xref ref-type="bibr" rid="ref70">Zhu et al., 2021</xref>; <xref ref-type="bibr" rid="ref38">Mohammadian et al., 2023</xref>; <xref ref-type="bibr" rid="ref7">Ding et al., 2024</xref>; <xref ref-type="bibr" rid="ref5">Chen et al., 2024</xref>; <xref ref-type="bibr" rid="ref56">Wang Q. et al., 2024</xref>). Different DL architectures offer unique strengths in handling traffic flow data. RNNs and their variants, like LSTM, are designed to handle sequential data, making them suitable for capturing temporal patterns in traffic flow. CNNs excel at extracting spatial features, which is vital for understanding the relationships between different traffic nodes (<xref ref-type="bibr" rid="ref29">Li et al., 2024a</xref>). The Transformer, with its attention mechanism, can model full dependencies and thus better handle long-range correlations in traffic. Hence, the following sections explore these three categories of DL-based traffic forecasting methods.</p>
<sec id="sec3">
<label>2.1</label>
<title>Traffic forecasting based on RNN</title>
<p>RNNs and their improved architectures are among the most widely used classes of NN in traffic forecasting. <xref ref-type="bibr" rid="ref54">Tian and Pan (2015)</xref> developed a recursive LSTM model that incorporates three multiplication units in the memory block, allowing for dynamic selection of the optimal time lag from historical input and leading to better prediction accuracy. <xref ref-type="bibr" rid="ref69">Zhao et al. (2017)</xref> constructed a two-dimensional LSTM network with multiple memory units for short-term traffic flow forecasting and compared it with other representative prediction models to verify its effectiveness. <xref ref-type="bibr" rid="ref66">Yu et al. (2017)</xref> constructed a hybrid deep model based on LSTM for traffic forecasting under extreme conditions and realized the joint simulation of traffic flow states under normal conditions and accident modes. <xref ref-type="bibr" rid="ref32">Liu et al. (2017)</xref> used a bidirectional RNN module to analyze historical traffic data at nodes, uncover periodic traffic flow patterns, and incorporate them into urban traffic forecasting. To address the prediction error caused by non-Gaussian noise, <xref ref-type="bibr" rid="ref10">Fang et al. (2023)</xref> reconfigured the loss function in LSTM based on the negative guidance mixed correlation entropy criterion and constructed a delta-free LSTM framework for short-term traffic flow prediction.</p>
</sec>
<sec id="sec4">
<label>2.2</label>
<title>Traffic forecasting based on CNN</title>
<p>Some researchers have applied CNNs to traffic forecasting tasks, using multilayer convolutional structures and their combined networks to extract the spatiotemporal correlation features of traffic flows. <xref ref-type="bibr" rid="ref37">Ma et al. (2022)</xref> built a feature selection algorithm based on combined CNN and GRU units, and combined forward and backward GRU networks to mine the long-distance dependencies in the input information and increase prediction accuracy. <xref ref-type="bibr" rid="ref59">Wang and Susanto (2023)</xref> used CNN to represent and process features such as traffic flow change patterns in different time periods in a manner similar to image features, so as to better exploit the information in time series data for traffic flow prediction. However, traditional CNN frameworks are better suited to data of uniform size and dimension, as typically found in Euclidean-structured data. In traffic networks, the road connections between traffic nodes may not be uniformly distributed, and the feature matrix dimensions of nodes may also vary. Therefore, the spatial characteristics learned by CNN may not be the optimal representation of the traffic network structure. The introduction of graph convolutional networks (GCN) (<xref ref-type="bibr" rid="ref22">Kipf and Welling, 2016</xref>) has brought breakthroughs in the application of CNN to non-Euclidean structured data (<xref ref-type="bibr" rid="ref11">Gong et al., 2023</xref>; <xref ref-type="bibr" rid="ref13">Guo B. et al., 2024</xref>; <xref ref-type="bibr" rid="ref14">Guo X. et al., 2024</xref>). 
By using the topological structure information of the graph to adjust the convolution operation, CNN can better adapt to the irregular data distribution and complex node relationships in the traffic network, thereby significantly improving its performance in tasks such as traffic forecasting (<xref ref-type="bibr" rid="ref27">Li et al., 2023</xref>).</p>
</sec>
<sec id="sec5">
<label>2.3</label>
<title>Traffic forecasting based on Transformer</title>
<p>The Transformer, a DL network architecture, was introduced by <xref ref-type="bibr" rid="ref55">Vaswani et al. (2017)</xref>. It models the full dependencies between inputs and outputs using attention mechanisms. Models and frameworks based on the Transformer can better handle long-range dependencies in traffic flow data and are relatively more flexible. Based on the overall architecture of the Transformer, <xref ref-type="bibr" rid="ref3">Cai et al. (2020)</xref> identified the continuous and periodic patterns in traffic time series, modeled the spatial dependence of the road network, and verified the model&#x2019;s effectiveness on two real datasets. <xref ref-type="bibr" rid="ref63">Yan et al. (2021)</xref> used a combined framework of a global encoder and a global&#x2013;local decoder to extract and fuse global and local traffic flow features and achieved high-precision prediction of urban traffic flow. <xref ref-type="bibr" rid="ref4">Chen et al. (2022)</xref> constructed a dual-directional spatiotemporal adaptive transformation framework based on an encoder-decoder structure to address the uneven spatiotemporal distribution in traffic prediction, and verified its effectiveness on four datasets. <xref ref-type="bibr" rid="ref61">Wang F. et al. (2024)</xref> proposed a comprehensive network based on Transformer and GCN to capture the complex spatiotemporal correlations in metropolitan area networks and achieve more accurate traffic forecasting. The attention distribution in the Transformer partly reveals how traffic flow at different traffic nodes is correlated in the spatial and temporal dimensions, improving the model&#x2019;s interpretability.</p>
<p><xref ref-type="table" rid="tab1">Table 1</xref> lists the basic models, input information, datasets, and other key information for some of these methods. As <xref ref-type="table" rid="tab1">Table 1</xref> shows, most early short-term traffic forecasting methods rely on time series data from a single type of detector, such as traffic volume collected by sensors. However, the information contained in a single data source is usually insufficient for accurate prediction. To this end, some studies have attempted to integrate multi-source information, exploit the strengths of various network structures, and build large-scale complex network architectures to mine complex spatiotemporal correlation patterns in traffic flow data. These methods have improved prediction accuracy to a certain extent. However, greater model complexity raises training cost and computing resource requirements, and ultimately affects the efficiency and scalability of practical applications (<xref ref-type="bibr" rid="ref34">Lu and Osorio, 2018</xref>; <xref ref-type="bibr" rid="ref21">Ji et al., 2022</xref>; <xref ref-type="bibr" rid="ref1">Berghaus et al., 2024</xref>). Therefore, how to build an efficient and accurate traffic forecasting model remains one of the key open problems in short-term traffic forecasting, and it is the research goal of this paper.</p>
<table-wrap position="float" id="tab1">
<label>Table 1</label>
<caption>
<p>Summary of research on short-term traffic forecasting.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">References</th>
<th align="left" valign="top">Basic model</th>
<th align="left" valign="top">Prediction target</th>
<th align="left" valign="top">Input</th>
<th align="left" valign="top">Dataset</th>
<th align="left" valign="top">Accuracy</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top">
<xref ref-type="bibr" rid="ref54">Tian and Pan (2015)</xref>
</td>
<td align="left" valign="top">LSTM</td>
<td align="left" valign="top">Volume</td>
<td align="left" valign="top">Volume</td>
<td align="left" valign="top">PeMS</td>
<td align="left" valign="top">MAPE&#x202F;=&#x202F;6.49%</td>
</tr>
<tr>
<td align="left" valign="top">
<xref ref-type="bibr" rid="ref69">Zhao et al. (2017)</xref>
</td>
<td align="left" valign="top">LSTM</td>
<td align="left" valign="top">Volume</td>
<td align="left" valign="top">Volume</td>
<td align="left" valign="top">Proprietary dataset</td>
<td align="left" valign="top">MRE&#x202F;=&#x202F;6.41%</td>
</tr>
<tr>
<td align="left" valign="top">
<xref ref-type="bibr" rid="ref66">Yu et al. (2017)</xref>
</td>
<td align="left" valign="top">LSTM</td>
<td align="left" valign="top">Speed</td>
<td align="left" valign="top">Speed and accident data</td>
<td align="left" valign="top">Proprietary dataset</td>
<td align="left" valign="top">MAPE&#x202F;=&#x202F;1.03%</td>
</tr>
<tr>
<td align="left" valign="top">
<xref ref-type="bibr" rid="ref32">Liu et al. (2017)</xref>
</td>
<td align="left" valign="top">LSTM<break/>CNN</td>
<td align="left" valign="top">Volume</td>
<td align="left" valign="top">Traffic network graph, speed and volume, &#x2026;</td>
<td align="left" valign="top">PeMS</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;4.41<break/>MAPE&#x202F;=&#x202F;6.99%<break/>RMSE&#x202F;=&#x202F;6.42</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">
<xref ref-type="bibr" rid="ref3">Cai et al. (2020)</xref>
</td>
<td align="left" valign="top" rowspan="2">GCN<break/>Transformer</td>
<td align="left" valign="top" rowspan="2">Speed</td>
<td align="left" valign="top" rowspan="2">Traffic network graph, speed and volume</td>
<td align="left" valign="top">METR-LA</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;2.43<break/>MAPE&#x202F;=&#x202F;4.73</td>
</tr>
<tr>
<td align="left" valign="top">PeMS</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;1.22<break/>MAPE&#x202F;=&#x202F;2.78</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="3">
<xref ref-type="bibr" rid="ref63">Yan et al. (2021)</xref>
</td>
<td align="left" valign="top" rowspan="3">Transformer</td>
<td align="left" valign="top" rowspan="3">Speed</td>
<td align="left" valign="top" rowspan="3">Speed, time of day, and day of the week&#x2026;</td>
<td align="left" valign="top">METR-LA</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;2.66<break/>MAPE&#x202F;=&#x202F;5.11%<break/>RMSE&#x202F;=&#x202F;6.75</td>
</tr>
<tr>
<td align="left" valign="top">Urban-BJ</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;4.34<break/>MAPE&#x202F;=&#x202F;6.40%<break/>RMSE&#x202F;=&#x202F;16.67</td>
</tr>
<tr>
<td align="left" valign="top">Ring-BJ</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;2.31<break/>MAPE&#x202F;=&#x202F;4.15%<break/>RMSE&#x202F;=&#x202F;6.08</td>
</tr>
<tr>
<td align="left" valign="top">
<xref ref-type="bibr" rid="ref37">Ma et al. (2022)</xref>
</td>
<td align="left" valign="top">CNN<break/>GRU</td>
<td align="left" valign="top">Speed</td>
<td align="left" valign="top">Speed</td>
<td align="left" valign="top">Proprietary dataset</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;3.48<break/>MAPE&#x202F;=&#x202F;8.60%<break/>RMSE&#x202F;=&#x202F;5.09</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="4">
<xref ref-type="bibr" rid="ref4">Chen et al. (2022)</xref>
</td>
<td align="left" valign="top" rowspan="4">DHM<break/>Transformer</td>
<td align="left" valign="top" rowspan="4">Speed</td>
<td align="left" valign="top" rowspan="4">Speed, volume, time of day, and day of the week&#x2026;</td>
<td align="left" valign="top">PeMSD3</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;15.30<break/>MAPE&#x202F;=&#x202F;15.46%<break/>RMSE&#x202F;=&#x202F;25.80</td>
</tr>
<tr>
<td align="left" valign="top">PeMSD4</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;18.53<break/>MAPE&#x202F;=&#x202F;12.37%<break/>RMSE&#x202F;=&#x202F;29.96</td>
</tr>
<tr>
<td align="left" valign="top">PeMSD7</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;20.28<break/>MAPE&#x202F;=&#x202F;8.50%<break/>RMSE&#x202F;=&#x202F;33.24</td>
</tr>
<tr>
<td align="left" valign="top">PeMSD8</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;13.58<break/>MAPE&#x202F;=&#x202F;9.21%<break/>RMSE&#x202F;=&#x202F;23.08</td>
</tr>
<tr>
<td align="left" valign="top">
<xref ref-type="bibr" rid="ref59">Wang and Susanto (2023)</xref>
</td>
<td align="left" valign="top">CNN<break/>LSTM</td>
<td align="left" valign="top">Volume</td>
<td align="left" valign="top">Traffic scene images, vehicle type, holidays, and weather</td>
<td align="left" valign="top">Proprietary dataset</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;16.50<break/>MSE&#x202F;=&#x202F;0.50<break/>RMSE&#x202F;=&#x202F;22.26</td>
</tr>
<tr>
<td align="left" valign="top">
<xref ref-type="bibr" rid="ref10">Fang et al. (2023)</xref>
</td>
<td align="left" valign="top">LSTM<break/>MCC</td>
<td align="left" valign="top">Volume</td>
<td align="left" valign="top">Volume</td>
<td align="left" valign="top">Amsterdam traffic dataset</td>
<td align="left" valign="top">MAPE&#x202F;=&#x202F;11.57%<break/>RMSE&#x202F;=&#x202F;280.87</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">
<xref ref-type="bibr" rid="ref11">Gong et al. (2023)</xref>
</td>
<td align="left" valign="top" rowspan="2">RGCN</td>
<td align="left" valign="top" rowspan="2">Volume</td>
<td align="left" valign="top" rowspan="2">Spatial knowledge graph and volume</td>
<td align="left" valign="top">Shanghai dataset</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;0.15<break/>RMSE&#x202F;=&#x202F;30.22</td>
</tr>
<tr>
<td align="left" valign="top">Nanjing dataset</td>
<td align="left" valign="top">MAE&#x202F;=&#x202F;0.19<break/>RMSE&#x202F;=&#x202F;0.28</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec sec-type="methods" id="sec6">
<label>3</label>
<title>Methodology</title>
<sec id="sec7">
<label>3.1</label>
<title>Structure of Trafficformer model</title>
<p>The Trafficformer model introduced in this paper is designed for short-term traffic speed prediction at road network nodes, where the nodes represent the locations of traffic sensors on the road network. <xref ref-type="fig" rid="fig1">Figure 1</xref> shows the structure of Trafficformer. The input of the model is the feature matrix <inline-formula>
<mml:math id="M1">
<mml:msub>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, consisting of the traffic speeds of <italic>I</italic> nodes over <italic>N</italic> consecutive time steps, and the spatial mask <inline-formula>
<mml:math id="M2">
<mml:msup>
<mml:mi mathvariant="bold">M</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, calculated from the node distances and the free flow speed. The feature matrix <inline-formula>
<mml:math id="M3">
<mml:msub>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:math>
</inline-formula> is fed into the traffic temporal feature extraction module, which outputs the matrix <inline-formula>
<mml:math id="M4">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> containing the traffic flow time series features. As <italic>a priori</italic> knowledge, <inline-formula>
<mml:math id="M5">
<mml:msup>
<mml:mi mathvariant="bold">M</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> guides the model to attend to the nodes that are more likely to affect each other spatially, so that it captures the key spatial relationships faster and predicts more accurately. With <inline-formula>
<mml:math id="M6">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math id="M7">
<mml:msup>
<mml:mi mathvariant="bold">M</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> as input, the feature interaction module extracts and embeds the spatial features and outputs the global feature matrix <inline-formula>
<mml:math id="M8">
<mml:msub>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> containing the spatiotemporal correlation of traffic flow. Finally, with <inline-formula>
<mml:math id="M9">
<mml:msub>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:math>
</inline-formula> as input, the speed prediction module produces the predicted speed matrix of all nodes. The three modules are described in detail below.</p>
<fig position="float" id="fig1">
<label>Figure 1</label>
<caption>
<p>Structure of Trafficformer model.</p>
</caption>
<graphic xlink:href="fnbot-19-1527908-g001.tif"/>
</fig>
</sec>
<sec id="sec8">
<label>3.2</label>
<title>Traffic node temporal feature extractor</title>
<p>The temporal feature extractor for traffic nodes consists primarily of a multilayer perceptron (MLP), a feedforward artificial neural network composed of multiple layers of nodes. Each layer is fully connected to the next, and all nodes except the input nodes are neurons with non-linear activation functions. The activation functions introduce non-linearity into the neuron outputs, enabling the MLP to handle non-linearly separable problems. The MLP is therefore suitable for extracting temporal features with high uncertainty and non-linear characteristics. In this paper, the temporal feature extraction module for traffic nodes is a two-layer perceptron. Its input is the feature matrix <inline-formula>
<mml:math id="M10">
<mml:msub>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:math>
</inline-formula>, composed of the historical traffic speeds of <italic>I</italic> nodes over <italic>N</italic> consecutive statistical intervals starting from time <italic>t</italic>. The feature matrix passes successively through a linear layer, a normalization layer, a non-linear layer, and a second linear layer, yielding a temporal feature matrix <inline-formula>
<mml:math id="M11">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> that contains temporal information for each node, as shown in <xref ref-type="disp-formula" rid="EQ1">Equations 1</xref><xref ref-type="disp-formula" rid="EQ2"/><xref ref-type="disp-formula" rid="EQ3"/>&#x2013;<xref ref-type="disp-formula" rid="EQ4">4</xref>.</p>
<disp-formula id="EQ1">
<label>(1)</label>
<mml:math id="M12">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M13">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> is the output of the first linear layer (<italic>H</italic> is the hidden-layer dimension of the temporal feature extractor of traffic nodes); <inline-formula>
<mml:math id="M14">
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math id="M15">
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mi>H</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> are the learnable weight matrix and bias vector, respectively.</p>
<p>To improve the accuracy of non-linear feature extraction and alleviate overfitting, a normalization layer and a non-linear layer are introduced after the first linear layer. The normalization layer employed in this module is LayerNorm (<xref ref-type="bibr" rid="ref23">Lei Ba et al., 2016</xref>). LayerNorm normalizes each data sample individually, without relying on the other samples in the batch, which avoids the stability issues that batch normalization suffers when mini-batch data are unevenly distributed. It also eliminates the need to store mini-batch means and variances, which saves storage space. For faster convergence, the non-linear layer uses the ReLU activation function.</p>
<disp-formula id="EQ2">
<label>(2)</label>
<mml:math id="M16">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi mathvariant="normal">Lay</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">LayerNorm</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mfenced>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ3">
<label>(3)</label>
<mml:math id="M17">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi mathvariant="normal">ReLU</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">ReLU</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi mathvariant="normal">Lay</mml:mi>
</mml:msubsup>
</mml:mfenced>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ4">
<label>(4)</label>
<mml:math id="M18">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi mathvariant="normal">ReLU</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M19">
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math id="M20">
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mi>H</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> are the learnable weight matrix and bias vector, respectively.</p>
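<p>As a rough illustration of Equations 1 to 4, the following Python sketch chains the four steps on plain lists. It is not the authors&#x2019; implementation: the sizes, the random weights, and the function names are hypothetical, LayerNorm&#x2019;s learnable scale and shift are omitted for brevity, and each bias is assumed to have one entry per hidden unit, as in a standard linear layer.</p>
<preformat>
```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(m, bias):
    """Add the same bias vector to every row of m (assumed convention)."""
    return [[x + b for x, b in zip(row, bias)] for row in m]


def layer_norm(m, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def relu(m):
    """Replace negative entries with zero."""
    return [[max(0.0, x) for x in row] for row in m]


def temporal_extractor(s_t, w1, b1, w2, b2):
    s_lin1 = add_bias(matmul(s_t, w1), b1)   # Eq. 1: I x N times N x H gives I x H
    s_lay = layer_norm(s_lin1)               # Eq. 2: per-node normalization
    s_relu = relu(s_lay)                     # Eq. 3: non-linearity
    return add_bias(matmul(s_relu, w2), b2)  # Eq. 4: I x H times H x H gives I x H


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
I, N, H = 4, 12, 8  # hypothetical numbers of nodes, time steps, and hidden units
s_t = [[rng.uniform(20.0, 70.0) for _ in range(N)] for _ in range(I)]  # made-up speeds
w1, b1 = random_matrix(N, H, rng), [0.0] * H
w2, b2 = random_matrix(H, H, rng), [0.0] * H
s_c1 = temporal_extractor(s_t, w1, b1, w2, b2)
print(len(s_c1), len(s_c1[0]))  # one row per node, one column per hidden unit
```
</preformat>
<p>Plain lists keep the sketch free of dependencies; a practical implementation would normally rely on a tensor library and trained weights.</p>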
</sec>
<sec id="sec9">
<label>3.3</label>
<title>Traffic node feature interaction</title>
<p>The traffic node temporal feature extractor yields the temporal features of each node, but the spatial features among the nodes remain unprocessed. Therefore, a traffic node feature interaction module, built from the encoder of the Transformer, follows the temporal feature extractor. The input of this module is <inline-formula>
<mml:math id="M21">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>, which encompasses the temporal features of all nodes, and the output is the global feature matrix <inline-formula>
<mml:math id="M22">
<mml:msub>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:math>
</inline-formula> that contains the spatiotemporal features of the nodes. The traffic node feature interaction module consists of <italic>L</italic> basic units, each of which mainly comprises a multi-head attention layer and a feedforward layer. The multi-head attention layer captures the complex spatial correlations and dependencies between nodes by computing attention weights for each node&#x2019;s features and generating new representations as weighted sums of other nodes&#x2019; features. The feedforward layer applies a non-linear transformation to the features obtained from the multi-head attention, mapping the input temporal feature matrix to the spatiotemporal feature output. This further refines and enriches the feature representation and strengthens the model&#x2019;s discriminative ability. The structures of the multi-head attention layer and the feedforward layer are described next.</p>
<sec id="sec10">
<label>3.3.1</label>
<title>Multi-head attention layer</title>
<p>The multi-head attention mechanism extends the self-attention mechanism by running multiple self-attention heads in parallel. This allows it to capture the dependency relationships within traffic node feature sequences from different perspectives, giving the traffic flow prediction model richer and more accurate feature representations. In each self-attention head, the model first derives the query, key, and value feature matrices from the nodes&#x2019; feature vectors. It then computes the attention weights between nodes from the query vector of a given node and the key vectors of the other nodes. Finally, it updates the node feature matrix using the value vectors of the other nodes and their respective attention weights. <xref ref-type="disp-formula" rid="EQ5">Equations 5</xref><xref ref-type="disp-formula" rid="EQ6"/><xref ref-type="disp-formula" rid="EQ7"/><xref ref-type="disp-formula" rid="EQ8"/>&#x2013;<xref ref-type="disp-formula" rid="EQ9">9</xref>, with the <italic>j</italic>-th (<inline-formula>
<mml:math id="M23">
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>J</mml:mi>
</mml:math>
</inline-formula>) self-attention head serving as a representative example, illustrate the update process of the feature matrix <inline-formula>
<mml:math id="M24">
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> at time <italic>t</italic>.</p>
<disp-formula id="EQ5">
<label>(5)</label>
<mml:math id="M25">
<mml:msubsup>
<mml:mi mathvariant="bold">Q</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">Q</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ6">
<label>(6)</label>
<mml:math id="M26">
<mml:msubsup>
<mml:mi mathvariant="bold">K</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">K</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ7">
<label>(7)</label>
<mml:math id="M27">
<mml:msubsup>
<mml:mi mathvariant="bold">V</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">V</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ8">
<label>(8)</label>
<mml:math id="M28">
<mml:msubsup>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">Q</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mfenced open="(" close=")">
<mml:msubsup>
<mml:mi mathvariant="bold">K</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mfenced>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ9">
<label>(9)</label>
<mml:math id="M29">
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">softmax</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mfrac>
<mml:msubsup>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:msqrt>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:msqrt>
</mml:mfrac>
</mml:mfenced>
<mml:msubsup>
<mml:mi mathvariant="bold">V</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M30">
<mml:mfenced close="]" open="[">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold">Q</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mspace width="0.66em"/>
<mml:msubsup>
<mml:mi mathvariant="bold">K</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mspace width="0.66em"/>
<mml:msubsup>
<mml:mi mathvariant="bold">V</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> are the query, key, and value feature matrices in the <italic>j</italic>-th self-attention head respectively; <inline-formula>
<mml:math id="M31">
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">Q</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mspace width="0.66em"/>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">K</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mspace width="0.66em"/>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">V</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> are the weight matrices, which can be updated during the training process; <inline-formula>
<mml:math id="M32">
<mml:msubsup>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> is the attention weight in the <italic>j</italic>-th self-attention head; <inline-formula>
<mml:math id="M33">
<mml:mi mathvariant="normal">softmax</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mo>&#x22C5;</mml:mo>
</mml:mfenced>
</mml:math>
</inline-formula> is a normalization function that exponentiates each score and divides it by the sum of the exponentiated scores in the same row, so that every element lies between 0 and 1 and each row sums to 1; <inline-formula>
<mml:math id="M34">
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:math>
</inline-formula> is a scaling factor that mitigates the vanishing-gradient problem introduced by the softmax function; it is numerically equal to the dimension <italic>H</italic> of the row vector <inline-formula>
<mml:math id="M35">
<mml:msubsup>
<mml:mi>k</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> of the node keys in the matrix <inline-formula>
<mml:math id="M36">
<mml:msubsup>
<mml:mi mathvariant="bold">K</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:math>
</inline-formula>.</p>
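<p>A single self-attention head, as described by Equations 5 to 9, can be sketched in a few lines of Python. This is an illustrative sketch only: the three-node input, the identity projection matrices, and the function names are hypothetical, and the softmax is applied row by row, which is the usual convention and is assumed here.</p>
<preformat>
```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply softmax to each row independently."""
    out = []
    for row in m:
        peak = max(row)  # subtracting the row maximum avoids overflow in exp
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(s_c1, w_q, w_k, w_v):
    q = matmul(s_c1, w_q)             # Eq. 5: queries, I x H
    k = matmul(s_c1, w_k)             # Eq. 6: keys, I x H
    v = matmul(s_c1, w_v)             # Eq. 7: values, I x H
    scores = matmul(q, transpose(k))  # Eq. 8: node-to-node scores, I x I
    d_k = len(k[0])                   # key dimension used for scaling
    scaled = [[x / math.sqrt(d_k) for x in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, v), weights  # Eq. 9: updated features, I x H


# Hypothetical input: I = 3 nodes, H = 2 features, identity projections.
s_c1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
z_j, weights = attention_head(s_c1, identity, identity, identity)
print(len(z_j), len(z_j[0]))              # same shape as the input features
print([round(sum(row), 6) for row in weights])  # each row of weights sums to 1
```
</preformat>
<p>Each output row is a convex combination of the value rows, so a node&#x2019;s updated features stay within the range spanned by the values of the nodes it attends to.</p>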
<p>In theory, the self-attention mechanism can incorporate the information of all nodes to generate a comprehensive feature matrix. In practice, however, especially for complex traffic networks with a large number of nodes, computing the attention weights over all nodes indiscriminately incurs heavy computational overhead and may introduce a considerable amount of noise and interference. Prior information is therefore used to construct a spatial mask <inline-formula>
<mml:math id="M37">
<mml:msup>
<mml:mi mathvariant="bold">M</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:msup>
</mml:math>
</inline-formula>. The mask lets the model ignore spatially less relevant nodes when calculating attention weights, which narrows the computational scope, reduces the impact of noise, and improves both training efficiency and model accuracy. Specifically, the travel time of a vehicle between each pair of nodes at the free flow speed <inline-formula>
<mml:math id="M38">
<mml:msup>
<mml:mi>V</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> is computed. Here, the free flow speed is the speed at which a vehicle travels under ideal, unimpeded traffic conditions. Then, taking into account the connectivity among the nodes in the road network, nodes whose travel time falls within the range of [0, <inline-formula>
<mml:math id="M39">
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mi mathvariant="normal">Limit</mml:mi>
</mml:msup>
</mml:math>
</inline-formula>] are designated as strongly correlated nodes, while those with a travel time exceeding <inline-formula>
<mml:math id="M40">
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mi mathvariant="normal">Limit</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> are classified as weakly correlated nodes. The mask elements corresponding to the strongly correlated nodes are assigned a value of 1, and those corresponding to the weakly correlated nodes are set to 0. This yields the spatial mask. <xref ref-type="disp-formula" rid="EQ10">Equations 10</xref>, <xref ref-type="disp-formula" rid="EQ11">11</xref> take node <italic>i</italic> and node <inline-formula>
<mml:math id="M41">
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:math>
</inline-formula> (<inline-formula>
<mml:math id="M42">
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mspace width="0.66em"/>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mfenced close="]" open="[">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mspace width="0.66em"/>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>I</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>;</mml:mo>
<mml:mspace width="0.66em"/>
<mml:mi>i</mml:mi>
<mml:mo>&#x2260;</mml:mo>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:math>
</inline-formula>) as examples to illustrate the calculation process of the spatial mask.</p>
<disp-formula id="EQ10">
<label>(10)</label>
<mml:math id="M43">
<mml:msup>
<mml:mi>m</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mfenced close="" open="{">
<mml:mtable columnalign="left" equalrows="true" equalcolumns="true">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mn>1</mml:mn>
<mml:mspace width="1.25em"/>
<mml:mi mathvariant="normal">if</mml:mi>
<mml:mspace width="1.91em"/>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2264;</mml:mo>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mi mathvariant="normal">Limit</mml:mi>
</mml:msup>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mn>0</mml:mn>
<mml:mspace width="1.25em"/>
<mml:mi mathvariant="normal">else</mml:mi>
<mml:mspace width="0.66em"/>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mo>&#x003E;</mml:mo>
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mi mathvariant="normal">Limit</mml:mi>
</mml:msup>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ11">
<label>(11)</label>
<mml:math id="M44">
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:msup>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mi>V</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
</mml:msup>
</mml:mfrac>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M45">
<mml:msup>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>i</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> is the actual distance between the two nodes, in miles.</p>
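<p>The mask construction can be illustrated with a short Python sketch. It follows the convention stated in the text, with 1 for strongly correlated node pairs and 0 for weakly correlated ones. The three-node distance table, the free flow speed, the time limit, their units, and the function name are hypothetical values chosen for illustration; treating each node as strongly correlated with itself is likewise an assumption.</p>
<preformat>
```python
def spatial_mask(distances, free_flow_speed, t_limit):
    """Return the I x I mask: 1 where the free flow travel time is within t_limit, else 0."""
    mask = []
    for row in distances:
        mask_row = []
        for dist in row:
            travel_time = dist / free_flow_speed                  # Eq. 11
            mask_row.append(1 if t_limit >= travel_time else 0)   # Eq. 10, convention of the text
        mask.append(mask_row)
    return mask


# Hypothetical example: distances in miles, speed in miles per hour, limit in hours.
distances = [
    [0.0, 2.0, 9.0],
    [2.0, 0.0, 5.0],
    [9.0, 5.0, 0.0],
]
mask = spatial_mask(distances, free_flow_speed=60.0, t_limit=0.1)
for row in mask:
    print(row)
# [1, 1, 0]
# [1, 1, 1]
# [0, 1, 1]
```
</preformat>
<p>In this example, nodes 1 and 3 are 9 miles apart, which takes 0.15 h at 60 mph and exceeds the 0.1 h limit, so their mutual mask entries are 0.</p>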
<p>With the spatial mask, the calculation of the attention weight <inline-formula>
<mml:math id="M46">
<mml:msubsup>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:math>
</inline-formula> is revised as <xref ref-type="disp-formula" rid="EQ12">Equation 12</xref>:</p>
<disp-formula id="EQ12">
<label>(12)</label>
<mml:math id="M47">
<mml:msubsup>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">Q</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mfenced open="(" close=")">
<mml:msubsup>
<mml:mi mathvariant="bold">K</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>j</mml:mi>
</mml:msubsup>
</mml:mfenced>
<mml:mi mathvariant="normal">T</mml:mi>
</mml:msup>
<mml:mo>&#x2297;</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">M</mml:mi>
<mml:mi mathvariant="normal">P</mml:mi>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M48">
<mml:mo>&#x2297;</mml:mo>
</mml:math>
</inline-formula> denotes elementwise multiplication of matrices.</p>
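<p>The effect of Equation 12 can be seen in a small Python sketch that multiplies a score matrix by a 0/1 mask element by element and then applies a row-wise softmax. The score values, the mask, and the function names are hypothetical, and scaling by the key dimension is left out to keep the example short.</p>
<preformat>
```python
import math


def mask_scores(scores, mask):
    """Elementwise product of the score matrix and the spatial mask (Eq. 12)."""
    return [[a * m for a, m in zip(s_row, m_row)] for s_row, m_row in zip(scores, mask)]


def softmax_rows(m):
    """Apply softmax to each row independently."""
    out = []
    for row in m:
        peak = max(row)  # subtracting the row maximum avoids overflow in exp
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Hypothetical scores for I = 3 nodes and a mask that hides the pair (1, 3).
scores = [
    [2.0, 1.0, 3.0],
    [0.5, 1.5, 0.2],
    [1.0, 0.3, 2.0],
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
masked = mask_scores(scores, mask)
weights = softmax_rows(masked)
print(masked[0])   # the score for the hidden pair becomes 0.0
print(weights[0])  # its softmax weight is reduced but still positive
```
</preformat>
<p>Because the softmax of a zero score is still positive, a multiplicative mask down-weights rather than fully removes a masked node whenever the unmasked scores in the row are positive. A common alternative, which is not described in this paper, replaces masked scores with a large negative value before the softmax.</p>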
<p>Once the feature matrix of each attention head has been computed, the global feature matrix <inline-formula>
<mml:math id="M49">
<mml:msub>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:math>
</inline-formula> of the multi-head attention mechanism is calculated according to <xref ref-type="disp-formula" rid="EQ13">Equation 13</xref>. The network structure of the multi-head attention mechanism is shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>.</p>
<disp-formula id="EQ13">
<label>(13)</label>
<mml:math id="M50">
<mml:msub>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">Concat</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mn>1</mml:mn>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mspace width="0.66em"/>
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>J</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:msubsup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi mathvariant="normal">O</mml:mi>
</mml:msubsup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<fig position="float" id="fig2">
<label>Figure 2</label>
<caption>
<p>Structure of multi-head attention.</p>
</caption>
<graphic xlink:href="fnbot-19-1527908-g002.tif"/>
</fig>
<p>where <inline-formula>
<mml:math id="M51">
<mml:mi mathvariant="normal">Concat</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mo>&#x22C5;</mml:mo>
</mml:mfenced>
</mml:math>
</inline-formula> represents the concatenation operation, which in this paper refers specifically to the horizontal concatenation of the feature matrices of the different attention heads; <inline-formula>
<mml:math id="M52">
<mml:msubsup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi mathvariant="normal">O</mml:mi>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>J</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> is a learnable weight matrix that weights the contributions of the different attention heads from a global perspective.</p>
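<p>The concatenation and output projection of Equation 13, together with the elementwise masking of the attention scores, can be illustrated with a minimal Python sketch. The toy sizes, the numerical values, and the helper names (matmul, hadamard, hconcat) are assumptions made only for illustration; the sketch omits any normalization of the scores and is not the implementation used in this study.</p>
<preformat><![CDATA[
```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def hadamard(a, b):
    """Elementwise product, used here to apply a 0/1 mask to attention scores."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hconcat(mats):
    """Horizontal concatenation: row i of every head is joined side by side."""
    return [[x for m in mats for x in m[i]] for i in range(len(mats[0]))]


# Toy sizes (assumed): I = 2 nodes, H = 2 features per head, J = 2 heads.
scores = [[1.0, 2.0], [3.0, 4.0]]   # stands in for Q K^T of one head
mask = [[1, 0], [1, 1]]             # stands in for the mask matrix M^P
masked_scores = hadamard(scores, mask)

z_heads = [
    [[1.0, 0.0], [0.0, 1.0]],       # Z_t^1
    [[2.0, 2.0], [3.0, 3.0]],       # Z_t^2
]
w_o = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, 0.5]]   # shape (H*J) x H
z_t = matmul(hconcat(z_heads), w_o)                      # Equation 13
print(masked_scores, z_t)
```
]]></preformat>
<p>In this toy case the concatenated matrix has shape 2 &#x00D7; 4, and the projection maps it back to 2 &#x00D7; 2, so the output has the same width as a single head.</p>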
</sec>
<sec id="sec11">
<label>3.3.2</label>
<title>Feedforward networks</title>
<p>The feedforward network is a two-layer MLP structure. Unlike the normalization operation embedded within the traffic node&#x2019;s temporal feature extraction component, the normalization operation in the feature interaction component is implemented separately by an external module. Therefore, the feedforward network consists only of fully connected layers and non-linear activation functions, as shown in <xref ref-type="disp-formula" rid="EQ14">Equation 14</xref>:</p>
<disp-formula id="EQ14">
<label>(14)</label>
<mml:math id="M53">
<mml:msub>
<mml:mi mathvariant="bold">F</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">ReLU</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M54">
<mml:mfenced close="]" open="[" separators=",">
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mfenced>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math id="M55">
<mml:mfenced close="]" open="[">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mspace width="0.66em"/>
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">F</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mi>I</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> are the learnable weight matrices and bias vectors, respectively.</p>
<p>To build a deep model that effectively captures the complex spatiotemporal features in traffic flow data, the Transformer employs a residual connection around each module, followed by layer normalization, as shown in <xref ref-type="disp-formula" rid="EQ15">Equations 15</xref> and <xref ref-type="disp-formula" rid="EQ16">16</xref>. In summary, the basic unit of the traffic node interaction module can be abstracted as the following equations, and its structure is shown in <xref ref-type="fig" rid="fig3">Figure 3</xref>.</p>
<disp-formula id="EQ15">
<label>(15)</label>
<mml:math id="M56">
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">LayerNorm</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ16">
<label>(16)</label>
<mml:math id="M57">
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">LayerNorm</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">F</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mfenced>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<fig position="float" id="fig3">
<label>Figure 3</label>
<caption>
<p>Feature interaction module structure of traffic nodes.</p>
</caption>
<graphic xlink:href="fnbot-19-1527908-g003.tif"/>
</fig>
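<p>Equations 14 to 16 can be read as a short pipeline, sketched below in plain Python under stated assumptions: the biases are broadcast over rows with one entry per column, the layer normalization has no learnable gain or shift and uses an assumed epsilon of 1E-5, and the attention output is passed in directly rather than computed. All function names are hypothetical. Following Equation 14 as written, the feedforward network takes the attention output as its input.</p>
<preformat><![CDATA[
```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a, b):
    """Elementwise sum of two equally shaped matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add_bias(m, bias):
    """Add one bias entry per column to every row (assumed broadcasting)."""
    return [[x + b for x, b in zip(row, bias)] for row in m]


def layer_norm(m, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable parameters)."""
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def feed_forward(z, w1, b1, w2, b2):
    """Equation 14: ReLU(Z W1 + b1) W2 + b2."""
    hidden = [[max(0.0, x) for x in row] for row in add_bias(matmul(z, w1), b1)]
    return add_bias(matmul(hidden, w2), b2)


def interaction_unit(s_c1, z, ffn_params):
    """Equations 15 and 16, with the attention output Z_t passed in as z."""
    z_c1 = layer_norm(add(z, s_c1))        # Equation 15
    f = feed_forward(z, *ffn_params)       # Equation 14, applied to Z_t as written
    return layer_norm(add(f, z_c1))        # Equation 16


identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.0, 0.0], identity, [0.0, 0.0])
z_c2 = interaction_unit([[1.0, 3.0]], [[1.0, 3.0]], params)
print(z_c2)
```
]]></preformat>
<p>With identity weights and zero biases, the single toy row is mapped to values close to &#x2212;1 and 1, which shows that each row leaves the unit with zero mean and approximately unit variance.</p>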
</sec>
</sec>
<sec id="sec12">
<label>3.4</label>
<title>Traffic node speed forecasting</title>
<p>The traffic node speed forecasting module also follows an MLP structure identical to that of the traffic node temporal feature extraction module. Both modules consist of two linear layers, one normalization layer, and one non-linear layer; they differ only in the input, output, and hidden layer dimensions of the network. The input of the traffic node speed forecasting module is the fused interaction feature matrix <inline-formula>
<mml:math id="M58">
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> that captures the spatiotemporal correlations in the road network, while the output is the traffic speed matrix <inline-formula>
<mml:math id="M59">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mi>I</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> for each node on the road network at time step <inline-formula>
<mml:math id="M60">
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:math>
</inline-formula>, as shown in <xref ref-type="disp-formula" rid="EQ17">Equations 17</xref><xref ref-type="disp-formula" rid="EQ18"/><xref ref-type="disp-formula" rid="EQ19"/>&#x2013;<xref ref-type="disp-formula" rid="EQ20">20</xref>. The MLP is structurally simple and highly parallelizable, which makes it computationally efficient for large-scale traffic forecasting tasks. This is why MLPs are used repeatedly in this study for processing traffic node features.</p>
<disp-formula id="EQ17">
<label>(17)</label>
<mml:math id="M61">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">Z</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ18">
<label>(18)</label>
<mml:math id="M62">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">Lay</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">LayerNorm</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mfenced>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ19">
<label>(19)</label>
<mml:math id="M63">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">ReLU</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="normal">ReLU</mml:mi>
<mml:mfenced open="(" close=")">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">Lay</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mfenced>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ20">
<label>(20)</label>
<mml:math id="M64">
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi>t</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">ReLU</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M65">
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:msup>
<mml:mi>H</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math id="M66">
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mi>I</mml:mi>
</mml:msup>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math id="M67">
<mml:msup>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mi>H</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
<mml:mo>&#x00D7;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math id="M68">
<mml:msup>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">L</mml:mi>
<mml:mi mathvariant="normal">i</mml:mi>
<mml:mi mathvariant="normal">n</mml:mi>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211D;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msup>
</mml:math>
</inline-formula> are the learnable weight matrices and biases; <inline-formula>
<mml:math id="M69">
<mml:msup>
<mml:mi>H</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:math>
</inline-formula> denotes the hidden layer dimension of the traffic node speed forecasting module.</p>
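<p>As a rough illustration of Equations 17 to 20, the following Python sketch chains a linear layer, layer normalization, ReLU, and a second linear layer that emits one speed value per node. The toy dimensions, the weights, the epsilon of 1E-5, the row-wise bias broadcasting, and the function names (linear, layer_norm, speed_head) are assumptions for illustration only.</p>
<preformat><![CDATA[
```python
import math


def linear(m, w, bias):
    """Row-wise affine map m W + bias, with bias broadcast over rows (assumed)."""
    cols = list(zip(*w))
    return [[sum(x * y for x, y in zip(row, col)) + b for col, b in zip(cols, bias)]
            for row in m]


def layer_norm(m, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable parameters)."""
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def speed_head(z_c2, w1, b1, w2, b2):
    """Map the I x H interaction features to one predicted speed per node."""
    lin1 = linear(z_c2, w1, b1)                          # Equation 17
    lay = layer_norm(lin1)                               # Equation 18
    act = [[max(0.0, x) for x in row] for row in lay]    # Equation 19
    out = linear(act, w2, b2)                            # Equation 20, shape I x 1
    return [row[0] for row in out]


# Toy sizes (assumed): I = 2 nodes, H = 2 input features, H* = 2 hidden units.
z_c2 = [[1.0, 3.0], [2.0, 2.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0], [1.0]], [0.5]
speeds = speed_head(z_c2, w1, b1, w2, b2)
print(speeds)
```
]]></preformat>
<p>The second node has identical features, so its normalized activations are zero and its output reduces to the bias of the last layer.</p>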
</sec>
</sec>
<sec id="sec13">
<label>4</label>
<title>Experiments</title>
<sec id="sec14">
<label>4.1</label>
<title>Dataset description</title>
<p>In this study, the method was evaluated on the publicly available Seattle Inductive Loop Detector Dataset V1 (hereafter referred to as the Loop dataset). This dataset consists of speed information collected from loop detectors deployed on four highways in the Seattle area: I-5, I-405, I-90, and SR-520. Each blue icon in <xref ref-type="fig" rid="fig4">Figure 4</xref> represents a milepost on the road network, with a total of 323 mileposts. For any given milepost, the speed is obtained by averaging the data from multiple detectors in the corresponding main road direction. The dataset is available at the following link: <ext-link xlink:href="https://github.com/zhiyongc/Seattle-Loop-Data" ext-link-type="uri">https://github.com/zhiyongc/Seattle-Loop-Data</ext-link>.</p>
<fig position="float" id="fig4">
<label>Figure 4</label>
<caption>
<p>Seattle freeway satellite map (<ext-link xlink:href="https://github.com/zhiyongc/Seattle-Loop-Data" ext-link-type="uri">https://github.com/zhiyongc/Seattle-Loop-Data</ext-link>).</p>
</caption>
<graphic xlink:href="fnbot-19-1527908-g004.tif"/>
</fig>
<p>The dataset contains the complete spatiotemporal speed information for the highway system in 2015, with a time interval of 5&#x202F;min for each detector, and comprises over 3.83 million records. For algorithmic consistency, the model program was implemented on the basis of the open-source code from a previous study (<xref ref-type="bibr" rid="ref6">Cui et al., 2019</xref>). Several comparative experiments were performed using the identical dataset. The dataset was partitioned into a training set, a validation set, and a test set in a 7:2:1 proportion. The training set was used for model training, the validation set for fine-tuning and optimizing the parameters, and the test set for evaluating the generalization performance of the model. Additionally, the road speed limit was set to 60&#x202F;miles per hour, so <inline-formula>
<mml:math id="M70">
<mml:msup>
<mml:mi>V</mml:mi>
<mml:mi mathvariant="normal">F</mml:mi>
</mml:msup>
<mml:mo>=</mml:mo>
<mml:mn>60</mml:mn>
</mml:math>
</inline-formula> mph is obtained. In the preprocessing stage, each speed value in the speed matrix is divided by the maximum speed value in the dataset to normalize the speed data to the [0, 1] interval. This normalization unifies the data scale, improves the efficiency and stability of model training, and prevents the model from over-weighting certain features because of differences in data scale.</p>
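<p>The preprocessing described above can be sketched in a few lines of Python. The function names are hypothetical, the toy speeds are invented, and the split is assumed to follow the original sample order, which is not stated in the text.</p>
<preformat><![CDATA[
```python
def normalize_by_max(speeds):
    """Divide every entry by the largest speed so all values fall in [0, 1]."""
    peak = max(max(row) for row in speeds)
    return [[v / peak for v in row] for row in speeds], peak


def split_7_2_1(samples):
    """Split samples into training, validation, and test parts in a 7:2:1 ratio.

    Integer arithmetic avoids floating-point rounding of the cut points.
    Keeping the original order is an assumption of this sketch.
    """
    n = len(samples)
    n_train = n * 7 // 10
    n_val = n * 2 // 10
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])


# Toy speed matrix in mph (rows: time steps, columns: nodes); values are invented.
scaled, peak = normalize_by_max([[30.0, 60.0], [45.0, 15.0]])
train, val, test = split_7_2_1(list(range(10)))
print(scaled, peak, train, val, test)
```
]]></preformat>
<p>Keeping the returned maximum makes it possible to convert predictions back to mph by multiplying by the same value.</p>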
</sec>
<sec id="sec15">
<label>4.2</label>
<title>Experimental settings</title>
<sec id="sec16">
<label>4.2.1</label>
<title>Baselines</title>
<p>In this paper, the Trafficformer model is compared with several established baseline models. The baselines were selected to represent a diverse range of techniques in the traffic flow prediction field. They include classic linear methods such as ARIMA and SVR, which have well-established theoretical foundations but also certain limitations, and various nonlinear models such as DiffGRU, LSTM, DMLP, LSTM+MLP, and TGG-LSTM. Comparison with these models allows a thorough analysis of their performance and highlights the distinct advantages of Trafficformer in different traffic forecasting scenarios.</p>
<list list-type="order">
<list-item>
<p>SVR: Support Vector Regression model (<xref ref-type="bibr" rid="ref51">Smola and Sch&#x00F6;lkopf, 2004</xref>).</p>
</list-item>
<list-item>
<p>LSTM: Long Short-Term Memory network (<xref ref-type="bibr" rid="ref49">Schmidhuber and Hochreiter, 1997</xref>).</p>
</list-item>
<list-item>
<p>ARIMA: Autoregressive Integrated Moving Average model (<xref ref-type="bibr" rid="ref15">Hamed et al., 1995</xref>).</p>
</list-item>
<list-item>
<p>DiffGRU: An improved model based on a convolutional RNN. The spatial dependencies between traffic nodes are captured using diffusion convolution, and the temporal dependencies are captured using an encoder-decoder architecture with scheduled sampling (<xref ref-type="bibr" rid="ref30">Li et al., 2017</xref>).</p>
</list-item>
<list-item>
<p>TGG-LSTM: A DL model based on LSTM, which models the spatial correlations between different traffic nodes using graph convolution and uses LSTM to mine the historical information of traffic flow (<xref ref-type="bibr" rid="ref6">Cui et al., 2019</xref>).</p>
</list-item>
<list-item>
<p>DMLP: A network model consisting of two two-layer perceptrons, one responsible for traffic feature extraction and the other for prediction (<xref ref-type="bibr" rid="ref57">Wang Z. et al., 2024</xref>).</p>
</list-item>
<list-item>
<p>LSTM + MLP: A comparison model derived from LSTM, intended to highlight the significance of designing traffic flow feature extraction and prediction as separate modules. It consists of a single LSTM layer for extracting traffic feature states and a two-layer perceptron for predicting traffic speed.</p>
</list-item>
</list>
</sec>
<sec id="sec17">
<label>4.2.2</label>
<title>Training parameters</title>
<p>All LSTM and MLP layers have the same weight dimensions, with a hidden layer size of 128. The input traffic flow data consists of the historical traffic flow speeds of 323 nodes over a continuous sequence of 10 statistical intervals starting from time <italic>t</italic>, denoted as <inline-formula>
<mml:math id="M71">
<mml:mi>N</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>10</mml:mn>
</mml:math>
</inline-formula>. The predicted time step is 1. The size of the node connectivity constraint indicator <inline-formula>
<mml:math id="M72">
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mi mathvariant="normal">Limit</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> can be adjusted to observe the effects of feature extraction and interaction within different spatial ranges. Through multiple experiments, the value of <inline-formula>
<mml:math id="M73">
<mml:msup>
<mml:mi>T</mml:mi>
<mml:mi mathvariant="normal">Limit</mml:mi>
</mml:msup>
</mml:math>
</inline-formula> was set to 5. This means that each traffic node interacts with the other traffic nodes that can be reached from it within 5&#x202F;min at free-flow speed. Each model is trained to minimize the MSE, a commonly used measure of the disparity between the predicted and actual values. Optimization is carried out with the AdamW optimizer proposed by <xref ref-type="bibr" rid="ref33">Loshchilov (2017)</xref>, a variant of Adam that applies weight decay separately from the gradient-based update. The decay shrinks the model weights at every step, which mitigates the risk of overfitting. For the learning rate, the ReduceLROnPlateau strategy (<xref ref-type="bibr" rid="ref47">Ruder, 2016</xref>) is adopted, which dynamically adjusts the learning rate based on the evaluation metrics. The initial learning rate is set to 1E-3, a value determined through preliminary experiments. A decay factor of 0.2 is employed, meaning that whenever the performance metric plateaus, the learning rate is multiplied by this factor. The minimum learning rate is set to 1E-6 to ensure that the learning process does not stagnate completely. The total number of iterations is capped at 150 to prevent excessive training and potential overfitting.</p>
<p>To further safeguard the convergence and generalization ability of the model, the learning rate is reduced adaptively: if there is no observable improvement in performance for 10 consecutive epochs, the learning rate is automatically reduced. This allows the model to fine-tune its learning pace and explore the parameter space more effectively. In addition, Early Stopping is used as a regularization technique. It guards against overfitting by monitoring the performance of the model on the validation set; once that performance ceases to improve, training is halted. The model is thus trained long enough to capture the underlying patterns in the data without overfitting to the training data and losing its generalization capability.</p>
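<p>The interplay of the learning rate schedule and Early Stopping can be made concrete by replaying a sequence of validation losses, as in the simplified Python sketch below. The initial learning rate, decay factor, minimum learning rate, patience of 10, and cap of 150 come from the settings above; the Early Stopping patience of 25, the function name, and the exact counting rule are assumptions, and the sketch does not reproduce the behavior of any particular library scheduler.</p>
<preformat><![CDATA[
```python
def run_schedule(val_losses, lr=1e-3, factor=0.2, min_lr=1e-6,
                 lr_patience=10, stop_patience=25, max_epochs=150):
    """Replay validation losses and return (epochs_run, final_lr).

    stop_patience is an assumed value; the text does not specify it.
    """
    best = float("inf")
    since_best = 0   # epochs since the last improvement (early stopping)
    since_cut = 0    # non-improving epochs since the last learning rate cut
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best, since_best, since_cut = loss, 0, 0
            continue
        since_best += 1
        since_cut += 1
        if since_cut >= lr_patience:
            lr = max(lr * factor, min_lr)   # multiply by 0.2, floor at 1E-6
            since_cut = 0
        if since_best >= stop_patience:
            break                           # validation loss has stalled
    return epochs_run, lr


# A loss that never improves after the first epoch: two cuts, then early stop.
stalled = run_schedule([1.0] * 40)
# A loss that improves every epoch: training runs to the 150-epoch cap.
improving = run_schedule([1.0 / (k + 1) for k in range(200)])
print(stalled, improving)
```
]]></preformat>
<p>In the stalled case the learning rate is cut after 10 and 20 non-improving epochs and training stops at epoch 26, whereas a steadily improving loss leaves the learning rate unchanged for all 150 epochs.</p>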
</sec>
<sec id="sec18">
<label>4.2.3</label>
<title>Metrics</title>
<p>To evaluate the discrepancy between predicted traffic flow speed and actual traffic flow speed, three performance metrics are utilized: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Square Error (RMSE) (<xref ref-type="bibr" rid="ref25">Li et al., 2021</xref>, <xref ref-type="bibr" rid="ref26">2022</xref>; <xref ref-type="bibr" rid="ref13">Guo B. et al., 2024</xref>; <xref ref-type="bibr" rid="ref14">Guo X. et al., 2024</xref>). The calculation method of the three metrics is shown in <xref ref-type="disp-formula" rid="EQ21">Equations 21</xref><xref ref-type="disp-formula" rid="EQ22"/>&#x2013;<xref ref-type="disp-formula" rid="EQ23">23</xref>.</p>
<disp-formula id="EQ21">
<label>(21)</label>
<mml:math id="M74">
<mml:mi mathvariant="italic">MAE</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>I</mml:mi>
</mml:mfrac>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo stretchy="true">&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>I</mml:mi>
</mml:msubsup>
<mml:mfenced close="|" open="|">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo stretchy="true">&#x0302;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ22">
<label>(22)</label>
<mml:math id="M75">
<mml:mi mathvariant="italic">MAPE</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>I</mml:mi>
</mml:mfrac>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo stretchy="true">&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>I</mml:mi>
</mml:msubsup>
<mml:mfrac>
<mml:mfenced close="|" open="|">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo stretchy="true">&#x0302;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mfrac>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<disp-formula id="EQ23">
<label>(23)</label>
<mml:math id="M76">
<mml:mi mathvariant="italic">RMSE</mml:mi>
<mml:mo>=</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>I</mml:mi>
</mml:mfrac>
<mml:msubsup>
<mml:mstyle displaystyle="true">
<mml:mo stretchy="true">&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>I</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo stretchy="true">&#x0302;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
<mml:mtext>,</mml:mtext>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math id="M77">
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo stretchy="true">&#x0302;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:math>
</inline-formula> represents the predicted speed of the traffic flow corresponding to node <italic>i</italic>, and <inline-formula>
<mml:math id="M78">
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:math>
</inline-formula> represents the actual speed of the traffic flow corresponding to the same node, which serves as the data label.</p>
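<p>For reference, Equations 21 to 23 translate directly into the short Python functions below. The function names and the toy speeds are illustrative assumptions. The MAPE function returns a fraction, as in Equation 22, and would be multiplied by 100 to express it as a percentage.</p>
<preformat><![CDATA[
```python
import math


def mae(pred, true):
    """Mean absolute error over all nodes (Equation 21)."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mape(pred, true):
    """Mean absolute percentage error as a fraction (Equation 22).

    Assumes every actual speed is non-zero.
    """
    return sum(abs(p - t) / t for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    """Root mean square error over all nodes (Equation 23)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


# Toy example with two nodes; speeds in mph are invented.
predicted = [50.0, 60.0]
actual = [40.0, 60.0]
print(mae(predicted, actual), mape(predicted, actual), rmse(predicted, actual))
```
]]></preformat>
<p>Because RMSE squares the errors before averaging, the single 10 mph error in the toy example weighs more heavily in RMSE than in MAE.</p>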
</sec>
</sec>
<sec id="sec19">
<label>4.3</label>
<title>Experimental results</title>
<sec id="sec20">
<label>4.3.1</label>
<title>Comparative study</title>
<p>The performance metrics of each model on the test dataset are listed in <xref ref-type="table" rid="tab2">Table 2</xref>. ARIMA and SVR are at a significant disadvantage. Their limitations stem from inherent structural characteristics that restrict their performance in large-scale prediction problems. For instance, ARIMA-based methods require the data to be stationary before making predictions, which can consume a significant amount of computational resources in large-scale prediction tasks. Additionally, as mentioned in the Introduction, ARIMA-based methods have limited effectiveness in handling nonlinear data, which further restricts their applicability. While SVR performs well on low-dimensional and small-sample datasets, it struggles with large-scale training samples and is sensitive to missing data; consequently, it faces challenges in pre-processing and parameter tuning. In contrast, DiffGRU and LSTM demonstrate a significant improvement in RMSE over ARIMA and SVR, with reductions of 23%/26% and 53%/55%, respectively. This highlights the advantages of DL models in traffic forecasting. Traffic flow exhibits long-term fluctuations in both time and space, and these underlying patterns need to be mined and learned in the forecasting process. Both GRU and LSTM use gate structures to achieve recurrent processing and feature extraction in sequential data. GRU merges the forget and input gates of LSTM into a single update gate, which may make it less effective in certain tasks requiring long-term dependencies, although its simplicity can lead to better efficiency in some cases. Furthermore, the network complexity of DiffGRU and LSTM is relatively low, and with a small number of parameters their ability to represent highly nonlinear road network features is limited. Therefore, their prediction accuracy is lower than that of the other DL methods (models 5&#x2013;8).</p>
<table-wrap position="float" id="tab2">
<label>Table 2</label>
<caption>
<p>Evaluation metrics of each model on the test set.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top">Number</th>
<th align="left" valign="top">Model</th>
<th align="center" valign="top">MAE/STD (mph)</th>
<th align="center" valign="top">MAPE (%)</th>
<th align="center" valign="top">RMSE (mph)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="middle">1</td>
<td align="left" valign="middle">SVR</td>
<td align="center" valign="middle">6.85/1.17</td>
<td align="center" valign="middle">14.39</td>
<td align="center" valign="middle">11.12</td>
</tr>
<tr>
<td align="left" valign="middle">2</td>
<td align="left" valign="middle">LSTM</td>
<td align="center" valign="middle">2.70/0.18</td>
<td align="center" valign="middle">6.83</td>
<td align="center" valign="middle">4.97</td>
</tr>
<tr>
<td align="left" valign="middle">3</td>
<td align="left" valign="middle">ARIMA</td>
<td align="center" valign="middle">6.10/1.09</td>
<td align="center" valign="middle">13.85</td>
<td align="center" valign="middle">10.65</td>
</tr>
<tr>
<td align="left" valign="middle">4</td>
<td align="left" valign="middle">DiffGRU</td>
<td align="center" valign="middle">4.67/0.38</td>
<td align="center" valign="middle">11.18</td>
<td align="center" valign="middle">8.22</td>
</tr>
<tr>
<td align="left" valign="middle">5</td>
<td align="left" valign="middle">TGG-LSTM</td>
<td align="center" valign="middle">2.57/0.10</td>
<td align="center" valign="middle">6.01</td>
<td align="center" valign="middle">4.63</td>
</tr>
<tr>
<td align="left" valign="middle">6</td>
<td align="left" valign="middle">DMLP</td>
<td align="center" valign="middle">2.40/0.09</td>
<td align="center" valign="middle">5.80</td>
<td align="center" valign="middle">3.57</td>
</tr>
<tr>
<td align="left" valign="middle">7</td>
<td align="left" valign="middle">LSTM+MLP</td>
<td align="center" valign="middle">2.40/0.09</td>
<td align="center" valign="middle">5.70</td>
<td align="center" valign="middle">3.56</td>
</tr>
<tr>
<td align="left" valign="middle">8</td>
<td align="left" valign="middle">Trafficformer</td>
<td align="center" valign="middle">2.10/0.07</td>
<td align="center" valign="middle">4.70</td>
<td align="center" valign="middle">3.08</td>
</tr>
</tbody>
</table>
</table-wrap>
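<p>The relative RMSE reductions quoted for DiffGRU and LSTM can be recomputed directly from the RMSE column of <xref ref-type="table" rid="tab2">Table 2</xref>. The following minimal Python sketch does so; the helper name <monospace>relative_reduction</monospace> is illustrative and is not part of the original work.</p>
<preformat>
```python
# RMSE values (mph) copied from Table 2.
rmse = {"SVR": 11.12, "ARIMA": 10.65, "DiffGRU": 8.22, "LSTM": 4.97}


def relative_reduction(baseline, model):
    """Fractional decrease of `model` relative to `baseline`."""
    return (baseline - model) / baseline


# Percentage reduction of each recurrent model against each classical baseline.
reductions = {
    (model, baseline): round(100 * relative_reduction(rmse[baseline], rmse[model]))
    for model in ("DiffGRU", "LSTM")
    for baseline in ("ARIMA", "SVR")
}

for (model, baseline), percent in reductions.items():
    print(f"{model} vs {baseline}: {percent}% lower RMSE")
```
</preformat>
<p>Rounded to whole percentages, the four values are 23, 26, 53, and 55, matching the figures reported in the text.</p>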
<p>DMLP, LSTM+MLP, TGG-LSTM, and Trafficformer are four models with sufficient complexity to capture the nonlinear patterns within traffic flow data. Therefore, compared to the simpler models discussed above, all four show a notable enhancement in accuracy. However, even the best performing of the three baselines, LSTM+MLP, has a 16% higher RMSE than Trafficformer. The forecasting accuracy of these three baselines is similar, with some differences. DMLP and LSTM+MLP have the closest performance, indicating that a single-layer MLP and LSTM have similar effectiveness in extracting traffic flow features. Comparing them with a single-layer LSTM network also reveals the importance of designing separate networks for traffic flow feature extraction in improving prediction performance. TGG-LSTM takes into account the complex spatiotemporal features of traffic flow data and explores the prediction task thoroughly using LSTM and graph convolutional neural networks as core algorithms. Theoretically, it is supposed to surpass other DL algorithms that overlook the spatial features of traffic flow. However, its error metrics are higher than those of the other three algorithms. Relative to the proposed Trafficformer model, its MAE, MAPE, and RMSE are higher by 22, 28, and 50%, respectively.</p>
<p>This phenomenon can be explained by two main causes. First, the self-attention mechanism in Transformer permits the model to capture information from any position in the sequence, enabling better handling of long-range dependencies. By contrast, GCN can only address long-range dependencies by increasing the number of convolutional layers. However, as the number of layers increases, the model&#x2019;s effectiveness in capturing dependencies diminishes and its interpretability is reduced. Therefore, prediction models based on GCN lack flexibility in feature extraction. Second, traffic data is typically collected by fixed-location detectors at regular time intervals, resulting in sequences with clear temporal features. With the inherent advantages of attention mechanisms, Transformer can be applied to any type of input regardless of its shape. The GCN algorithm, however, can only handle graph data, and treating traffic flow data as graph input disrupts the internal structure of the data to some degree, which limits the model&#x2019;s performance and results in relatively lower accuracy. This does not mean that GCN-based network structures cannot be applied to traffic forecasting problems. When the data collection method changes, such as using image-based traffic data collected by video detectors, GCN-based models may achieve better prediction results (<xref ref-type="bibr" rid="ref31">Li et al., 2024b</xref>, <xref ref-type="bibr" rid="ref28">c</xref>).</p>
<p>In conclusion, the Trafficformer model shows significant improvements in MAE, MAPE, and RMSE compared to the baseline methods, which indicates good performance in predicting future traffic flow.</p>
<p>In addition, to evaluate the reliability of this performance improvement more rigorously from a statistical perspective, Trafficformer is compared with LSTM+MLP, which performed best among the comparison methods. The predicted and true values from both models on the test set are used as inputs for paired <italic>t</italic>-tests and Diebold&#x2013;Mariano (DM) tests. The paired <italic>t</italic>-test determines whether there is a significant difference in the means of two paired datasets. The null hypothesis states that the means of the two groups are equal, while the alternative hypothesis posits that they are not. If the <italic>p</italic>-value obtained from the paired <italic>t</italic>-test is less than 0.05, the null hypothesis can be rejected, indicating a statistically significant difference between the means of the two groups. The DM test compares the predictive accuracy of two models. Its null hypothesis is that there is no difference in predictive accuracy between the two models, and the alternative hypothesis is that there is a difference (<xref ref-type="bibr" rid="ref19">Iftikhar et al., 2023</xref>, <xref ref-type="bibr" rid="ref18">2024</xref>; <xref ref-type="bibr" rid="ref12">Gonzales et al., 2024</xref>). When the <italic>p</italic>-value calculated from the DM test is less than 0.05, there is sufficient evidence to reject the null hypothesis, suggesting that the predictive accuracies of the two models differ significantly.</p>
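<p>As an illustration of the two tests, the sketch below computes a paired <italic>t</italic> statistic and a DM statistic on synthetic data using only the Python standard library. It is a simplified sketch under stated assumptions, not the procedure used in this study: the DM test is restricted to one-step forecasts with squared-error loss and no autocovariance correction, its <italic>p</italic>-value uses the asymptotic normal approximation, and all function names and data are hypothetical. An exact <italic>p</italic>-value for the paired <italic>t</italic>-test requires the Student <italic>t</italic> distribution, which is available in SciPy as <monospace>scipy.stats.ttest_rel</monospace>.</p>
<preformat>
```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def two_sided_p_normal(stat):
    """Two-sided p-value under a standard normal approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(stat)))


def dm_test(true, pred_a, pred_b):
    """Diebold-Mariano statistic for one-step forecasts, squared-error loss.

    A negative statistic means model A has the smaller average loss.
    Assumption: no autocovariance correction, so only horizon 1 is covered.
    """
    loss_diff = [(t - a) ** 2 - (t - b) ** 2
                 for t, a, b in zip(true, pred_a, pred_b)]
    stat = mean(loss_diff) / (stdev(loss_diff) / math.sqrt(len(loss_diff)))
    return stat, two_sided_p_normal(stat)


# Synthetic speeds (mph): model B is constructed to be noisier than model A.
rng = random.Random(0)
true = [60.0 + 5.0 * math.sin(i / 10.0) for i in range(500)]
pred_a = [t + rng.gauss(0.0, 1.0) for t in true]
pred_b = [t + rng.gauss(0.0, 3.0) for t in true]

dm_stat, dm_p = dm_test(true, pred_a, pred_b)
t_stat_a = paired_t_statistic(pred_a, true)
print(f"DM statistic: {dm_stat:.2f}, p-value: {dm_p:.3g}")
print(f"paired t statistic (model A vs. truth): {t_stat_a:.2f}")
```
</preformat>
<p>With the sign convention chosen here, a negative DM statistic favors the first model, so the sign conveys which model is more accurate while the <italic>p</italic>-value conveys whether the difference is significant.</p>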
<p>As shown in <xref ref-type="table" rid="tab3">Table 3</xref>, the <italic>p</italic>-values from the paired <italic>t</italic>-tests for Trafficformer and LSTM+MLP are very small (averaging 3.27E-18 and 2.96E-03, respectively), well below the 0.05 significance level. Thus, the null hypothesis is rejected, confirming a statistically significant difference between the predicted and true values of the two models. Moreover, the DM test further supports this conclusion by rejecting the null hypothesis that the models&#x2019; predictive performances are identical. The DM statistics at the different steps and their correspondingly small <italic>p</italic>-values indicate that the prediction errors of the two models are fundamentally different, reflecting the distinct effectiveness of their prediction mechanisms rather than random fluctuations. In summary, Trafficformer demonstrates clear advantages in both prediction accuracy and statistical significance, showcasing its broad application potential in traffic prediction problems.</p>
<table-wrap position="float" id="tab3">
<label>Table 3</label>
<caption>
<p>Statistical significance tests for LSTM+MLP and Trafficformer.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top" colspan="3">Models</th>
<th align="center" valign="top">LSTM+MLP</th>
<th align="center" valign="top">Trafficformer</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="top" rowspan="8">Paired <italic>t</italic>-tests</td>
<td align="left" valign="top" rowspan="2">Step 1</td>
<td align="left" valign="top"><italic>t</italic>-statistic</td>
<td align="center" valign="top">6.03</td>
<td align="center" valign="top">10.96</td>
</tr>
<tr>
<td align="left" valign="top"><italic>p</italic>-value</td>
<td align="center" valign="top">1.67E-09</td>
<td align="center" valign="top">8.74E-28</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">Step 24</td>
<td align="left" valign="top"><italic>t</italic>-statistic</td>
<td align="center" valign="top">2.84</td>
<td align="center" valign="top">14.91</td>
</tr>
<tr>
<td align="left" valign="top"><italic>p</italic>-value</td>
<td align="center" valign="top">4.55E-03</td>
<td align="center" valign="top">3.89E-47</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">Step 123</td>
<td align="left" valign="top"><italic>t</italic>-statistic</td>
<td align="center" valign="top">2.85</td>
<td align="center" valign="top">8.59</td>
</tr>
<tr>
<td align="left" valign="top"><italic>p</italic>-value</td>
<td align="center" valign="top">4.34E-03</td>
<td align="center" valign="top">9.82E-18</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">Average</td>
<td align="left" valign="top"><italic>t</italic>-statistic</td>
<td align="center" valign="top">3.91</td>
<td align="center" valign="top">11.49</td>
</tr>
<tr>
<td align="left" valign="top"><italic>p</italic>-value</td>
<td align="center" valign="top">2.96E-03</td>
<td align="center" valign="top">3.27E-18</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="8">DM tests</td>
<td align="left" valign="top" rowspan="2">Step 1</td>
<td align="left" valign="top">DM statistic</td>
<td align="center" valign="top" colspan="2">22.70</td>
</tr>
<tr>
<td align="left" valign="top"><italic>p</italic>-value</td>
<td align="center" valign="top" colspan="2">0.00</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">Step 24</td>
<td align="left" valign="top">DM statistic</td>
<td align="center" valign="top" colspan="2">&#x2212;8.38</td>
</tr>
<tr>
<td align="left" valign="top"><italic>p</italic>-value</td>
<td align="center" valign="top" colspan="2">1.55E-15</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">Step 123</td>
<td align="left" valign="top">DM statistic</td>
<td align="center" valign="top" colspan="2">2.53</td>
</tr>
<tr>
<td align="left" valign="top"><italic>p</italic>-value</td>
<td align="center" valign="top" colspan="2">0.01</td>
</tr>
<tr>
<td align="left" valign="top" rowspan="2">Average</td>
<td align="left" valign="top">DM statistic</td>
<td align="center" valign="top" colspan="2">5.62</td>
</tr>
<tr>
<td align="left" valign="top"><italic>p</italic>-value</td>
<td align="center" valign="top" colspan="2">3.87E-03</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig5">Figure 5</xref> shows the validation loss curves of the four deep neural network models and their training times on the training set.</p>
<fig position="float" id="fig5">
<label>Figure 5</label>
<caption>
<p>Validation-set mean square error and training-set training time of the DL comparison models.</p>
</caption>
<graphic xlink:href="fnbot-19-1527908-g005.tif"/>
</fig>
<p>Because early stopping is used, the number of training iterations differs across models. Interestingly, training time grows with model complexity, whereas the prediction error generally moves in the opposite direction. DMLP and LSTM+MLP again show similar training times, and both converge in about 50 epochs. TGG-LSTM converges in 84 epochs, while Trafficformer converges in 93 epochs. The right panel of <xref ref-type="fig" rid="fig5">Figure 5</xref> shows the training time of the four algorithms on the training set at the same step size, from which similar conclusions can be drawn: relatively simple network architectures such as DMLP and LSTM+MLP train significantly faster than larger networks such as TGG-LSTM and Trafficformer. This shows that improving model accuracy comes at the cost of increased training time. Therefore, in practical applications, accuracy and complexity must be balanced according to the specific scenario and its requirements. Where traffic flow patterns are relatively stable and real-time requirements are high, simple models may have an advantage because of their fast computation and relatively simple deployment. Where traffic conditions are complex and changeable and prediction accuracy requirements are strict, complex models incur higher training and deployment costs but provide more accurate predictions that support traffic management decisions.</p>
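<p>The dependence of the epoch count on early stopping can be illustrated with a short sketch. The patience-based rule below is a common form of early stopping and is assumed here for illustration only; the stopping criterion, the patience value, and the loss values are hypothetical and are not taken from this study.</p>
<preformat>
```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; if that never happens,
    the last epoch is returned.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best > loss:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses: the best value is first reached at epoch 4.
losses = [0.90, 0.70, 0.65, 0.60, 0.61, 0.62, 0.60, 0.63]
stop = stopping_epoch(losses, patience=3)
print(f"training stops at epoch {stop}")
```
</preformat>
<p>Under such a rule, a model whose validation loss plateaus early stops after fewer epochs, which is one reason the models compared here train for different numbers of epochs.</p>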
</sec>
<sec id="sec21">
<label>4.3.2</label>
<title>Ablation study</title>
<p>The Trafficformer model is a DL framework composed of three modules: traffic node feature extraction, traffic node feature interaction, and traffic node speed forecasting. The experimental data for models 3&#x2013;6 in <xref ref-type="table" rid="tab2">Table 2</xref> have demonstrated the necessity of using separate feature extraction and prediction modules, underscoring the significant advantages of employing MLP as the feature extraction module in terms of accuracy and efficiency. With the other modules kept unchanged, this section focuses primarily on the analysis of the effectiveness of the node feature interaction module.</p>
<p><xref ref-type="fig" rid="fig6">Figure 6</xref> presents the performance of the models on the training, validation, and test sets as the number of layers <italic>L</italic> in the module&#x2019;s internal encoder varies (where <italic>L</italic>&#x202F;=&#x202F;0 indicates the absence of the feature interaction module). As the number of encoder layers increases from 0, the performance of the model on the training, validation, and test sets first rises and then stabilizes. This is because, in the initial stage, increasing the number of encoder layers enables the model to gradually learn more complex spatiotemporal features and potential patterns in traffic flow data. The model achieves optimal performance when the number of encoder layers reaches 6. Therefore, this study sets the number of encoder layers in the interaction module to 6. In addition, even without the spatial mask matrix based on road topology as an <italic>a priori</italic> constraint, the model still performs better after the interaction module is added. This is mainly due to the structural design of the interaction module. Its encoder performs multi-level feature extraction and transformation on the input traffic node features and enhances the model&#x2019;s ability to learn complex relationships between nodes through information transmission and fusion between layers. Moreover, in each encoder layer, the multi-head attention mechanism lets the model attend simultaneously to the correlations between different nodes in different feature subspaces, thereby capturing the dynamic patterns of traffic flow in the time and space dimensions.</p>
<fig position="float" id="fig6">
<label>Figure 6</label>
<caption>
<p><bold>(A)</bold> Training loss curve, <bold>(B)</bold> validation set mean square error, <bold>(C)</bold> test set mean absolute error.</p>
</caption>
<graphic xlink:href="fnbot-19-1527908-g006.tif"/>
</fig>
<p>Furthermore, to better understand the effect of the attention mechanism in the interaction module, this study plots the topological connectivity graph of the road network using node indices as the x and y coordinates. As shown in <xref ref-type="fig" rid="fig7">Figure 7A</xref>, the yellow region represents the spatially connected target nodes. This connectivity does not imply the existence of roads for vehicle passage between the nodes but rather indicates the spatial range reachable by vehicles traveling at free-flow speed. The spatial mask mentioned in the paper is also constructed based on this concept. <xref ref-type="fig" rid="fig7">Figure 7B</xref> displays the attention relationships between different nodes, where darker colors indicate stronger correlations between nodes. The attention learned by the Trafficformer model lies within the range of the connectivity graph. Additionally, the darker regions in the graph mostly correspond to busy traffic segments such as highway entrances or exits. Taking the location highlighted by the red box in <xref ref-type="fig" rid="fig7">Figure 7A</xref> as an example, it is a crossroad near the entrance of Mercer Island, located between I-90 and the city&#x2019;s main arterial roads. This segment is a significant feature in the Loop dataset, and the dark markings within the yellow box in <xref ref-type="fig" rid="fig7">Figure 7B</xref> confirm this observation.</p>
<fig position="float" id="fig7">
<label>Figure 7</label>
<caption>
<p><bold>(A)</bold> The actual road network topology connectivity graph, <bold>(B)</bold> model attention value.</p>
</caption>
<graphic xlink:href="fnbot-19-1527908-g007.tif"/>
</fig>
<p>Based on the aforementioned analysis, this study introduces constraints based on road topology in both single-layer and multilayer interaction modules to investigate the importance of spatial masks. As shown in <xref ref-type="table" rid="tab4">Table 4</xref>, for a network structure with only one interactive unit, adding a spatial mask improved the model&#x2019;s prediction accuracy for node speed by 6.27, 9.34, and 10.41% on the training, validation, and test sets, respectively. For a network structure with six interactive units, adding spatial masks improved the prediction accuracy on the training, validation, and test sets by 33.95, 17.28, and 18.37%, respectively. Clearly, with the spatial mask prior, the performance of the interaction module is significantly improved. This is mainly attributed to how the spatial mask shapes the model mechanism. From the perspective of interaction mode, it limits the range of interacting nodes, allowing the model to focus on highly accessible traffic nodes when calculating attention scores and fusing features, which avoids interference from irrelevant nodes and captures the influencing factors accurately. From the perspective of information transfer, discarding the information of a large number of irrelevant nodes reduces the spread of redundant information during training, thereby significantly reducing the amount of computation and improving the operating efficiency of the model. Therefore, the spatial mask enables the model to learn the spatial dependence in the traffic network efficiently, which is of key value in Trafficformer.</p>
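<p>The way a spatial mask restricts attention to reachable nodes can be sketched as a masked row-wise softmax. The example below is an illustrative sketch rather than the Trafficformer implementation: the binary mask encoding (1 for a reachable node pair, 0 otherwise), the three-node network, the score values, and the function name <monospace>masked_attention_weights</monospace> are all assumptions made for this example.</p>
<preformat>
```python
import math


def masked_attention_weights(scores, mask):
    """Row-wise softmax over scores, restricted to entries where mask is 1.

    Entries with mask 0 receive exactly zero weight. Each row is assumed to
    contain at least one allowed entry (for example, the node itself).
    """
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        peak = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 3-node network: nodes 0 and 2 cannot reach each other.
mask = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
# Hypothetical raw attention scores; note the large score between nodes 0 and 2.
scores = [
    [0.2, 1.0, 3.0],
    [0.5, 0.5, 0.5],
    [2.0, 0.1, 0.3],
]
weights = masked_attention_weights(scores, mask)
for row in weights:
    print([round(w, 3) for w in row])
```
</preformat>
<p>Even though the raw score between nodes 0 and 2 is the largest in its row, the mask forces the corresponding weight to zero, so each node aggregates features only from nodes it can reach.</p>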
<table-wrap position="float" id="tab4">
<label>Table 4</label>
<caption>
<p>Evaluation metrics of the interaction module with and without the spatial mask.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top" colspan="2" rowspan="2">Datasets and evaluation metrics</th>
<th align="center" valign="top" colspan="2">Single-layer interaction control group 1</th>
<th align="center" valign="top" colspan="2">Multilayer interaction control group 2</th>
</tr>
<tr>
<th align="center" valign="top">No spatial mask</th>
<th align="center" valign="top">With spatial mask</th>
<th align="center" valign="top">No spatial mask</th>
<th align="center" valign="top">With spatial mask</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" valign="middle">Training set</td>
<td align="left" valign="middle">MSE (mph)<sup>2</sup></td>
<td align="center" valign="middle">5.44E-04</td>
<td align="center" valign="middle">5.10E-04</td>
<td align="center" valign="middle">5.76E-04</td>
<td align="center" valign="middle">3.63E-04</td>
</tr>
<tr>
<td align="left" valign="middle" rowspan="2">Validation set</td>
<td align="left" valign="middle">MAE (mph)</td>
<td align="center" valign="middle">2.34</td>
<td align="center" valign="middle">2.30</td>
<td align="center" valign="middle">2.24</td>
<td align="center" valign="middle">2.10</td>
</tr>
<tr>
<td align="left" valign="middle">MSE (mph)<sup>2</sup></td>
<td align="center" valign="middle">5.12E-04</td>
<td align="center" valign="middle">4.64E-04</td>
<td align="center" valign="middle">4.90E-04</td>
<td align="center" valign="middle">4.05E-04</td>
</tr>
<tr>
<td align="left" valign="middle" rowspan="3">Test set</td>
<td align="left" valign="middle">MAE (mph)</td>
<td align="center" valign="middle">2.34</td>
<td align="center" valign="middle">2.24</td>
<td align="center" valign="middle">2.30</td>
<td align="center" valign="middle">2.10</td>
</tr>
<tr>
<td align="left" valign="middle">MAPE (%)</td>
<td align="center" valign="middle">5.50</td>
<td align="center" valign="middle">5.10</td>
<td align="center" valign="middle">5.40</td>
<td align="center" valign="middle">4.70</td>
</tr>
<tr>
<td align="left" valign="middle">RMSE (mph)</td>
<td align="center" valign="middle">3.50</td>
<td align="center" valign="middle">3.31</td>
<td align="center" valign="middle">3.41</td>
<td align="center" valign="middle">3.08</td>
</tr>
</tbody>
</table>
</table-wrap>
<p><xref ref-type="fig" rid="fig8">Figure 8</xref> compares the true values (blue curve) with the predicted values (grey curve) on the test set. Regardless of the traffic operating conditions, the predicted curve closely follows the actual curve. This observation indicates that the Trafficformer model is capable of effectively extracting traffic flow features and achieving high-precision predictions for spatiotemporal fused traffic networks.</p>
<fig position="float" id="fig8">
<label>Figure 8</label>
<caption>
<p>Example of traffic speed forecasting.</p>
</caption>
<graphic xlink:href="fnbot-19-1527908-g008.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec sec-type="conclusions" id="sec22">
<label>5</label>
<title>Conclusion</title>
<p>In this paper, a DL framework built upon the Transformer architecture is proposed to address short-term prediction challenges in spatiotemporal fused traffic networks. Specifically, the multilayer perceptron and multi-head attention mechanisms are employed to efficiently extract spatiotemporal features of traffic flow. Prior constraints based on traffic node connectivity are also incorporated to limit interactions to reachable nodes, reducing unnecessary noise and improving both algorithm stability and precision. Test results demonstrate that the Trafficformer framework possesses a robust network structure and outperforms other baseline methods in both accuracy and computational complexity, making it particularly suitable for large-scale traffic forecasting tasks. In addition, using the learned attention distribution, managers can identify key traffic nodes and adjust control strategies accordingly, such as extending the green time of major roads or adjusting the signal phase of surrounding intersections, thereby optimizing traffic flow, alleviating congestion, and improving traffic efficiency.</p>
<p>Nevertheless, it is important to acknowledge the limitations of this paper. The model is trained and evaluated mainly on conventional traffic data. However, traffic flow is affected by many special factors such as weather, traffic accidents, and road construction. The model is not sufficiently adaptable to these situations, and its prediction accuracy will decrease under abnormal conditions. In future work, additional metadata, such as weather data and event reports, will be introduced so that these special factors are incorporated into the model training process. This aims to enhance the model&#x2019;s ability to cope with complex situations, thereby improving its prediction accuracy under abnormal conditions.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="sec23">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec sec-type="author-contributions" id="sec24">
<title>Author contributions</title>
<p>AC: Methodology, Validation, Writing &#x2013; original draft. YJ: Investigation, Visualization, Writing &#x2013; review &#x0026; editing. YB: Funding acquisition, Methodology, Supervision, Writing &#x2013; review &#x0026; editing.</p>
</sec>
<sec sec-type="funding-information" id="sec25">
<title>Funding</title>
<p>The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research was funded by the Projects of National Natural Science Foundation of China [grant numbers 52220105001, 52131203, &#x0026; 72471102] and the Plan Project of the Science and Technology Department of Jilin Province [grant number 20230508048RC].</p>
</sec>
<sec sec-type="COI-statement" id="sec26">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="ai-statement" id="sec27">
<title>Generative AI statement</title>
<p>The authors declare that no Gen AI was used in the creation of this manuscript.</p>
</sec>
<sec sec-type="disclaimer" id="sec28">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="ref1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Berghaus</surname> <given-names>M.</given-names></name> <name><surname>Lamberty</surname> <given-names>S.</given-names></name> <name><surname>Ehlers</surname> <given-names>J.</given-names></name> <name><surname>Kall&#x00F3;</surname> <given-names>E.</given-names></name> <name><surname>Oeser</surname> <given-names>M.</given-names></name></person-group> (<year>2024</year>). <article-title>Vehicle trajectory dataset from drone videos including off-ramp and congested traffic&#x2013;analysis of data quality, traffic flow, and accident risk</article-title>. <source>Commun. Transp. Res.</source> <volume>4</volume>:<fpage>100133</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.commtr.2024.100133</pub-id></citation></ref>
<ref id="ref2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bie</surname> <given-names>Y.</given-names></name> <name><surname>Ji</surname> <given-names>Y.</given-names></name> <name><surname>Ma</surname> <given-names>D.</given-names></name></person-group> (<year>2024</year>). <article-title>Multi-agent deep reinforcement learning collaborative traffic signal control method considering intersection heterogeneity</article-title>. <source>Transp. Res. Part C Emerg. Technol.</source> <volume>164</volume>:<fpage>104663</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.trc.2024.104663</pub-id></citation></ref>
<ref id="ref3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>L.</given-names></name> <name><surname>Janowicz</surname> <given-names>K.</given-names></name> <name><surname>Mai</surname> <given-names>G.</given-names></name> <name><surname>Yan</surname> <given-names>B.</given-names></name> <name><surname>Zhu</surname> <given-names>R.</given-names></name></person-group> (<year>2020</year>). <article-title>Traffic transformer: capturing the continuity and periodicity of time series for traffic forecasting</article-title>. <source>Trans. GIS</source> <volume>24</volume>, <fpage>736</fpage>&#x2013;<lpage>755</lpage>. doi: <pub-id pub-id-type="doi">10.1111/tgis.12644</pub-id></citation></ref>
<ref id="ref4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>C.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>L.</given-names></name> <name><surname>Zhang</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>Bidirectional spatial-temporal adaptive transformer for urban traffic flow forecasting</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> <volume>34</volume>, <fpage>6913</fpage>&#x2013;<lpage>6925</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TNNLS.2022.3183903</pub-id>, PMID: <pub-id pub-id-type="pmid">35771780</pub-id></citation></ref>
<ref id="ref5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Shen</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name></person-group> (<year>2024</year>). <article-title>Adp STGCN: adaptive spatial&#x2013;temporal graph convolutional network for traffic forecasting</article-title>. <source>Knowl.-Based Syst.</source> <volume>301</volume>:<fpage>112295</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.knosys.2024.112295</pub-id>, PMID: <pub-id pub-id-type="pmid">39764307</pub-id></citation></ref>
<ref id="ref6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cui</surname> <given-names>Z.</given-names></name> <name><surname>Henrickson</surname> <given-names>K.</given-names></name> <name><surname>Ke</surname> <given-names>R.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>Traffic graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>21</volume>, <fpage>4883</fpage>&#x2013;<lpage>4894</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2019.2950416</pub-id>, PMID: <pub-id pub-id-type="pmid">39573497</pub-id></citation></ref>
<ref id="ref7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>C.</given-names></name> <name><surname>Zhu</surname> <given-names>L.</given-names></name> <name><surname>Shen</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Liang</surname> <given-names>Q.</given-names></name></person-group> (<year>2024</year>). <article-title>The intelligent traffic flow control system based on 6G and optimized genetic algorithm</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2024.3467269</pub-id></citation></ref>
<ref id="ref8"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Dong</surname> <given-names>H.</given-names></name> <name><surname>Meng</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Jia</surname> <given-names>L.</given-names></name> <name><surname>Qin</surname> <given-names>Y.</given-names></name></person-group> (<year>2021</year>). <article-title>Multi-step spatial-temporal fusion network for traffic flow forecasting</article-title>. In <conf-name>Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (IEEE: ITSC)</conf-name>, <fpage>3412</fpage>&#x2013;<lpage>3419</lpage>.</citation></ref>
<ref id="ref9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eleonora</surname> <given-names>A.</given-names></name> <name><surname>Pinar</surname> <given-names>B.</given-names></name></person-group> (<year>2023</year>). <article-title>Potential impact of autonomous vehicles in mixed traffic from simulation using real traffic flow</article-title>. <source>J. Intell. Connect. Veh.</source> <volume>6</volume>, <fpage>1</fpage>&#x2013;<lpage>15</lpage>. doi: <pub-id pub-id-type="doi">10.26599/JICV.2023.9210001</pub-id>, PMID: <pub-id pub-id-type="pmid">38437756</pub-id></citation></ref>
<ref id="ref10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fang</surname> <given-names>W.</given-names></name> <name><surname>Zhuo</surname> <given-names>W.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Zhou</surname> <given-names>T.</given-names></name> <name><surname>Qin</surname> <given-names>J.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0394;free-LSTM: an error distribution free deep learning for short-term traffic flow forecasting</article-title>. <source>Neurocomputing</source> <volume>526</volume>, <fpage>180</fpage>&#x2013;<lpage>190</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.neucom.2023.01.009</pub-id></citation></ref>
<ref id="ref11"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Gong</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>T.</given-names></name> <name><surname>Chai</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Feng</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>Empowering spatial knowledge graph for mobile traffic prediction</article-title>. In <conf-name>Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems</conf-name>, Association for Computing Machinery, <fpage>1</fpage>&#x2013;<lpage>11</lpage>.</citation></ref>
<ref id="ref12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gonzales</surname> <given-names>S. M.</given-names></name> <name><surname>Iftikhar</surname> <given-names>H.</given-names></name> <name><surname>L&#x00F3;pez-Gonzales</surname> <given-names>J. L.</given-names></name></person-group> (<year>2024</year>). <article-title>Analysis and forecasting of electricity prices using an improved time series ensemble approach: an application to the Peruvian electricity market</article-title>. <source>AIMS Math.</source> <volume>9</volume>, <fpage>21952</fpage>&#x2013;<lpage>21971</lpage>. doi: <pub-id pub-id-type="doi">10.3934/math.20241067</pub-id></citation></ref>
<ref id="ref13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>B.</given-names></name> <name><surname>Huang</surname> <given-names>Z.</given-names></name> <name><surname>Zheng</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>F.</given-names></name> <name><surname>Wang</surname> <given-names>P.</given-names></name></person-group> (<year>2024</year>). <article-title>Understanding the predictability of path flow distribution in urban road networks using an information entropy approach</article-title>. <source>Multimodal Transp.</source> <volume>3</volume>:<fpage>100135</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.multra.2024.100135</pub-id></citation></ref>
<ref id="ref14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Guo</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Jiang</surname> <given-names>J.</given-names></name> <name><surname>Peng</surname> <given-names>M.</given-names></name> <name><surname>Zhu</surname> <given-names>M.</given-names></name> <name><surname>Yang</surname> <given-names>H. F.</given-names></name></person-group> (<year>2024</year>). <article-title>Towards explainable traffic flow prediction with large language models</article-title>. <source>Commun. Transp. Res.</source> <volume>4</volume>:<fpage>100150</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.commtr.2024.100150</pub-id></citation></ref>
<ref id="ref15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hamed</surname> <given-names>M. M.</given-names></name> <name><surname>Al-Masaeid</surname> <given-names>H. R.</given-names></name> <name><surname>Said</surname> <given-names>Z. M. B.</given-names></name></person-group> (<year>1995</year>). <article-title>Short-term prediction of traffic volume in urban arterials</article-title>. <source>J. Transp. Eng.</source> <volume>121</volume>, <fpage>249</fpage>&#x2013;<lpage>254</lpage>. doi: <pub-id pub-id-type="doi">10.1061/(ASCE)0733-947X(1995)121:3(249)</pub-id></citation></ref>
<ref id="ref16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>D. C.</given-names></name></person-group> (<year>2024</year>). <article-title>Prediction of traffic volume based on deep learning model for AADT correction</article-title>. <source>Appl. Sci.</source> <volume>14</volume>:<fpage>9436</fpage>. doi: <pub-id pub-id-type="doi">10.3390/app14209436</pub-id></citation></ref>
<ref id="ref17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Zhu</surname> <given-names>X.</given-names></name> <name><surname>Tsui</surname> <given-names>K. L.</given-names></name></person-group> (<year>2022</year>). <article-title>Multi-graph convolutional-recurrent neural network (MGC-RNN) for short-term forecasting of transit passenger flow</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>23</volume>, <fpage>18155</fpage>&#x2013;<lpage>18174</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2022.3150600</pub-id></citation></ref>
<ref id="ref18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Iftikhar</surname> <given-names>H.</given-names></name> <name><surname>Gonzales</surname> <given-names>S. M.</given-names></name> <name><surname>Zywio&#x0142;ek</surname> <given-names>J.</given-names></name> <name><surname>L&#x00F3;pez-Gonzales</surname> <given-names>J. L.</given-names></name></person-group> (<year>2024</year>). <article-title>Electricity demand forecasting using a novel time series ensemble technique</article-title>. <source>IEEE Access</source> <volume>12</volume>, <fpage>88963</fpage>&#x2013;<lpage>88975</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2024.3419551</pub-id></citation></ref>
<ref id="ref19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Iftikhar</surname> <given-names>H.</given-names></name> <name><surname>Zafar</surname> <given-names>A.</given-names></name> <name><surname>Turpo-Chaparro</surname> <given-names>J. E.</given-names></name> <name><surname>Canas Rodrigues</surname> <given-names>P.</given-names></name> <name><surname>L&#x00F3;pez-Gonzales</surname> <given-names>J. L.</given-names></name></person-group> (<year>2023</year>). <article-title>Forecasting day-ahead Brent crude oil prices using hybrid combinations of time series models</article-title>. <source>Mathematics</source> <volume>11</volume>:<fpage>3548</fpage>. doi: <pub-id pub-id-type="doi">10.3390/math11163548</pub-id></citation></ref>
<ref id="ref20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ji</surname> <given-names>J.</given-names></name> <name><surname>Bie</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name></person-group> (<year>2023</year>). <article-title>Optimal electric bus fleet scheduling for a route with charging facility sharing</article-title>. <source>Transp. Res. Part C Emerg. Technol.</source> <volume>147</volume>:<fpage>104010</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.trc.2022.104010</pub-id></citation></ref>
<ref id="ref21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ji</surname> <given-names>J.</given-names></name> <name><surname>Bie</surname> <given-names>Y.</given-names></name> <name><surname>Zeng</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name></person-group> (<year>2022</year>). <article-title>Trip energy consumption estimation for electric buses</article-title>. <source>Commun. Transp. Res.</source> <volume>2</volume>:<fpage>100069</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.commtr.2022.100069</pub-id></citation></ref>
<ref id="ref22"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kipf</surname> <given-names>T. N.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Semi-supervised classification with graph convolutional networks</article-title>. <source>arXiv</source> <volume>arXiv</volume>:<fpage>1609.02907</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1609.02907</pub-id></citation></ref>
<ref id="ref23"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lei Ba</surname> <given-names>J.</given-names></name> <name><surname>Kiros</surname> <given-names>J. R.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2016</year>). <article-title>Layer normalization</article-title>. <source>arXiv</source> <volume>arXiv</volume>:<fpage>1607.06450</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1607.06450</pub-id></citation></ref>
<ref id="ref24"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Bai</surname> <given-names>F.</given-names></name> <name><surname>Lyu</surname> <given-names>C.</given-names></name> <name><surname>Qu</surname> <given-names>X.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name></person-group> (<year>2025</year>). <article-title>A systematic review of generative adversarial networks for traffic state prediction: overview, taxonomy, and future prospects</article-title>. <source>Inf. Fusion</source> <volume>102915</volume>:<fpage>102915</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.inffus.2024.102915</pub-id></citation></ref>
<ref id="ref25"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Ma</surname> <given-names>H.</given-names></name> <name><surname>Jia</surname> <given-names>Z.</given-names></name></person-group> (<year>2021</year>). <article-title>Change detection from SAR images based on convolutional neural networks guided by saliency enhancement</article-title>. <source>Remote Sens.</source> <volume>13</volume>:<fpage>3697</fpage>. doi: <pub-id pub-id-type="doi">10.3390/rs13183697</pub-id></citation></ref>
<ref id="ref26"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Ma</surname> <given-names>H.</given-names></name> <name><surname>Jia</surname> <given-names>Z.</given-names></name></person-group> (<year>2022</year>). <article-title>Multiscale geometric analysis fusion-based unsupervised change detection in remote sensing images via FLICM model</article-title>. <source>Entropy</source> <volume>24</volume>:<fpage>291</fpage>. doi: <pub-id pub-id-type="doi">10.3390/e24020291</pub-id>, PMID: <pub-id pub-id-type="pmid">35205585</pub-id></citation></ref>
<ref id="ref27"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Ma</surname> <given-names>H.</given-names></name> <name><surname>Jia</surname> <given-names>Z.</given-names></name></person-group> (<year>2023</year>). <article-title>Gamma correction-based automatic unsupervised change detection in SAR images via FLICM model</article-title>. <source>J. Indian Soc. Remote Sens.</source> <volume>51</volume>, <fpage>1077</fpage>&#x2013;<lpage>1088</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s12524-023-01674-4</pub-id></citation></ref>
<ref id="ref28"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Ma</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Lv</surname> <given-names>M.</given-names></name> <name><surname>Jia</surname> <given-names>Z.</given-names></name></person-group> (<year>2024c</year>). <article-title>Synthetic aperture radar image change detection based on principal component analysis and two-level clustering</article-title>. <source>Remote Sens.</source> <volume>16</volume>:<fpage>1861</fpage>. doi: <pub-id pub-id-type="doi">10.3390/rs16111861</pub-id></citation></ref>
<ref id="ref29"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Shi</surname> <given-names>Y.</given-names></name> <name><surname>Lv</surname> <given-names>M.</given-names></name> <name><surname>Jia</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>M.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2024a</year>). <article-title>Infrared and visible image fusion via sparse representation and guided filtering in laplacian pyramid domain</article-title>. <source>Remote Sens.</source> <volume>16</volume>:<fpage>3804</fpage>. doi: <pub-id pub-id-type="doi">10.3390/rs16203804</pub-id></citation></ref>
<ref id="ref30"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Yu</surname> <given-names>R.</given-names></name> <name><surname>Shahabi</surname> <given-names>C.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name></person-group> (<year>2017</year>). <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>. <source>arXiv</source> <volume>arXiv</volume>:<fpage>1707.01926</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1707.01926</pub-id></citation></ref>
<ref id="ref31"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Hou</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Lv</surname> <given-names>M.</given-names></name> <name><surname>Jia</surname> <given-names>Z.</given-names></name> <etal/></person-group>. (<year>2024b</year>). <article-title>Fractal dimension-based multi-focus image fusion via coupled neural P systems in NSCT domain</article-title>. <source>Fractal Fract.</source> <volume>8</volume>:<fpage>554</fpage>. doi: <pub-id pub-id-type="doi">10.3390/fractalfract8100554</pub-id></citation></ref>
<ref id="ref32"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Zheng</surname> <given-names>H.</given-names></name> <name><surname>Feng</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name></person-group> (<year>2017</year>). <article-title>Short-term traffic flow prediction with conv-LSTM</article-title>. In <conf-name>Proceedings of the 9th International Conference on Wireless Communications and Signal Processing (IEEE: WCSP)</conf-name>, <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</citation></ref>
<ref id="ref33"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loshchilov</surname> <given-names>I.</given-names></name></person-group> (<year>2017</year>). <article-title>Decoupled weight decay regularization</article-title>. <source>arXiv</source> <volume>arXiv</volume>:<fpage>1711.05101</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1711.05101</pub-id></citation></ref>
<ref id="ref34"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Osorio</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>A probabilistic traffic-theoretic network loading model suitable for large-scale network analysis</article-title>. <source>Transp. Sci.</source> <volume>52</volume>, <fpage>1509</fpage>&#x2013;<lpage>1530</lpage>. doi: <pub-id pub-id-type="doi">10.1287/trsc.2017.0804</pub-id></citation></ref>
<ref id="ref35"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Luo</surname> <given-names>H.</given-names></name> <name><surname>Bie</surname> <given-names>Y.</given-names></name> <name><surname>Jin</surname> <given-names>S.</given-names></name></person-group> (<year>2024</year>). <article-title>Reinforcement learning for traffic signal control in hybrid action space</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>25</volume>, <fpage>5225</fpage>&#x2013;<lpage>5241</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2023.3344585</pub-id></citation></ref>
<ref id="ref36"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>C.</given-names></name> <name><surname>Dai</surname> <given-names>G.</given-names></name> <name><surname>Zhou</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>Short-term traffic flow prediction for urban road sections based on time series analysis and LSTM_BILSTM method</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>23</volume>, <fpage>5615</fpage>&#x2013;<lpage>5624</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2021.3055258</pub-id></citation></ref>
<ref id="ref37"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>C.</given-names></name> <name><surname>Zhao</surname> <given-names>Y.</given-names></name> <name><surname>Dai</surname> <given-names>G.</given-names></name> <name><surname>Xu</surname> <given-names>X.</given-names></name> <name><surname>Wong</surname> <given-names>S. C.</given-names></name></person-group> (<year>2022</year>). <article-title>A novel STFSA-CNN-GRU hybrid model for short-term traffic speed prediction</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>24</volume>, <fpage>3728</fpage>&#x2013;<lpage>3737</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2021.3117835</pub-id></citation></ref>
<ref id="ref38"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mohammadian</surname> <given-names>S.</given-names></name> <name><surname>Zheng</surname> <given-names>Z.</given-names></name> <name><surname>Haque</surname> <given-names>M. M.</given-names></name> <name><surname>Bhaskar</surname> <given-names>A.</given-names></name></person-group> (<year>2023</year>). <article-title>Continuum modeling of freeway traffic flows: state-of-the-art, challenges and future directions in the era of connected and automated vehicles</article-title>. <source>Commun. Transp. Res.</source> <volume>3</volume>:<fpage>100107</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.commtr.2023.100107</pub-id></citation></ref>
<ref id="ref39"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Narmadha</surname> <given-names>S.</given-names></name> <name><surname>Vijayakumar</surname> <given-names>V.</given-names></name></person-group> (<year>2023</year>). <article-title>Spatiotemporal vehicle traffic flow prediction using multivariate CNN and LSTM model</article-title>. <source>Mater. Today Proc.</source> <volume>81</volume>, <fpage>826</fpage>&#x2013;<lpage>833</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.matpr.2021.04.249</pub-id></citation></ref>
<ref id="ref40"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nigam</surname> <given-names>A.</given-names></name> <name><surname>Srivastava</surname> <given-names>S.</given-names></name></person-group> (<year>2023</year>). <article-title>Hybrid deep learning models for traffic stream variables prediction during rainfall</article-title>. <source>Multimodal Transp.</source> <volume>2</volume>:<fpage>100052</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.multra.2022.100052</pub-id></citation></ref>
<ref id="ref41"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Okutani</surname> <given-names>I.</given-names></name> <name><surname>Stephanedes</surname> <given-names>Y. J.</given-names></name></person-group> (<year>1984</year>). <article-title>Dynamic prediction of traffic volume through Kalman filtering theory</article-title>. <source>Transp. Res. B Methodol.</source> <volume>18</volume>, <fpage>1</fpage>&#x2013;<lpage>11</lpage>. doi: <pub-id pub-id-type="doi">10.1016/0191-2615(84)90002-X</pub-id></citation></ref>
<ref id="ref42"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oliveira</surname> <given-names>D. D.</given-names></name> <name><surname>Rampinelli</surname> <given-names>M.</given-names></name> <name><surname>Tozatto</surname> <given-names>G. Z.</given-names></name> <name><surname>Andre&#x00E3;o</surname> <given-names>R. V.</given-names></name> <name><surname>M&#x00FC;ller</surname> <given-names>S. M.</given-names></name></person-group> (<year>2021</year>). <article-title>Forecasting vehicular traffic flow using MLP and LSTM</article-title>. <source>Neural Comput. &#x0026; Applic.</source> <volume>33</volume>, <fpage>17245</fpage>&#x2013;<lpage>17256</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s00521-021-06315-w</pub-id></citation></ref>
<ref id="ref43"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Omar</surname> <given-names>M.</given-names></name> <name><surname>Yakub</surname> <given-names>F.</given-names></name> <name><surname>Abdullah</surname> <given-names>S. S.</given-names></name> <name><surname>Abd Rahim</surname> <given-names>M. S.</given-names></name> <name><surname>Zuhairi</surname> <given-names>A. H.</given-names></name> <name><surname>Govindan</surname> <given-names>N.</given-names></name></person-group> (<year>2024</year>). <article-title>One-step vs horizon-step training strategies for multi-step traffic flow forecasting with direct particle swarm optimization grid search support vector regression and long short-term memory</article-title>. <source>Expert Syst. Appl.</source> <volume>252</volume>:<fpage>124154</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.eswa.2024.124154</pub-id></citation></ref>
<ref id="ref44"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Parishwad</surname> <given-names>O.</given-names></name> <name><surname>Jiang</surname> <given-names>S.</given-names></name> <name><surname>Gao</surname> <given-names>K.</given-names></name></person-group> (<year>2023</year>). <article-title>Investigating machine learning for simulating urban transport patterns: a comparison with traditional macro-models</article-title>. <source>Multimodal Transp.</source> <volume>2</volume>:<fpage>100085</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.multra.2023.100085</pub-id></citation></ref>
<ref id="ref45"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pascanu</surname> <given-names>R.</given-names></name></person-group> (<year>2013</year>). <article-title>On the difficulty of training recurrent neural networks</article-title>. <source>arXiv</source> <volume>arXiv</volume>:<fpage>1211.5063</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1211.5063</pub-id></citation></ref>
<ref id="ref46"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rong</surname> <given-names>Y.</given-names></name> <name><surname>Xu</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Ding</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Du-bus: a realtime bus waiting time estimation system based on multi-source data</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>23</volume>, <fpage>24524</fpage>&#x2013;<lpage>24539</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2022.3210170</pub-id></citation></ref>
<ref id="ref47"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ruder</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>An overview of gradient descent optimization algorithms</article-title>. <source>arXiv</source>:<fpage>1609.04747</fpage>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1609.04747</pub-id></citation></ref>
<ref id="ref48"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scarselli</surname> <given-names>F.</given-names></name> <name><surname>Gori</surname> <given-names>M.</given-names></name> <name><surname>Tsoi</surname> <given-names>A. C.</given-names></name> <name><surname>Hagenbuchner</surname> <given-names>M.</given-names></name> <name><surname>Monfardini</surname> <given-names>G.</given-names></name></person-group> (<year>2008</year>). <article-title>The graph neural network model</article-title>. <source>IEEE Trans. Neural Netw.</source> <volume>20</volume>, <fpage>61</fpage>&#x2013;<lpage>80</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TNN.2008.2005605</pub-id>, PMID: <pub-id pub-id-type="pmid">19068426</pub-id></citation></ref>
<ref id="ref49"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmidhuber</surname> <given-names>J.</given-names></name> <name><surname>Hochreiter</surname> <given-names>S.</given-names></name></person-group> (<year>1997</year>). <article-title>Long short-term memory</article-title>. <source>Neural Comput.</source> <volume>9</volume>, <fpage>1735</fpage>&#x2013;<lpage>1780</lpage>. doi: <pub-id pub-id-type="doi">10.1162/neco.1997.9.8.1735</pub-id>, PMID: <pub-id pub-id-type="pmid">9377276</pub-id></citation></ref>
<ref id="ref50"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shi</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>N.</given-names></name> <name><surname>Schonfeld</surname> <given-names>P. M.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Short-term metro passenger flow forecasting using ensemble-chaos support vector regression</article-title>. <source>Transportmetrica A Transp. Sci.</source> <volume>16</volume>, <fpage>194</fpage>&#x2013;<lpage>212</lpage>. doi: <pub-id pub-id-type="doi">10.1080/23249935.2019.1692956</pub-id></citation></ref>
<ref id="ref51"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Smola</surname> <given-names>A. J.</given-names></name> <name><surname>Sch&#x00F6;lkopf</surname> <given-names>B.</given-names></name></person-group> (<year>2004</year>). <article-title>A tutorial on support vector regression</article-title>. <source>Stat. Comput.</source> <volume>14</volume>, <fpage>199</fpage>&#x2013;<lpage>222</lpage>. doi: <pub-id pub-id-type="doi">10.1023/B:STCO.0000035301.49549.88</pub-id></citation></ref>
<ref id="ref52"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>G.</given-names></name> <name><surname>Song</surname> <given-names>L.</given-names></name> <name><surname>Yu</surname> <given-names>H.</given-names></name> <name><surname>Chang</surname> <given-names>V.</given-names></name> <name><surname>Du</surname> <given-names>X.</given-names></name> <name><surname>Guizani</surname> <given-names>M.</given-names></name></person-group> (<year>2018a</year>). <article-title>V2V routing in a VANET based on the autoregressive integrated moving average model</article-title>. <source>IEEE Trans. Veh. Technol.</source> <volume>68</volume>, <fpage>908</fpage>&#x2013;<lpage>922</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TVT.2018.2884525</pub-id></citation></ref>
<ref id="ref53"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Liao</surname> <given-names>D.</given-names></name> <name><surname>Yu</surname> <given-names>H.</given-names></name> <name><surname>Du</surname> <given-names>X.</given-names></name> <name><surname>Guizani</surname> <given-names>M.</given-names></name></person-group> (<year>2018b</year>). <article-title>Bus-trajectory-based street-centric routing for message delivery in urban vehicular ad hoc networks</article-title>. <source>IEEE Trans. Veh. Technol.</source> <volume>67</volume>, <fpage>7550</fpage>&#x2013;<lpage>7563</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TVT.2018.2828651</pub-id></citation></ref>
<ref id="ref54"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Tian</surname> <given-names>Y.</given-names></name> <name><surname>Pan</surname> <given-names>L.</given-names></name></person-group> (<year>2015</year>). <article-title>Predicting short-term traffic flow by long short-term memory recurrent neural network</article-title>. In <conf-name>Proceedings of the 2015 IEEE International Conference on Smart City (IEEE: Smart City)</conf-name>, <fpage>153</fpage>&#x2013;<lpage>158</lpage>.</citation></ref>
<ref id="ref55"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Attention is all you need</article-title>. In <conf-name>Proceedings of the Advances in Neural Information Processing Systems</conf-name>, Neural Information Processing Systems, <fpage>4</fpage>&#x2013;<lpage>9</lpage>.</citation></ref>
<ref id="ref56"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Q.</given-names></name> <name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Song</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Xu</surname> <given-names>W.</given-names></name></person-group> (<year>2024</year>). <article-title>Fusing visual quantified features for heterogeneous traffic flow prediction</article-title>. <source>Promet-Traffic Transp.</source> <volume>36</volume>, <fpage>1068</fpage>&#x2013;<lpage>1077</lpage>. doi: <pub-id pub-id-type="doi">10.7307/ptt.v36i6.667</pub-id></citation></ref>
<ref id="ref57"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Ruan</surname> <given-names>S.</given-names></name> <name><surname>Huang</surname> <given-names>T.</given-names></name> <name><surname>Zhou</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>A lightweight multi-layer perceptron for efficient multivariate time series forecasting</article-title>. <source>Knowl.-Based Syst.</source> <volume>288</volume>:<fpage>111463</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.knosys.2024.111463</pub-id></citation></ref>
<ref id="ref58"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Shi</surname> <given-names>Q.</given-names></name></person-group> (<year>2013</year>). <article-title>Short-term traffic speed forecasting hybrid model based on chaos&#x2013;wavelet analysis-support vector machine theory</article-title>. <source>Transp. Res. Part C Emerg. Technol.</source> <volume>27</volume>, <fpage>219</fpage>&#x2013;<lpage>232</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.trc.2012.08.004</pub-id></citation></ref>
<ref id="ref59"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Susanto</surname> <given-names>C.</given-names></name></person-group> (<year>2023</year>). <article-title>Traffic flow prediction with heterogenous data using a hybrid CNN-LSTM model</article-title>. <source>Comput. Mater. Contin.</source> <volume>76</volume>, <fpage>3097</fpage>&#x2013;<lpage>3112</lpage>. doi: <pub-id pub-id-type="doi">10.32604/cmc.2023.040914</pub-id></citation></ref>
<ref id="ref60"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name></person-group> (<year>2023</year>). <article-title>Optimal charging pile configuration and charging scheduling for electric bus routes considering the impact of ambient temperature on charging power</article-title>. <source>Sustainability</source> <volume>15</volume>:<fpage>7375</fpage>. doi: <pub-id pub-id-type="doi">10.3390/su15097375</pub-id></citation></ref>
<ref id="ref61"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>F.</given-names></name> <name><surname>Xin</surname> <given-names>X.</given-names></name> <name><surname>Lei</surname> <given-names>Z.</given-names></name> <name><surname>Zhang</surname> <given-names>Q.</given-names></name> <name><surname>Yao</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2024</year>). <article-title>Transformer-based spatio-temporal traffic prediction for access and metro networks</article-title>. <source>J. Lightwave Technol.</source> <volume>42</volume>, <fpage>5204</fpage>&#x2013;<lpage>5213</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JLT.2024.3393709</pub-id></citation></ref>
<ref id="ref62"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Zhao</surname> <given-names>C.</given-names></name> <name><surname>Liu</surname> <given-names>Z.</given-names></name></person-group> (<year>2024</year>). <article-title>Can historical accident data improve sustainable urban traffic safety? A predictive modeling study</article-title>. <source>Sustainability</source> <volume>16</volume>:<fpage>9642</fpage>. doi: <pub-id pub-id-type="doi">10.3390/su16229642</pub-id></citation></ref>
<ref id="ref63"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yan</surname> <given-names>H.</given-names></name> <name><surname>Ma</surname> <given-names>X.</given-names></name> <name><surname>Pu</surname> <given-names>Z.</given-names></name></person-group> (<year>2021</year>). <article-title>Learning dynamic and hierarchical traffic spatiotemporal features with transformer</article-title>. <source>IEEE Trans. Intell. Transp. Syst.</source> <volume>23</volume>, <fpage>22386</fpage>&#x2013;<lpage>22399</lpage>. doi: <pub-id pub-id-type="doi">10.1109/TITS.2021.3102983</pub-id>, PMID: <pub-id pub-id-type="pmid">39573497</pub-id></citation></ref>
<ref id="ref64"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Lin</surname> <given-names>J.</given-names></name> <name><surname>Zheng</surname> <given-names>Y.</given-names></name></person-group> (<year>2022</year>). <article-title>Short-time traffic forecasting in tourist service areas based on a CNN and GRU neural network</article-title>. <source>Appl. Sci.</source> <volume>12</volume>:<fpage>9114</fpage>. doi: <pub-id pub-id-type="doi">10.3390/app12189114</pub-id></citation></ref>
<ref id="ref65"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>A new way of airline traffic forecasting based on GCN-LSTM</article-title>. <source>Front. Neurorobot.</source> <volume>15</volume>:<fpage>661037</fpage>. doi: <pub-id pub-id-type="doi">10.3389/fnbot.2021.661037</pub-id>, PMID: <pub-id pub-id-type="pmid">34955800</pub-id></citation></ref>
<ref id="ref66"><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Shahabi</surname> <given-names>C.</given-names></name> <name><surname>Demiryurek</surname> <given-names>U.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name></person-group> (<year>2017</year>). <article-title>Deep learning: a generic approach for extreme condition traffic forecasting</article-title>. In <conf-name>Proceedings of the 2017 SIAM International Conference on Data Mining</conf-name>, Society for Industrial and Applied Mathematics, <fpage>777</fpage>&#x2013;<lpage>785</lpage>.</citation></ref>
<ref id="ref67"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Cao</surname> <given-names>J.</given-names></name> <name><surname>Tang</surname> <given-names>M.</given-names></name> <name><surname>Guo</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>A multivariate short-term traffic flow forecasting method based on wavelet analysis and seasonal time series</article-title>. <source>Appl. Intell.</source> <volume>48</volume>, <fpage>3827</fpage>&#x2013;<lpage>3838</lpage>. doi: <pub-id pub-id-type="doi">10.1007/s10489-018-1181-7</pub-id></citation></ref>
<ref id="ref68"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Yang</surname> <given-names>G.</given-names></name> <name><surname>Yu</surname> <given-names>H.</given-names></name> <name><surname>Zheng</surname> <given-names>Z.</given-names></name></person-group> (<year>2023</year>). <article-title>Kalman filter-based CNN-BiLSTM-ATT model for traffic flow prediction</article-title>. <source>Comput. Mater. Contin.</source> <volume>76</volume>, <fpage>1047</fpage>&#x2013;<lpage>1063</lpage>. doi: <pub-id pub-id-type="doi">10.32604/cmc.2023.039274</pub-id></citation></ref>
<ref id="ref69"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>Z.</given-names></name> <name><surname>Chen</surname> <given-names>W.</given-names></name> <name><surname>Wu</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>P. C.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>LSTM network: a deep learning approach for short-term traffic forecast</article-title>. <source>IET Intell. Transp. Syst.</source> <volume>11</volume>, <fpage>68</fpage>&#x2013;<lpage>75</lpage>. doi: <pub-id pub-id-type="doi">10.1049/iet-its.2016.0208</pub-id></citation></ref>
<ref id="ref70"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>W.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Fu</surname> <given-names>T.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name> <name><surname>Shangguan</surname> <given-names>Q.</given-names></name></person-group> (<year>2021</year>). <article-title>Dynamic prediction of traffic incident duration on urban expressways: a deep learning approach based on LSTM and MLP</article-title>. <source>J. Intell. Connect. Veh.</source> <volume>4</volume>, <fpage>80</fpage>&#x2013;<lpage>91</lpage>. doi: <pub-id pub-id-type="doi">10.1108/JICV-03-2021-0004</pub-id></citation></ref>
<ref id="ref71"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zoican</surname> <given-names>S.</given-names></name> <name><surname>Zoican</surname> <given-names>R.</given-names></name> <name><surname>Galatchi</surname> <given-names>D.</given-names></name> <name><surname>Vochin</surname> <given-names>M.</given-names></name></person-group> (<year>2024</year>). <article-title>Graph-based neural networks&#x2019; framework using microcontrollers for energy-efficient traffic forecasting</article-title>. <source>Appl. Sci.</source> <volume>14</volume>:<fpage>412</fpage>. doi: <pub-id pub-id-type="doi">10.3390/app14010412</pub-id></citation></ref>
</ref-list>
</back>
</article>