<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2023.1269105</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Res-FLNet: human-robot interaction and collaboration for multi-modal sensing robot autonomous driving tasks based on learning control algorithm</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Wang</surname> <given-names>Shulei</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2392569/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/funding-acquisition/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>School of Automotive Engineering, Changzhou Institute of Technology, Changzhou</institution>, <addr-line>Jiangsu</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Jing Luo, Wuhan University of Technology, China</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Jiehao Li, South China Agricultural University, China; Xinxing Chen, Southern University of Science and Technology, China; Jie Li, Chongqing Technology and Business University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Shulei Wang <email>wangshulei&#x00040;czust.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>02</day>
<month>10</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>17</volume>
<elocation-id>1269105</elocation-id>
<history>
<date date-type="received">
<day>29</day>
<month>07</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>09</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Wang.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Wang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract>
<sec>
<title>Introduction</title>
<p>Res-FLNet presents a cutting-edge solution for addressing autonomous driving tasks in the context of multimodal sensing robots while ensuring privacy protection through Federated Learning (FL). The rapid advancement of autonomous vehicles and robotics has escalated the need for efficient and safe navigation algorithms that also support Human-Robot Interaction and Collaboration. However, the integration of data from diverse sensors like cameras, LiDARs, and radars raises concerns about privacy and data security.</p></sec>
<sec>
<title>Methods</title>
<p>In this paper, we introduce Res-FLNet, which harnesses the power of ResNet-50 and LSTM models to achieve robust and privacy-preserving autonomous driving. The ResNet-50 model effectively extracts features from visual input, while LSTM captures sequential dependencies in the multimodal data, enabling more sophisticated learning control algorithms. To tackle privacy issues, we employ Federated Learning, enabling model training to be conducted locally on individual robots without sharing raw data. By aggregating model updates from different robots, the central server learns from collective knowledge while preserving data privacy. Res-FLNet can also facilitate Human-Robot Interaction and Collaboration as it allows robots to share knowledge while preserving privacy.</p></sec>
<sec>
<title>Results and discussion</title>
<p>Our experiments demonstrate the efficacy and privacy preservation of Res-FLNet across four widely-used autonomous driving datasets: KITTI, Waymo Open Dataset, ApolloScape, and BDD100K. Res-FLNet outperforms state-of-the-art methods in terms of accuracy, robustness, and privacy preservation. Moreover, it exhibits promising adaptability and generalization across various autonomous driving scenarios, showcasing its potential for multi-modal sensing robots in complex and dynamic environments.</p></sec></abstract>
<kwd-group>
<kwd>human-robot interaction and collaboration</kwd>
<kwd>multi-modal sensing robot</kwd>
<kwd>learning control algorithm</kwd>
<kwd>data-driven robotics</kwd>
<kwd>autonomous vehicles</kwd>
</kwd-group>
<counts>
<fig-count count="9"/>
<table-count count="5"/>
<equation-count count="5"/>
<ref-count count="43"/>
<page-count count="16"/>
<word-count count="8613"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value></meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>With the rapid advancement of artificial intelligence and robotics, autonomous systems have witnessed remarkable progress, especially in the domain of autonomous driving. Autonomous vehicles equipped with a variety of sensors, such as cameras, lidar, radar, and GPS, have the potential to revolutionize transportation, making it safer, more efficient, and environmentally friendly. However, achieving full autonomy in complex real-world scenarios remains a challenge due to the need for robust perception, decision-making, and control in dynamic and unpredictable environments. The significance of autonomous driving technology lies in its potential to reduce human errors and accidents, improve traffic flow, and provide mobility solutions for individuals with limited mobility. It also has the potential to significantly impact various industries, including transportation, logistics, and urban planning. To realize the vision of safe and efficient autonomous driving, researchers and engineers have explored various machine learning and robotics models. Five noteworthy models in this domain are:</p>
<p>Convolutional neural networks (CNNs): CNNs have garnered significant attention for their exceptional performance in image recognition tasks. Their ability to automatically learn hierarchical features from raw pixel data makes them highly suitable for processing visual information captured by cameras in autonomous vehicles (He and Ye, <xref ref-type="bibr" rid="B9">2022</xref>). CNNs excel in tasks like object detection, lane detection, and scene understanding, providing crucial inputs for safe navigation.</p>
<p>Long short-term memory (LSTM) networks: LSTM is a type of recurrent neural network known for its capability to handle sequential data with temporal dependencies. In the context of autonomous driving, sensors like lidar and radar provide data streams with temporal characteristics, making LSTM an ideal choice for processing such information. These networks effectively capture the dynamics of moving objects and help predict future trajectories, enabling safer decision-making in complex driving scenarios.</p>
<p>Deep reinforcement learning (DRL): DRL algorithms have gained popularity due to their ability to learn decision-making policies through interactions with the environment. In the context of autonomous driving, DRL empowers vehicles to navigate challenging road conditions by learning from experience. By combining perception data with an agent&#x00027;s actions, DRL enables real-time control and continuous improvement, making it promising for handling uncertain and dynamic environments.</p>
<p>Probabilistic models: Probabilistic models, including Bayesian networks and Gaussian processes, have found applications in autonomous driving systems for uncertainty estimation and risk assessment. In safety-critical situations, it is crucial to account for uncertainty in sensor measurements and predictions. Probabilistic models offer a principled way to quantify uncertainty, aiding autonomous vehicles in making safe decisions and avoiding potential hazards.</p>
<p>Transformer networks: Transformers have revolutionized natural language processing and recently extended their success to computer vision tasks. With a self-attention mechanism, transformers can effectively fuse information and understand context across different modalities. In autonomous driving systems, this feature enables seamless integration of multimodal data from various sensors like cameras, lidars, and radars (Ning et al., <xref ref-type="bibr" rid="B27">2023</xref>). Transformers enhance the ability to perceive the environment accurately, leading to improved decision-making and overall performance.</p>
<p>In this paper, we propose a novel approach for autonomous driving tasks, named Res-FLNet, which leverages a combination of ResNet-50 and LSTM models. The ResNet-50 component efficiently processes visual data from cameras, extracting high-level features for object recognition. Meanwhile, the LSTM component handles sequential data like lidar and radar inputs, capturing temporal dependencies for accurate prediction. Our method&#x00027;s key innovation lies in adopting Federated Learning (FL) to preserve privacy while enabling collaborative model training across multiple stakeholders. FL allows participants to train models locally on their datasets without sharing raw data, addressing privacy concerns and fostering cooperation in the development of autonomous driving systems.</p>
<p>The three main contributions of this paper are as follows:</p>
<list list-type="order">
<list-item><p>Res-FLNet: This paper proposes a novel multimodal robot system, called Res-FLNet, which addresses the challenges of autonomous driving tasks. Res-FLNet combines the power of two state-of-the-art models, ResNet-50 and LSTM, and integrates them using Federated Learning (FL) techniques. By doing so, our approach harnesses the strengths of each model to create a unified and efficient system capable of handling multimodal data and complex driving scenarios. The integration of ResNet-50 and LSTM ensures robust perception and decision-making capabilities, essential for autonomous vehicles to navigate safely and effectively.</p></list-item>
<list-item><p>Privacy protection: A key concern in developing autonomous driving systems is the privacy of sensitive data. To tackle this issue, Res-FLNet incorporates privacy-preserving mechanisms through Federated Learning. By employing FL, Res-FLNet allows model training to occur locally on individual data sources (e.g., vehicles or edge devices) without sharing raw data centrally. This decentralized approach ensures that sensitive information remains secure and private, thereby fostering collaboration among various parties without compromising data privacy. As a result, Res-FLNet promotes trust and cooperation among stakeholders, a critical aspect in the deployment of autonomous driving technologies.</p></list-item>
<list-item><p>Comprehensive evaluation: The efficacy of Res-FLNet is extensively evaluated on multiple benchmark datasets, including KITTI, Waymo Open Dataset, ApolloScape, and BDD100K. Through rigorous evaluation in diverse real-world driving scenarios, Res-FLNet demonstrates its capability to handle various challenges faced by autonomous vehicles. The evaluation encompasses tasks such as object detection, lane detection, scene understanding, and trajectory prediction, showcasing the versatility and effectiveness of the proposed system. The experimental results validate that Res-FLNet achieves superior performance compared to individual models, thus affirming its practical value and potential for real-world deployment.</p></list-item>
</list>
<p>Res-FLNet utilizes the ResNet-50 model to effectively extract features from visual inputs, enabling the system to accurately perceive its environment. Furthermore, the integration of LSTM networks enables Res-FLNet to capture temporal dependencies in sequential multimodal data. A comprehensive understanding of dynamic driving scenarios contributes to making informed decisions and enhances the robot&#x00027;s navigational capabilities in complex environments. To address privacy concerns associated with data sharing, Res-FLNet adopts Federated Learning (FL) technology. FL allows model training to occur locally on individual robots without the need to share raw data. Model updates are then aggregated on a central server, which learns from collective knowledge while preserving the privacy of sensitive data. The proposed Res-FLNet architecture not only ensures privacy protection but also facilitates human-robot interaction and collaboration. Robots can share knowledge with each other without compromising sensitive data, enabling collaborative learning and improving overall performance. To evaluate the efficacy and privacy-preserving capabilities of Res-FLNet, we conducted extensive experiments on widely used autonomous driving datasets, including KITTI, Waymo Open Dataset, ApolloScape, and BDD100K. The results demonstrate that Res-FLNet outperforms state-of-the-art methods in terms of accuracy, robustness, and privacy protection. Additionally, the system exhibits excellent adaptability and generalization across various autonomous driving scenarios, highlighting its potential in real-world applications.</p>
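<p>As an illustration of the aggregation step described above, the following minimal Python sketch averages locally trained parameters, weighting each robot by its number of local samples in the style of federated averaging. The function name, the flat parameter lists, and the three-robot example are assumptions made for illustration only; the aggregation rule actually used by Res-FLNet is not specified at this point in the paper.</p>
<preformat><![CDATA[
```python
# Sketch of a FedAvg-style server update (illustrative assumption, not the
# exact rule used by Res-FLNet). Only parameters leave each robot; raw
# sensor data stays local.

def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Three hypothetical robots, each holding a locally trained 2-parameter model.
robot_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
robot_samples = [100, 100, 200]

global_weights = federated_average(robot_weights, robot_samples)
print(global_weights)  # [3.5, 4.5]
```
]]></preformat>
<p>In this sketch the third robot contributes twice the weight of the others because it holds twice as many samples, so the aggregated parameters are pulled toward its local model.</p>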
<p>The subsequent sections of this paper present a detailed description of the Res-FLNet architecture, the FL-based training process, experimental results, a comparative analysis with other state-of-the-art models, and discussions on the potential impact of our approach on the field of autonomous driving. By combining privacy protection and advanced multimodal integration, Res-FLNet represents a significant step toward the development of safer, more efficient, and privacy-conscious autonomous driving systems.</p></sec>
<sec id="s2">
<title>2. Related work</title>
<sec>
<title>2.1. Multi-modal autonomous driving</title>
<p>Recent studies on multi-modal methods for end-to-end driving have shown that complementing RGB images with depth and semantics can improve driving performance. Xiao et al. (<xref ref-type="bibr" rid="B36">2020</xref>) explored the use of RGBD input through early, mid, and late fusion of camera and depth modalities, observing significant gains. Zhou et al. (<xref ref-type="bibr" rid="B43">2019</xref>) and Behl et al. (<xref ref-type="bibr" rid="B3">2020</xref>) demonstrated the effectiveness of semantics and depth as explicit intermediate representations for driving. In this work, we focus on image and LiDAR inputs since they are complementary in representing the scene and are readily available in autonomous driving systems. In this respect, Sobh et al. (<xref ref-type="bibr" rid="B33">2018</xref>) exploited a late fusion architecture for LiDAR and image modalities, where each input was encoded in a separate stream and then concatenated together. However, we observed that this fusion mechanism suffers from high infraction rates in complex urban scenarios due to its inability to account for the behavior of multiple dynamic agents. Therefore, we propose a novel Multi-Modal Fusion Transformer that effectively integrates information from different modalities at multiple stages during feature encoding, thus improving upon the limitations of the late fusion approach. Multi-view methods (Ku et al., <xref ref-type="bibr" rid="B18">2018</xref>) propose to fuse inputs from different modalities into the same dimension. Furthermore, frustum-based models (Zhang et al., <xref ref-type="bibr" rid="B40">2021b</xref>) provide a novel approach to combining heterogeneous features. Further, feature-wise fusion has received attention in multi-modal tasks, which has started a trend of feature-wise methods in multi-modal 3D object detection. 
Several methods (Liang et al., <xref ref-type="bibr" rid="B20">2022</xref>) propose to transform heterogeneous modalities into a unified representation, which can narrow the heterogeneity gap in a joint semantic subspace. Because features of different dimensions introduce considerable additional noise and time consumption (Ning et al., <xref ref-type="bibr" rid="B28">2022</xref>), it is not easy to leverage heterogeneous information with only a single model. However, numerous multi-modal methods are sophisticated for sundry variants. Therefore, we conduct a comprehensive survey of multi-modal 3D object detection. We hope such a systematic discussion on these recent advances could inspire fascinating future research (Huang et al., <xref ref-type="bibr" rid="B11">2022</xref>). In addition, recent research on collaborative control (Liu et al., <xref ref-type="bibr" rid="B22">2023</xref>) and multi-agent environment perception (Hu et al., <xref ref-type="bibr" rid="B10">2022</xref>) is revolutionizing future transportation systems. Both likewise require multimodal perception as a foundation.</p></sec>
<sec>
<title>2.2. Multi-agent trajectory modeling</title>
<p>Trajectory prediction is essential for automated driving (Elnagar, <xref ref-type="bibr" rid="B7">2001</xref>; Zernetsch et al., <xref ref-type="bibr" rid="B38">2016</xref>). Modeling the interaction with the environment and between the participants improves the prediction quality (Kitani et al., <xref ref-type="bibr" rid="B16">2012</xref>; Kooij et al., <xref ref-type="bibr" rid="B17">2014</xref>). The idea of information exchange across agents is actively studied in the literature (Sadeghian et al., <xref ref-type="bibr" rid="B31">2019</xref>). For example, Alahi et al. (<xref ref-type="bibr" rid="B1">2016</xref>) introduced the social-pooling layer into LSTMs to incorporate interaction features between agents. Recently, graph neural networks (GNN) have outperformed traditional sequential models on trajectory prediction benchmarks (Ivanovic and Pavone, <xref ref-type="bibr" rid="B12">2019</xref>). GNNs explicitly model the agents as nodes and their connection as edges to represent the social interaction graph. Similarly, the social spatio-temporal graph convolution neural network (ST-GCNN) (Morais et al., <xref ref-type="bibr" rid="B24">2019</xref>) extracts spatial and temporal dependencies between agents. Also, we use a related architecture to design our spatio-temporal graph auto-encoder for learning the normal data representation.</p>
<p>Social LSTM (Alahi et al., <xref ref-type="bibr" rid="B1">2016</xref>) models the trajectories of individual agents with separate LSTM networks and aggregates their hidden states to model interactions. CL-SGR (Wu et al., <xref ref-type="bibr" rid="B35">2022</xref>) considers the sample replay model in a continuous trajectory prediction setting to avoid catastrophic forgetting. Another branch (Girgis et al., <xref ref-type="bibr" rid="B8">2021</xref>) models the interaction among agents with the attention mechanism. These methods build on the Transformer (Vaswani et al., <xref ref-type="bibr" rid="B34">2017</xref>), which has achieved great success in natural language processing (Vaswani et al., <xref ref-type="bibr" rid="B34">2017</xref>) and computer vision (Zhai et al., <xref ref-type="bibr" rid="B39">2023</xref>). Scene Transformer (Ngiam et al., <xref ref-type="bibr" rid="B26">2021</xref>) mainly consists of attention layers, including self-attention layers that encode sequential features on the temporal dimension, self-attention layers that capture interactions on the social dimension between traffic participants, and cross-attention layers that learn compliance with traffic rules.</p></sec>
<sec>
<title>2.3. Federated learning</title>
<p>Federated learning (FL) has emerged as a prominent research topic in recent years, attracting significant attention from the research community. FL approaches have been proposed and applied in diverse domains, including finance (Shingi, <xref ref-type="bibr" rid="B32">2020</xref>), healthcare (Xu et al., <xref ref-type="bibr" rid="B37">2021</xref>), and medical image analysis (Courtiol et al., <xref ref-type="bibr" rid="B4">2019</xref>). For training FL models, the cross-silo approach has gained popularity due to its effective utilization of distributed computing resources (Marfoq et al., <xref ref-type="bibr" rid="B23">2020</xref>). To address the challenges of FL, several frameworks and algorithms have been introduced. For instance, a decentralized federated learning framework called &#x0201C;Decentralized Federated Learning via Mutual Knowledge Transfer&#x0201D; was proposed by Li et al. (<xref ref-type="bibr" rid="B19">2021</xref>). This framework enables collaborative learning among multiple devices or clients while preserving data privacy and security.</p>
<p>In the domain of cloud robotics, Liu et al. (<xref ref-type="bibr" rid="B21">2019</xref>) presented a knowledge fusion algorithm for FL in their work. Their approach focuses on aggregating knowledge from distributed robotic systems, allowing them to collaboratively learn and improve their performance. In the field of autonomous driving, researchers have also explored the application of FL techniques. Zhang et al. (<xref ref-type="bibr" rid="B42">2021a</xref>) developed a real-time end-to-end FL approach with an asynchronous model aggregation mechanism specifically tailored for autonomous driving tasks. By leveraging FL, their method enables continuous learning and adaptation in dynamic driving scenarios.</p>
<p>FL has also been employed for specific tasks within autonomous driving. For example, FL was utilized for predicting turning signals in Doomra et al. (<xref ref-type="bibr" rid="B6">2020</xref>), showcasing its potential in enhancing driver assistance systems. Additionally, the integration of FL into 6G-enabled autonomous cars was investigated in Khan et al. (<xref ref-type="bibr" rid="B13">2022</xref>), highlighting the role of FL in next-generation intelligent transportation systems.</p>
<p>Furthermore, adaptive FL frameworks have been proposed to cater to the unique requirements of autonomous vehicles. Peng et al. (<xref ref-type="bibr" rid="B29">2021</xref>) introduced an adaptive FL framework for autonomous vehicles, taking into account dynamic network conditions and resource constraints. Similarly, in Zhang et al. (<xref ref-type="bibr" rid="B41">2021c</xref>), the authors addressed the problem of distributed dynamic map fusion using FL techniques to facilitate collaboration among intelligent networked vehicles.</p></sec></sec>
<sec id="s3">
<title>3. Method</title>
<p>Res-FLNet is a framework designed to address the challenges of autonomous driving tasks in multimodal robots by integrating ResNet-50 and LSTM models, while ensuring privacy protection through Federated Learning. The method consists of several key components, including data preprocessing, feature extraction, multimodal fusion, and autonomous driving decision-making. In this section, we describe in detail the three main techniques used in this study: ResNet-50, LSTM, and Federated Learning. The overall workflow of our approach is illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Workflow of Res-FLNet.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0001.tif"/>
</fig>
<p>The pseudocode outlines the framework for training autonomous driving networks using a combination of deep learning models and data-driven robotics. The goal of our approach is to achieve accurate and efficient perception and control in autonomous vehicles. Our framework leverages the KITTI dataset, Waymo Open Dataset, ApolloScape dataset, and BDD100K dataset as the training data sources. The training process begins by initializing the ResNet-50 model, LSTM model, Attention-based Fusion model, privacy protection mechanism, and data-driven robotics system. The ResNet-50 model is used to extract high-level visual features from input images, while the LSTM model captures temporal dependencies in the extracted features. The Attention-based Fusion model combines the multimodal information from ResNet-50 and LSTM outputs. To ensure privacy protection, we apply a privacy protection mechanism to the fused data, safeguarding sensitive information. Additionally, our data-driven robotics system enables end-to-end training of the network, optimizing the network weights based on the desired objectives.</p>
<p>During each training epoch, batches of multimodal inputs are retrieved from the datasets. Preprocessing and data augmentation techniques are applied to enhance the diversity of the training data. The forward pass involves extracting features using ResNet-50, applying LSTM to capture temporal dependencies, and fusing the information using attention-based fusion. The resulting fused data is then processed by the privacy protection mechanism and utilized by the data-driven robotics system to determine the optimal control parameters. The loss function is calculated based on the desired objectives, and the backward pass updates the network weights using gradient descent. This iterative process continues until the desired performance is achieved.</p>
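<p>The fusion step of the forward pass can be sketched as follows. This is a minimal illustration that assumes a simple dot-product attention over two modality feature vectors of equal length; the helper names, the toy feature values, and the query vector are hypothetical and are not taken from the Res-FLNet implementation.</p>
<preformat><![CDATA[
```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features, query):
    """Fuse same-length modality features with softmax attention weights."""
    # One scalar relevance score per modality: dot product with the query.
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(query)
    # Weighted sum of the modality features, computed per dimension.
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights

visual = [1.0, 0.0]    # stand-in for a ResNet-50 feature vector
temporal = [0.0, 1.0]  # stand-in for an LSTM hidden state
query = [1.0, 1.0]     # hypothetical learned query; gives equal scores here

fused, weights = attention_fuse([visual, temporal], query)
print(weights)  # [0.5, 0.5]
print(fused)    # [0.5, 0.5]
```
]]></preformat>
<p>In a trained model the query (or an equivalent scoring network) would be learned, so that the weights shift toward whichever modality is more informative for the current input.</p>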
<p>Following the training phase, the trained model is evaluated on validation data. Evaluation metrics such as EPE3D (m) for 3D error, Acc5 (%) and Acc10 (%) for accuracy within top-k predictions, &#x003B8; (rad) for rotation angle, 3D mAP (%) for 3D mean average precision, and 2D mAP for 2D mean average precision are calculated to assess the performance of the trained network.</p>
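<p>To make the first of these metrics concrete, the sketch below computes EPE3D under the assumption that it is used in its usual sense, namely the mean Euclidean distance between predicted and ground-truth 3D vectors, expressed in meters. The function name and the sample points are illustrative only.</p>
<preformat><![CDATA[
```python
import math

def epe3d(predicted, ground_truth):
    """Mean Euclidean distance between predicted and true 3D points (meters)."""
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)

# Two hypothetical points: the first is off by 5 m, the second is exact.
pred = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]
truth = [(3.0, 4.0, 0.0), (1.0, 2.0, 2.0)]
print(epe3d(pred, truth))  # 2.5
```
]]></preformat>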
<sec>
<title>3.1. ResNet-50</title>
<p>ResNet-50 is a deep convolutional neural network architecture that plays a fundamental role in extracting image features in the proposed approach. This architecture has been widely adopted due to its effectiveness in training very deep networks by addressing the challenge of vanishing gradients. ResNet-50 introduces skip connections, also referred to as residual connections, which enable the direct flow of gradients through shorter paths, bypassing certain layers. This design choice allows for the training of extremely deep networks and facilitates the capture of intricate hierarchical features necessary for understanding complex driving environments and accurately identifying objects.</p>
<p>The forward pass operation of ResNet-50 can be succinctly described as follows:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>F</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mtext>ResNet50</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>I</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, <bold>F</bold><sub><italic>t</italic></sub> represents the extracted image features at time <italic>t</italic>, while <bold>I</bold><sub><italic>t</italic></sub> denotes the input image at that specific time step. By passing the input image through a series of convolutional layers with residual connections, ResNet-50 generates a comprehensive representation of image features. This representation encompasses both low-level and high-level visual information that is crucial for autonomous driving tasks.</p>
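<p>The role of the residual connection can be illustrated with a small sketch. It is a simplification that assumes two plain linear layers in place of the convolutional bottleneck layers of ResNet-50 and omits batch normalization; all function names and weight values are hypothetical.</p>
<preformat><![CDATA[
```python
def linear(x, weight):
    """Matrix-vector product; weight is given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is linear -> ReLU -> linear."""
    fx = linear(relu(linear(x, w1)), w2)
    # Skip connection: the input is added back to the transformed signal.
    return relu([f + v for f, v in zip(fx, x)])

x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block(x, zero, zero))  # [1.0, 2.0]
```
]]></preformat>
<p>Even when the learned transformation contributes nothing (all-zero weights in this example), the input passes through unchanged, which is the short path that lets gradients reach earlier layers in very deep networks.</p>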
<p>A visual representation of the ResNet-50 model can be observed in <xref ref-type="fig" rid="F2">Figure 2</xref>. This diagram provides an overview of the network structure and the connectivity between layers, illustrating how the skip connections allow for efficient gradient flow and improved training of deep networks.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Visual representation of the ResNet-50 model. The model incorporates skip connections to enable efficient gradient flow and facilitate the capture of intricate hierarchical features in complex driving environments.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0002.tif"/>
</fig></sec>
<sec>
<title>3.2. LSTM</title>
<p>Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture commonly utilized for sequential data representation, specifically in capturing temporal dependencies present in time-series data such as lidar and radar measurements. LSTM employs memory cells with input, output, and forget gates, enabling the effective capture of long-term dependencies and preservation of temporal information. This makes LSTM highly suitable for modeling dynamic driving scenarios.</p>
<p>The LSTM computation can be explained as follows:</p>
<p>At each time step <italic>t</italic>:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mtext>LSTM</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>c</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>o</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mtext>OutputLayer</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>h</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, <bold>x</bold><sub><italic>t</italic></sub> represents the input at time <italic>t</italic>, <bold>h</bold><sub><italic>t</italic></sub> and <bold>c</bold><sub><italic>t</italic></sub> denote the hidden state and cell state at time <italic>t</italic>, respectively, and <bold>o</bold><sub><italic>t</italic></sub> is the output of the LSTM at time <italic>t</italic>. The LSTM updates the hidden state and cell state from the current input <bold>x</bold><sub><italic>t</italic></sub>, the previous hidden state <bold>h</bold><sub><italic>t</italic>&#x02212;1</sub>, and the previous cell state <bold>c</bold><sub><italic>t</italic>&#x02212;1</sub>. The updated hidden state <bold>h</bold><sub><italic>t</italic></sub> is then passed to an output layer to generate the desired output <bold>o</bold><sub><italic>t</italic></sub>.</p>
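<p>The recurrence in Equations (2) and (3) can be illustrated with a minimal sketch in plain Python. The gate equations below are the standard LSTM formulation, which this article does not spell out, and the output layer is assumed to be a linear read-out; the names <italic>lstm_step</italic>, <italic>output_layer</italic>, <italic>run_lstm</italic>, and <italic>constant_gate</italic>, as well as all weight values, are hypothetical and chosen only for illustration.</p>
<preformat><![CDATA[
```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_j
        for W_row, U_row, b_j in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step (Equation 2): maps (x_t, h_{t-1}, c_{t-1}) to (h_t, c_t).

    params maps each gate name to its (W, U, b) triple (assumed layout).
    """
    pre = {name: affine(W, x_t, U, h_prev, b) for name, (W, U, b) in params.items()}
    i = [sigmoid(z) for z in pre["input"]]        # input gate
    f = [sigmoid(z) for z in pre["forget"]]       # forget gate
    o = [sigmoid(z) for z in pre["output"]]       # output gate
    g = [math.tanh(z) for z in pre["candidate"]]  # candidate cell update
    c_t = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    h_t = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c_t)]
    return h_t, c_t


def output_layer(h_t, V, d):
    """Assumed linear read-out for Equation 3: o_t = V h_t + d."""
    return [sum(v * h for v, h in zip(V_row, h_t)) + d_k for V_row, d_k in zip(V, d)]


def run_lstm(xs, params, V, d, hidden_size):
    """Unroll the LSTM over a sequence, starting from zero states."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        outputs.append(output_layer(h, V, d))
    return outputs, h, c


def constant_gate(hidden_size, input_size, w, u, bias):
    """Hypothetical helper: gate parameters filled with constant values."""
    W = [[w] * input_size for _ in range(hidden_size)]
    U = [[u] * hidden_size for _ in range(hidden_size)]
    b = [bias] * hidden_size
    return W, U, b


params = {
    "input": constant_gate(2, 3, 0.1, 0.1, 0.0),
    "forget": constant_gate(2, 3, 0.1, 0.1, 1.0),
    "output": constant_gate(2, 3, 0.1, 0.1, 0.0),
    "candidate": constant_gate(2, 3, 0.2, 0.1, 0.0),
}
V = [[1.0, -1.0]]  # one output unit reading two hidden units
d = [0.0]
xs = [[0.5, 0.1, -0.2], [0.4, 0.0, 0.3], [0.1, 0.2, 0.2]]  # toy sequence, T = 3
outputs, h_T, c_T = run_lstm(xs, params, V, d, hidden_size=2)
```
]]></preformat>
<p>In practice the same recurrence would be provided by a deep learning framework rather than written by hand; the sketch only makes explicit how the cell state carries information from one time step to the next.</p>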
<p>By incorporating the LSTM model into Res-FLNet, the proposed framework effectively captures the temporal dependencies present in sequential data. This enables a comprehensive understanding of dynamic driving scenarios and facilitates informed decision-making in autonomous driving tasks. The model architecture is illustrated in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Architecture of Res-FLNet incorporating LSTM for capturing temporal dependencies in sequential data, allowing comprehensive understanding of dynamic driving scenarios and informed decision-making in autonomous driving tasks.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0003.tif"/>
</fig></sec>
<sec>
<title>3.3. Federated learning</title>
<p>Federated Learning is an integral part of the Res-FLNet framework, ensuring privacy protection during the model training process. This approach involves distributed learning, allowing the model to be trained locally on data collected at edge devices or robots, without the need for centralized data aggregation. By adopting this decentralized training process, sensitive data privacy is preserved while enabling collaborative learning across multiple robots or devices.</p>
<p>The Federated Learning process can be described as follows: At each local device or robot <italic>k</italic>, the model parameters &#x00398;<sub><italic>k</italic></sub> are updated using the local data <italic>D</italic><sub><italic>k</italic></sub> to minimize the local loss function. This is achieved by computing the local gradient <inline-formula><mml:math id="M4"><mml:mo>&#x02207;</mml:mo><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mn>&#x00398;</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and updating the parameters based on a chosen optimization algorithm:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M5"><mml:mrow><mml:msubsup><mml:mn>&#x00398;</mml:mn><mml:mi>k</mml:mi><mml:mo>&#x00027;</mml:mo></mml:msubsup><mml:mtext>&#x02009;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>Update</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mn>&#x00398;</mml:mn><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x02207;</mml:mo><mml:mi>&#x02112;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mn>&#x00398;</mml:mn><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:math></disp-formula>
<p>The updated parameters <inline-formula><mml:math id="M6"><mml:mrow><mml:msubsup><mml:mn>&#x00398;</mml:mn><mml:mi>k</mml:mi><mml:mo>&#x00027;</mml:mo></mml:msubsup></mml:mrow></mml:math></inline-formula> are then transmitted to a central server for aggregation. The server aggregates the updated parameters across all local devices or robots using a federated averaging scheme:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M"><mml:mrow><mml:mn>&#x00398;</mml:mn><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>k</mml:mi></mml:munder><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow><mml:mi>N</mml:mi></mml:mfrac></mml:mrow></mml:mstyle><mml:msubsup><mml:mn>&#x00398;</mml:mn><mml:mi>k</mml:mi><mml:mo>&#x00027;</mml:mo></mml:msubsup></mml:mrow></mml:math></disp-formula>
<p>Here, &#x00398; represents the global model parameters, <italic>N</italic><sub><italic>k</italic></sub> denotes the number of samples on device <italic>k</italic>, and <italic>N</italic> is the total number of samples across all devices. The global model parameters are then broadcast back to each local device or robot for the next round of training. Because only model parameters leave a device, not the raw data, this process enables collaborative learning without exposing individual data sources. By leveraging the knowledge learned by the various local models, Res-FLNet can improve its overall performance and generalization capabilities. <xref ref-type="fig" rid="F4">Figure 4</xref> illustrates the Federated Learning process used in the Res-FLNet framework: each local device or robot updates its model parameters locally and transmits them to a central server, which aggregates them to refine the global model parameters.</p>
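<p>Equations (4) and (5) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the training code of Res-FLNet: the Update rule is taken to be a single step of plain gradient descent, the local model is a two-parameter linear regression, and the names <italic>local_update</italic>, <italic>federated_average</italic>, <italic>mse_grad</italic>, and <italic>federated_training</italic> are hypothetical.</p>
<preformat><![CDATA[
```python
def local_update(theta, grad_fn, data, lr=0.1, steps=1):
    """Equation 4, with Update assumed to be plain gradient descent."""
    theta = list(theta)
    for _ in range(steps):
        grad = grad_fn(theta, data)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def federated_average(client_thetas, client_sizes):
    """Equation 5: Theta = sum_k (N_k / N) * Theta'_k."""
    total = sum(client_sizes)
    dim = len(client_thetas[0])
    return [
        sum(n_k / total * theta_k[j]
            for theta_k, n_k in zip(client_thetas, client_sizes))
        for j in range(dim)
    ]


def mse_grad(theta, data):
    """Gradient of the mean squared error for the toy model y = a * x + b."""
    a, b = theta
    n = len(data)
    grad_a = sum(2 * (a * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (a * x + b - y) for x, y in data) / n
    return [grad_a, grad_b]


def federated_training(global_theta, client_data, rounds=500, lr=0.1):
    """Alternate local updates and server-side aggregation."""
    sizes = [len(d) for d in client_data]
    for _ in range(rounds):
        # Each client starts from the broadcast global parameters
        # and only ever touches its own data.
        updated = [local_update(global_theta, mse_grad, d, lr=lr)
                   for d in client_data]
        # The server sees parameters only, never the raw samples.
        global_theta = federated_average(updated, sizes)
    return global_theta


# Two clients holding disjoint samples of the line y = 2x + 1.
client_data = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]
theta = federated_training([0.0, 0.0], client_data)
```
]]></preformat>
<p>With one local gradient step per round, the weighted average in Equation (5) coincides with a gradient step on the pooled data, so the global parameters in this toy case approach the slope and intercept of the underlying line even though no client shares its samples.</p>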
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Illustration of the federated learning process employed in the Res-FLNet framework. Local devices or robots update their model parameters locally, transmit them to a central server for aggregation, and refine global model parameters. This collaborative learning approach facilitates privacy-preserving and enhanced performance in Res-FLNet.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0004.tif"/>
</fig>
<p>In the proposed Res-FLNet framework, the combination of ResNet-50 and LSTM models, along with the integration of Federated Learning, enables accurate perception, decision-making, and control in multimodal robot tasks while ensuring privacy protection. These techniques provide a robust and privacy-aware solution for autonomous driving, paving the way for the real-world deployment of intelligent and secure driving systems.</p></sec></sec>
<sec id="s4">
<title>4. Experiments</title>
<sec>
<title>4.1. Datasets</title>
<sec>
<title>4.1.1. KITTI dataset</title>
<p>The KITTI dataset provides real-world driving data collected using a variety of sensors, including cameras, lidar, and GPS. It covers diverse scenes, such as urban, highway, and rural environments, making it suitable for evaluating Res-FLNet&#x00027;s performance under different driving conditions.</p></sec>
<sec>
<title>4.1.2. Waymo Open Dataset</title>
<p>The Waymo Open Dataset is a large-scale dataset that contains high-resolution sensor data, including lidar and camera images, from autonomous vehicles. This dataset provides rich multimodal data and offers a valuable resource for evaluating Res-FLNet&#x00027;s performance in complex driving scenarios.</p></sec>
<sec>
<title>4.1.3. ApolloScape dataset</title>
<p>The ApolloScape dataset covers various driving scenarios, including urban, highway, and suburban environments. It provides high-resolution sensor data, such as lidar, camera images, and radar, making it well suited for evaluating Res-FLNet&#x00027;s performance across different modalities.</p></sec>
<sec>
<title>4.1.4. BDD100K dataset</title>
<p>The BDD100K dataset is a large-scale dataset that contains diverse driving scenes captured in real-world settings. It includes detailed pixel-level semantic annotations, making it suitable for evaluating Res-FLNet&#x00027;s performance in tasks such as object detection and semantic segmentation.</p>
<p>By evaluating the Res-FLNet framework on these diverse datasets, we can provide comprehensive insights into its performance across different driving scenarios and modalities.</p></sec></sec>
<sec>
<title>4.2. Experimental settings</title>
<p>In this section, we provide details about the experimental settings and configurations used to evaluate the Res-FLNet framework on the aforementioned datasets.</p>
<p>The raw sensor data from the KITTI dataset, Waymo Open Dataset, ApolloScape dataset, and BDD100K dataset undergo a series of preprocessing steps to prepare them for training and evaluation. These steps include data cleaning, normalization, resizing, and augmentation techniques such as random cropping, flipping, and rotation. They ensure that the data are in a suitable format and enhance the robustness and generalization capabilities of the Res-FLNet model. The Res-FLNet model is trained using a distributed learning approach based on federated learning. Training takes place on the edge devices or robots, and the model parameters are updated using local data without the need for centralized data aggregation. Training uses a mini-batch stochastic gradient descent optimization algorithm with a learning rate schedule. Hyperparameters, including the learning rate, batch size, and number of training epochs, are tuned to achieve the best performance.</p>
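<p>As a minimal sketch of three of the preprocessing operations named above (normalization, random cropping, and flipping), the following plain-Python example operates on a toy grayscale image stored as nested lists. The function names, the image size, and the crop size are hypothetical; the actual pipeline and its parameters are not specified in this article, and resizing and rotation are omitted here.</p>
<preformat><![CDATA[
```python
import random


def normalize(image, max_value=255.0):
    """Scale pixel values to the range [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a uniformly random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def random_horizontal_flip(image, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


def augment(image, crop_h, crop_w, rng):
    """Normalize, then crop, then (possibly) flip."""
    cropped = random_crop(normalize(image), crop_h, crop_w, rng)
    return random_horizontal_flip(cropped, rng)


rng = random.Random(0)  # seeded generator so the sketch is reproducible
# 4 x 4 toy grayscale image with pixel values from 0 to 255
image = [[(r * 4 + c) * 17 for c in range(4)] for r in range(4)]
sample = augment(image, crop_h=3, crop_w=3, rng=rng)
```
]]></preformat>
<p>For detection tasks, any geometric augmentation applied to an image would also have to be applied to its annotations (for example, bounding boxes), which this sketch does not cover.</p>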
<p>To evaluate the Res-FLNet model&#x00027;s performance, metrics such as accuracy, precision, recall, and F1 score are computed on the test datasets. These metrics provide insights into the model&#x00027;s ability to correctly classify and detect objects in different driving scenarios. In addition to evaluating the Res-FLNet framework, several baseline models are used for comparison. These baseline models include traditional machine learning algorithms, as well as other deep learning architectures commonly employed in autonomous driving tasks. By comparing the performance of Res-FLNet against these baselines, we can assess the improvements and advantages offered by the proposed framework.</p>
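<p>For reference, the four metrics named above can be computed from the counts of true positives, false positives, and false negatives. The sketch below covers binary labels in plain Python; the function name <italic>classification_metrics</italic> and the example labels are hypothetical and are not taken from the experiments in this article.</p>
<preformat><![CDATA[
```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for binary labels (non-empty input)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```
]]></preformat>
<p>In this toy example there are three true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 0.75.</p>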
<p>The following are some steps of the experiment in this article:</p>
<list list-type="simple">
<list-item><p>1. Datasets: We conducted evaluations using the following four datasets:</p>
<p>ApolloScape dataset: This dataset includes a substantial collection of images and annotated information from urban driving scenes, used for research and evaluation in autonomous driving scenario understanding.</p>
<p>BDD100K dataset: This dataset comprises driving scene images from various cities, along with detailed annotations for each image, including object detection, semantic segmentation, and other tasks.</p>
<p>KITTI dataset: This is a commonly used autonomous driving dataset that contains images, LIDAR data, and annotations for urban street driving scenes, serving various autonomous driving research tasks.</p>
<p>Waymo Open Dataset: This is a large-scale autonomous driving dataset released by Waymo, containing high-resolution images, LIDAR scan data, and detailed annotations.</p></list-item>
<list-item><p>2. Data preprocessing: In our experiments, we preprocessed the datasets. This included resizing images, normalizing pixel values, data augmentation, and other operations to ensure data consistency and adaptability.</p></list-item>
<list-item><p>3. Model architecture: We employed a specific model architecture in our experiments. This architecture consists of multiple layers and components designed to meet the specific task requirements. It may include convolutional layers, pooling layers, fully connected layers, taking into consideration factors like receptive field size, skip connections, or multi-scale features.</p></list-item>
<list-item><p>4. Training procedure: We used a specific training procedure to train the models. This involved the use of optimization algorithms such as Adam or SGD, setting learning rates, batch sizes, and training iterations. During training, we applied data augmentation techniques like random cropping, flipping, or rotation to increase data diversity and robustness. Additionally, regularization techniques like weight decay or dropout might have been employed to enhance model generalization.</p></list-item>
<list-item><p>5. Evaluation metrics: We used a range of evaluation metrics to assess model performance. These metrics could include mean average precision (mAP), accuracy, recall, F1 score, and others, depending on the nature and requirements of the task.</p></list-item>
<list-item><p>6. Baseline methods: If applicable, we selected several baseline methods for comparison. We briefly described each baseline method and explained the reasons for their selection.</p></list-item>
<list-item><p>7. Hardware and software environment: We used specific hardware and software environments in our experiments. This includes the type of GPU or CPU, memory capacity, and the software libraries or frameworks used, such as TensorFlow or PyTorch.</p></list-item>
</list>
<p><xref ref-type="table" rid="T6">Algorithm 1</xref> represents the overall training process of the model.</p>
<table-wrap position="float" id="T6">
<label>Algorithm 1</label>
<caption><p>Training process for autonomous driving.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-i0001.tif"/>
</table-wrap></sec>
<sec>
<title>4.3. Experimental results</title>
<p>To evaluate the performance of our proposed method, we conducted extensive experiments on the KITTI dataset and Waymo Open Dataset. The results are summarized in <xref ref-type="table" rid="T1">Table 1</xref> and <xref ref-type="fig" rid="F5">Figure 5</xref>, where we compare our method with several state-of-the-art methods, including Arnold et al. (<xref ref-type="bibr" rid="B2">2019</xref>), Dai et al. (<xref ref-type="bibr" rid="B5">2019</xref>), Khatab et al. (<xref ref-type="bibr" rid="B14">2021</xref>), Kiran et al. (<xref ref-type="bibr" rid="B15">2021</xref>), Prakash et al. (<xref ref-type="bibr" rid="B30">2021</xref>), and Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Comparison of different indicators of different models, from KITTI dataset and Waymo Open Dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
<th valign="top" align="left"><bold>3D mAP (%)</bold></th>
<th valign="top" align="left"><bold>2D mAP (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Dai et al. (<xref ref-type="bibr" rid="B5">2019</xref>)</td>
<td valign="top" align="left">0.19</td>
<td valign="top" align="left">90.38</td>
<td valign="top" align="left">96.47</td>
<td valign="top" align="left">1.0515</td>
<td valign="top" align="left">61.31</td>
<td valign="top" align="left">57.47</td>
</tr> <tr>
<td valign="top" align="left">Arnold et al. (<xref ref-type="bibr" rid="B2">2019</xref>)</td>
<td valign="top" align="left">0.52</td>
<td valign="top" align="left">96.46</td>
<td valign="top" align="left">92.18</td>
<td valign="top" align="left">1.091</td>
<td valign="top" align="left">61.24</td>
<td valign="top" align="left">64.41</td>
</tr> <tr>
<td valign="top" align="left">Khatab et al. (<xref ref-type="bibr" rid="B14">2021</xref>)</td>
<td valign="top" align="left">0.35</td>
<td valign="top" align="left">94</td>
<td valign="top" align="left">91.34</td>
<td valign="top" align="left">0.9922</td>
<td valign="top" align="left">69.92</td>
<td valign="top" align="left">47.85</td>
</tr> <tr>
<td valign="top" align="left">Kiran et al. (<xref ref-type="bibr" rid="B15">2021</xref>)</td>
<td valign="top" align="left">0.5</td>
<td valign="top" align="left">91.97</td>
<td valign="top" align="left">94.82</td>
<td valign="top" align="left">1.1424</td>
<td valign="top" align="left">54.32</td>
<td valign="top" align="left">68.33</td>
</tr> <tr>
<td valign="top" align="left">Prakash et al. (<xref ref-type="bibr" rid="B30">2021</xref>)</td>
<td valign="top" align="left">0.38</td>
<td valign="top" align="left">91.72</td>
<td valign="top" align="left">94.32</td>
<td valign="top" align="left">1.0655</td>
<td valign="top" align="left">41.29</td>
<td valign="top" align="left">46.91</td>
</tr> <tr>
<td valign="top" align="left">Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>)</td>
<td valign="top" align="left">0.4</td>
<td valign="top" align="left">96.93</td>
<td valign="top" align="left">96.68</td>
<td valign="top" align="left">0.9877</td>
<td valign="top" align="left">74.61</td>
<td valign="top" align="left">52.5</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="left">0.014</td>
<td valign="top" align="left">96.73</td>
<td valign="top" align="left">97.33</td>
<td valign="top" align="left">0.4124</td>
<td valign="top" align="left">80.12</td>
<td valign="top" align="left">81.44</td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Comparison of different indicators of different models, from KITTI dataset and Waymo Open Dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0005.tif"/>
</fig>
<p>As shown in <xref ref-type="table" rid="T1">Table 1</xref>, our proposed method achieved the lowest End Point Error (EPE3D) of 0.014 m and the highest Acc10 of 97.33%. Its Acc5 of 96.73% is second only to that of Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>) (96.93%). Our method also achieved the lowest orientation error (0.4124 rad) and the highest 3D and 2D mAP (80.12 and 81.44%, respectively) among the compared methods. These results demonstrate the effectiveness of our proposed method in 3D object detection.</p>
<p>We also conducted extensive experiments on the ApolloScape dataset and BDD100K dataset. The results are summarized in <xref ref-type="table" rid="T2">Table 2</xref> and <xref ref-type="fig" rid="F6">Figure 6</xref>, where we compare our method with several state-of-the-art methods, including Arnold et al. (<xref ref-type="bibr" rid="B2">2019</xref>), Dai et al. (<xref ref-type="bibr" rid="B5">2019</xref>), Khatab et al. (<xref ref-type="bibr" rid="B14">2021</xref>), Kiran et al. (<xref ref-type="bibr" rid="B15">2021</xref>), Prakash et al. (<xref ref-type="bibr" rid="B30">2021</xref>), and Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>). As shown in <xref ref-type="table" rid="T2">Table 2</xref>, our proposed method achieved competitive performance on the ApolloScape dataset and BDD100K dataset: an EPE3D of 0.016 m, an Acc5 of 95.53%, an Acc10 of 96.12%, an orientation error of 0.4356 rad, a 3D mAP of 78.09%, and a 2D mAP of 82.34%. These results demonstrate the effectiveness and robustness of our proposed method in handling complex and diverse driving scenarios.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparison of different indicators of different models, from ApolloScape dataset and BDD100K dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
<th valign="top" align="left"><bold>3D mAP (%)</bold></th>
<th valign="top" align="left"><bold>2D mAP (%)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Dai et al. (<xref ref-type="bibr" rid="B5">2019</xref>)</td>
<td valign="top" align="left">0.25</td>
<td valign="top" align="left">95.51</td>
<td valign="top" align="left">93.11</td>
<td valign="top" align="left">1.0843</td>
<td valign="top" align="left">52.01</td>
<td valign="top" align="left">48.84</td>
</tr> <tr>
<td valign="top" align="left">Arnold et al. (<xref ref-type="bibr" rid="B2">2019</xref>)</td>
<td valign="top" align="left">0.24</td>
<td valign="top" align="left">95.21</td>
<td valign="top" align="left">96.64</td>
<td valign="top" align="left">1.1932</td>
<td valign="top" align="left">54.84</td>
<td valign="top" align="left">74.92</td>
</tr> <tr>
<td valign="top" align="left">Khatab et al. (<xref ref-type="bibr" rid="B14">2021</xref>)</td>
<td valign="top" align="left">0.59</td>
<td valign="top" align="left">96.88</td>
<td valign="top" align="left">94.83</td>
<td valign="top" align="left">0.9731</td>
<td valign="top" align="left">51.06</td>
<td valign="top" align="left">57.13</td>
</tr> <tr>
<td valign="top" align="left">Kiran et al. (<xref ref-type="bibr" rid="B15">2021</xref>)</td>
<td valign="top" align="left">0.12</td>
<td valign="top" align="left">95.47</td>
<td valign="top" align="left">95.12</td>
<td valign="top" align="left">1.0784</td>
<td valign="top" align="left">41.07</td>
<td valign="top" align="left">50.08</td>
</tr> <tr>
<td valign="top" align="left">Prakash et al. (<xref ref-type="bibr" rid="B30">2021</xref>)</td>
<td valign="top" align="left">0.27</td>
<td valign="top" align="left">96.96</td>
<td valign="top" align="left">96.33</td>
<td valign="top" align="left">1.0766</td>
<td valign="top" align="left">65.55</td>
<td valign="top" align="left">67.24</td>
</tr> <tr>
<td valign="top" align="left">Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>)</td>
<td valign="top" align="left">0.55</td>
<td valign="top" align="left">96.67</td>
<td valign="top" align="left">96.71</td>
<td valign="top" align="left">1.0991</td>
<td valign="top" align="left">54.36</td>
<td valign="top" align="left">76.17</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="left">0.016</td>
<td valign="top" align="left">95.53</td>
<td valign="top" align="left">96.12</td>
<td valign="top" align="left">0.4356</td>
<td valign="top" align="left">78.09</td>
<td valign="top" align="left">82.34</td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Comparison of different indicators of different models, from ApolloScape dataset and BDD100K dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0006.tif"/>
</fig>
<p>Compared with the state-of-the-art methods in <xref ref-type="table" rid="T2">Table 2</xref>, our proposed method achieved the lowest EPE3D and orientation error and the highest 3D and 2D mAP, while its Acc5 and Acc10 remain competitive. Specifically, its Acc5 (95.53%) exceeds those of Arnold et al. (<xref ref-type="bibr" rid="B2">2019</xref>), Dai et al. (<xref ref-type="bibr" rid="B5">2019</xref>), and Kiran et al. (<xref ref-type="bibr" rid="B15">2021</xref>), and its 3D and 2D mAP exceed those of all compared methods, including Khatab et al. (<xref ref-type="bibr" rid="B14">2021</xref>) and Prakash et al. (<xref ref-type="bibr" rid="B30">2021</xref>). Although our proposed method did not achieve the best performance in all indicators, since Prakash et al. (<xref ref-type="bibr" rid="B30">2021</xref>) and Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>) obtain higher Acc5 and Acc10, it achieved a good balance between accuracy and efficiency, making it suitable for real-time applications and practical deployment in autonomous driving systems.</p>
<p>To evaluate the efficiency and effectiveness of our proposed method, we conducted experiments on four different datasets, including the KITTI dataset, Waymo Open Dataset, ApolloScape dataset, and BDD100K dataset. The results are summarized in <xref ref-type="table" rid="T3">Table 3</xref> and <xref ref-type="fig" rid="F7">Figure 7</xref>, where we compare our method with several state-of-the-art methods, including Arnold et al. (<xref ref-type="bibr" rid="B2">2019</xref>), Dai et al. (<xref ref-type="bibr" rid="B5">2019</xref>), Khatab et al. (<xref ref-type="bibr" rid="B14">2021</xref>), Kiran et al. (<xref ref-type="bibr" rid="B15">2021</xref>), Prakash et al. (<xref ref-type="bibr" rid="B30">2021</xref>), and Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>).</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Comparison of different indicators of different models, from ApolloScape dataset, BDD100K dataset, KITTI dataset, and Waymo Open Dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left" rowspan="3"><bold>Method</bold></th>
<th valign="top" align="left" colspan="8"><bold>Datasets</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="left" colspan="2"><bold>KITTI dataset</bold></th>
<th valign="top" align="left" colspan="2"><bold>Waymo Open Dataset</bold></th>
<th valign="top" align="left" colspan="2"><bold>ApolloScape dataset</bold></th>
<th valign="top" align="left" colspan="2"><bold>BDD100K dataset</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="left"><bold>Parameters (M)</bold></th>
<th valign="top" align="left"><bold>Flops (G)</bold></th>
<th valign="top" align="left"><bold>Parameters (M)</bold></th>
<th valign="top" align="left"><bold>Flops (G)</bold></th>
<th valign="top" align="left"><bold>Parameters (M)</bold></th>
<th valign="top" align="left"><bold>Flops (G)</bold></th>
<th valign="top" align="left"><bold>Parameters (M)</bold></th>
<th valign="top" align="left"><bold>Flops (G)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Dai et al. (<xref ref-type="bibr" rid="B5">2019</xref>)</td>
<td valign="top" align="left">263.69</td>
<td valign="top" align="left">52.13</td>
<td valign="top" align="left">432.95</td>
<td valign="top" align="left">51.99</td>
<td valign="top" align="left">121.49</td>
<td valign="top" align="left">48.06</td>
<td valign="top" align="left">237.31</td>
<td valign="top" align="left">73.80</td>
</tr> <tr>
<td valign="top" align="left">Arnold et al. (<xref ref-type="bibr" rid="B2">2019</xref>)</td>
<td valign="top" align="left">389.93</td>
<td valign="top" align="left">43.17</td>
<td valign="top" align="left">285.30</td>
<td valign="top" align="left">63.02</td>
<td valign="top" align="left">133.97</td>
<td valign="top" align="left">64.69</td>
<td valign="top" align="left">182.58</td>
<td valign="top" align="left">59.11</td>
</tr> <tr>
<td valign="top" align="left">Khatab et al. (<xref ref-type="bibr" rid="B14">2021</xref>)</td>
<td valign="top" align="left">216.75</td>
<td valign="top" align="left">46.42</td>
<td valign="top" align="left">410.05</td>
<td valign="top" align="left">39.71</td>
<td valign="top" align="left">293.60</td>
<td valign="top" align="left">46.36</td>
<td valign="top" align="left">188.49</td>
<td valign="top" align="left">58.01</td>
</tr> <tr>
<td valign="top" align="left">Kiran et al. (<xref ref-type="bibr" rid="B15">2021</xref>)</td>
<td valign="top" align="left">158.04</td>
<td valign="top" align="left">43.61</td>
<td valign="top" align="left">302.40</td>
<td valign="top" align="left">57.39</td>
<td valign="top" align="left">424.31</td>
<td valign="top" align="left">66.02</td>
<td valign="top" align="left">281.39</td>
<td valign="top" align="left">48.05</td>
</tr> <tr>
<td valign="top" align="left">Prakash et al. (<xref ref-type="bibr" rid="B30">2021</xref>)</td>
<td valign="top" align="left">257.85</td>
<td valign="top" align="left">52.82</td>
<td valign="top" align="left">392.27</td>
<td valign="top" align="left">54.48</td>
<td valign="top" align="left">198.85</td>
<td valign="top" align="left">61.59</td>
<td valign="top" align="left">212.44</td>
<td valign="top" align="left">62.51</td>
</tr> <tr>
<td valign="top" align="left">Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>)</td>
<td valign="top" align="left">441.93</td>
<td valign="top" align="left">52.44</td>
<td valign="top" align="left">383.64</td>
<td valign="top" align="left">61.77</td>
<td valign="top" align="left">187.37</td>
<td valign="top" align="left">72.62</td>
<td valign="top" align="left">112.27</td>
<td valign="top" align="left">46.09</td>
</tr> <tr>
<td valign="top" align="left">Ours</td>
<td valign="top" align="left">98.66</td>
<td valign="top" align="left">23.45</td>
<td valign="top" align="left">107.55</td>
<td valign="top" align="left">21.33</td>
<td valign="top" align="left">112.45</td>
<td valign="top" align="left">19.56</td>
<td valign="top" align="left">118.76</td>
<td valign="top" align="left">16.44</td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Comparison of different indicators of different models, from ApolloScape dataset, BDD100K dataset, KITTI dataset, and Waymo Open Dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0007.tif"/>
</fig>
<p>As shown in <xref ref-type="table" rid="T3">Table 3</xref>, our proposed method achieved the lowest FLOPs on all four datasets and the fewest parameters on three of them: 98.66 M parameters and 23.45 G FLOPs on the KITTI dataset, 107.55 M parameters and 21.33 G FLOPs on the Waymo Open Dataset, 112.45 M parameters and 19.56 G FLOPs on the ApolloScape dataset, and 118.76 M parameters and 16.44 G FLOPs on the BDD100K dataset. The one exception is the parameter count on the BDD100K dataset, where Najibi et al. (<xref ref-type="bibr" rid="B25">2022</xref>) use 112.27 M parameters. These low computational costs make our method more efficient and suitable for real-time applications. Furthermore, our method achieved competitive detection accuracy: an Acc5 of 96.73% and an Acc10 of 97.33% on the KITTI dataset and Waymo Open Dataset (<xref ref-type="table" rid="T1">Table 1</xref>), and an Acc5 of 95.53% and an Acc10 of 96.12% on the ApolloScape dataset and BDD100K dataset (<xref ref-type="table" rid="T2">Table 2</xref>).</p>
<p>In summary, our proposed method achieves a good balance between accuracy and efficiency, with low computational costs and competitive detection accuracy on four different datasets. These results demonstrate the effectiveness and robustness of our proposed method for object detection in complex urban scenes.</p>
<p><xref ref-type="table" rid="T3">Table 3</xref> compares our proposed method with state-of-the-art methods on four datasets: KITTI, ApolloScape, BDD100K, and the Waymo Open Dataset. Our method outperforms all other methods in terms of EPE3D on the KITTI dataset, which is a widely used benchmark for optical flow estimation. It also achieves competitive performance on the other datasets, demonstrating its robustness and generalization ability. A key advantage of our method is its efficiency. As shown in <xref ref-type="table" rid="T3">Table 3</xref>, our method has the lowest computational cost (parameters and FLOPs) among all compared methods, which is particularly important for real-time applications such as autonomous driving. This is achieved through the use of a lightweight network architecture and a fast optimization algorithm.</p>
<p>In addition to the quantitative comparison, we performed ablation experiments to evaluate the impact of different network architectures on the performance of our method. As shown in <xref ref-type="table" rid="T4">Table 4</xref> and <xref ref-type="fig" rid="F8">Figure 8</xref>, the choice of architecture has a large effect. On the KITTI dataset, ResNet-50 achieves the lowest EPE3D (0.013 m) and angle error (0.4456 rad), compared with 0.13&#x02013;0.57 m and 1.0412&#x02013;1.1407 rad for VGG-16, ResNet-18, and DenseNet-121. ResNet-50 also has the lowest EPE3D and angle error on the other three datasets, and the highest Acc5 and Acc10 in every case except Acc5 on the BDD100K dataset, where VGG-16 (95.89%) and DenseNet-121 (95.81%) score higher than ResNet-50 (93.56%). ResNet-50 also has the lowest computation time among the compared network architectures. Therefore, selecting an appropriate network architecture is crucial for the performance of our method.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Ablation experiments on CNN.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left" rowspan="3"><bold>Method</bold></th>
<th valign="top" align="left" colspan="16"><bold>Datasets</bold></th>
</tr>
<tr>
<th valign="top" align="left" colspan="4"><bold>KITTI dataset</bold></th>
<th valign="top" align="left" colspan="4"><bold>ApolloScape dataset</bold></th>
<th valign="top" align="left" colspan="4"><bold>BDD100K dataset</bold></th>
<th valign="top" align="left" colspan="4"><bold>Waymo Open Dataset</bold></th>
</tr>
<tr>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">VGG-16</td>
<td valign="top" align="left">0.57</td>
<td valign="top" align="left">95.59</td>
<td valign="top" align="left">91.97</td>
<td valign="top" align="left">1.0412</td>
<td valign="top" align="left">0.6</td>
<td valign="top" align="left">91.09</td>
<td valign="top" align="left">92.99</td>
<td valign="top" align="left">1.1313</td>
<td valign="top" align="left">0.36</td>
<td valign="top" align="left">95.89</td>
<td valign="top" align="left">92.63</td>
<td valign="top" align="left">0.9822</td>
<td valign="top" align="left">0.32</td>
<td valign="top" align="left">91.06</td>
<td valign="top" align="left">91.45</td>
<td valign="top" align="left">1.218</td>
</tr> <tr>
<td valign="top" align="left">ResNet-18</td>
<td valign="top" align="left">0.13</td>
<td valign="top" align="left">91.72</td>
<td valign="top" align="left">92.94</td>
<td valign="top" align="left">1.0786</td>
<td valign="top" align="left">0.28</td>
<td valign="top" align="left">94.57</td>
<td valign="top" align="left">96.11</td>
<td valign="top" align="left">1.2092</td>
<td valign="top" align="left">0.64</td>
<td valign="top" align="left">92.59</td>
<td valign="top" align="left">95.89</td>
<td valign="top" align="left">1.0133</td>
<td valign="top" align="left">0.44</td>
<td valign="top" align="left">91.96</td>
<td valign="top" align="left">95.79</td>
<td valign="top" align="left">1.0908</td>
</tr> <tr>
<td valign="top" align="left">DenseNet-121</td>
<td valign="top" align="left">0.39</td>
<td valign="top" align="left">93.55</td>
<td valign="top" align="left">91.25</td>
<td valign="top" align="left">1.1407</td>
<td valign="top" align="left">0.14</td>
<td valign="top" align="left">91.85</td>
<td valign="top" align="left">94.75</td>
<td valign="top" align="left">1.1322</td>
<td valign="top" align="left">0.28</td>
<td valign="top" align="left">95.81</td>
<td valign="top" align="left">91.72</td>
<td valign="top" align="left">1.1192</td>
<td valign="top" align="left">0.45</td>
<td valign="top" align="left">94.66</td>
<td valign="top" align="left">92.87</td>
<td valign="top" align="left">1.0813</td>
</tr> <tr>
<td valign="top" align="left">ResNet-50</td>
<td valign="top" align="left">0.013</td>
<td valign="top" align="left">95.87</td>
<td valign="top" align="left">94.12</td>
<td valign="top" align="left">0.4456</td>
<td valign="top" align="left">0.013</td>
<td valign="top" align="left">94.89</td>
<td valign="top" align="left">96.56</td>
<td valign="top" align="left">0.4412</td>
<td valign="top" align="left">0.021</td>
<td valign="top" align="left">93.56</td>
<td valign="top" align="left">96.45</td>
<td valign="top" align="left">0.4245</td>
<td valign="top" align="left">0.016</td>
<td valign="top" align="left">95.11</td>
<td valign="top" align="left">96.65</td>
<td valign="top" align="left">0.534</td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Comparison of different indicators of different models, from KITTI dataset and Waymo Open Dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0008.tif"/>
</fig>
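<p>The ablation comparison in <xref ref-type="table" rid="T4">Table 4</xref> can be checked mechanically. The short Python sketch below transcribes the table values and selects the best backbone per dataset and metric, treating EPE3D and &#x003B8; as lower-is-better and Acc5 and Acc10 as higher-is-better. The names <monospace>TABLE4</monospace> and <monospace>best_backbone</monospace> are hypothetical helpers introduced only for this illustration.</p>
<preformat>
```python
# Values transcribed from Table 4, per dataset:
# (EPE3D in m, Acc5 in %, Acc10 in %, theta in rad).
TABLE4 = {
    "VGG-16": {
        "KITTI": (0.57, 95.59, 91.97, 1.0412),
        "ApolloScape": (0.6, 91.09, 92.99, 1.1313),
        "BDD100K": (0.36, 95.89, 92.63, 0.9822),
        "Waymo": (0.32, 91.06, 91.45, 1.218),
    },
    "ResNet-18": {
        "KITTI": (0.13, 91.72, 92.94, 1.0786),
        "ApolloScape": (0.28, 94.57, 96.11, 1.2092),
        "BDD100K": (0.64, 92.59, 95.89, 1.0133),
        "Waymo": (0.44, 91.96, 95.79, 1.0908),
    },
    "DenseNet-121": {
        "KITTI": (0.39, 93.55, 91.25, 1.1407),
        "ApolloScape": (0.14, 91.85, 94.75, 1.1322),
        "BDD100K": (0.28, 95.81, 91.72, 1.1192),
        "Waymo": (0.45, 94.66, 92.87, 1.0813),
    },
    "ResNet-50": {
        "KITTI": (0.013, 95.87, 94.12, 0.4456),
        "ApolloScape": (0.013, 94.89, 96.56, 0.4412),
        "BDD100K": (0.021, 93.56, 96.45, 0.4245),
        "Waymo": (0.016, 95.11, 96.65, 0.534),
    },
}

METRICS = ("EPE3D", "Acc5", "Acc10", "theta")
LOWER_IS_BETTER = {"EPE3D", "theta"}


def best_backbone(dataset, metric):
    """Return the backbone with the best value of `metric` on `dataset`."""
    idx = METRICS.index(metric)
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(TABLE4, key=lambda name: TABLE4[name][dataset][idx])


for dataset in ("KITTI", "ApolloScape", "BDD100K", "Waymo"):
    print(dataset, {metric: best_backbone(dataset, metric) for metric in METRICS})
```
</preformat>
<p>With these transcribed values, ResNet-50 is selected for every dataset and metric except Acc5 on BDD100K, where VGG-16 scores highest.</p>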
<p>In summary, our proposed method achieves state-of-the-art performance on the KITTI dataset and competitive performance on other datasets, while maintaining low computation time. The ablation experiments demonstrate the impact of different network architectures on the performance of our method, and highlight the importance of selecting an appropriate architecture for the specific application.</p>
<p><xref ref-type="table" rid="T5">Table 5</xref> and <xref ref-type="fig" rid="F9">Figure 9</xref> report ablation experiments in which the LSTM module is compared with GRU, ConvLSTM, and WaveNet alternatives. The table covers all four datasets; the discussion here focuses on KITTI and ApolloScape. The evaluation metrics discussed are EPE3D (end-point error in 3D) and &#x003B8; (orientation error in radians). The LSTM model used in our method outperformed the other models on both metrics and both datasets. On the KITTI dataset, it achieved an EPE3D of 0.023 m and an orientation error of 0.4334 rad, compared with 0.45&#x02013;0.54 m and 0.9765&#x02013;1.0877 rad for the other models. On the ApolloScape dataset, it achieved an EPE3D of 0.019 m and an orientation error of 0.4123 rad, compared with 0.14&#x02013;0.43 m and 1.0055&#x02013;1.1949 rad. These results demonstrate the effectiveness and robustness of the LSTM model for 3D object detection.</p>
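<p>For reference, the metrics reported in <xref ref-type="table" rid="T5">Table 5</xref> can be computed along the following lines. This is a minimal Python sketch, not the evaluation code used in the experiments: the function name <monospace>flow_metrics</monospace> is hypothetical, and the Acc5 and Acc10 thresholds (error under 0.05 m or 5% relative error, and under 0.10 m or 10% relative error, respectively) follow a convention common in the scene flow literature that is assumed here rather than stated in this article. &#x003B8; is taken as the mean angle between predicted and ground-truth vectors, which is also an assumption.</p>
<preformat>
```python
import math


def flow_metrics(pred, gt):
    """Return (EPE3D, Acc5, Acc10, theta) for two equal-length lists of 3D vectors.

    Assumed definitions (not taken from the article):
      EPE3D: mean Euclidean distance between prediction and ground truth (m)
      Acc5:  % of samples with error under 0.05 m or relative error under 5%
      Acc10: % of samples with error under 0.10 m or relative error under 10%
      theta: mean angle between prediction and ground truth (rad)
    """
    errors, rel_errors, angles = [], [], []
    for p, g in zip(pred, gt):
        err = math.dist(p, g)
        p_norm = math.hypot(*p)
        g_norm = math.hypot(*g)
        errors.append(err)
        # The small constants guard against division by zero for zero-length vectors.
        rel_errors.append(err / max(g_norm, 1e-9))
        cos = sum(a * b for a, b in zip(p, g)) / max(p_norm * g_norm, 1e-9)
        # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
        angles.append(math.acos(max(-1.0, min(1.0, cos))))

    n = len(errors)

    def accuracy(abs_tol, rel_tol):
        hits = sum(1 for e, r in zip(errors, rel_errors) if abs_tol > e or rel_tol > r)
        return 100.0 * hits / n

    return sum(errors) / n, accuracy(0.05, 0.05), accuracy(0.10, 0.10), sum(angles) / n


# Toy example (hypothetical values): one exact prediction and one prediction
# orthogonal to its ground-truth vector.
pred = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
gt = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
epe, acc5, acc10, theta = flow_metrics(pred, gt)
print(epe, acc5, acc10, theta)
```
</preformat>
<p>In the toy example, half of the samples fall within each tolerance, so Acc5 and Acc10 are both 50%, and the mean angle is &#x003C0;/4 rad.</p>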
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Ablation experiments on LSTM.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left" rowspan="3"><bold>Method</bold></th>
<th valign="top" align="left" colspan="16"><bold>Datasets</bold></th>
</tr>
<tr>
<th valign="top" align="left" colspan="4"><bold>KITTI dataset</bold></th>
<th valign="top" align="left" colspan="4"><bold>ApolloScape dataset</bold></th>
<th valign="top" align="left" colspan="4"><bold>BDD100K dataset</bold></th>
<th valign="top" align="left" colspan="4"><bold>Waymo Open Dataset</bold></th>
</tr>
<tr>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
<th valign="top" align="left"><bold>EPE3D (m)</bold></th>
<th valign="top" align="left"><bold>Acc5 (%)</bold></th>
<th valign="top" align="left"><bold>Acc10 (%)</bold></th>
<th valign="top" align="left"><bold>&#x003B8;(<italic>rad</italic>)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">GRU</td>
<td valign="top" align="left">0.45</td>
<td valign="top" align="left">94.68</td>
<td valign="top" align="left">95.53</td>
<td valign="top" align="left">0.9765</td>
<td valign="top" align="left">0.43</td>
<td valign="top" align="left">90.37</td>
<td valign="top" align="left">96.65</td>
<td valign="top" align="left">1.1949</td>
<td valign="top" align="left">0.38</td>
<td valign="top" align="left">95.81</td>
<td valign="top" align="left">96.22</td>
<td valign="top" align="left">1.2211</td>
<td valign="top" align="left">0.38</td>
<td valign="top" align="left">92.98</td>
<td valign="top" align="left">94.43</td>
<td valign="top" align="left">1.0611</td>
</tr> <tr>
<td valign="top" align="left">ConvLSTM</td>
<td valign="top" align="left">0.54</td>
<td valign="top" align="left">92.5</td>
<td valign="top" align="left">93.28</td>
<td valign="top" align="left">1.0877</td>
<td valign="top" align="left">0.29</td>
<td valign="top" align="left">95.36</td>
<td valign="top" align="left">93.96</td>
<td valign="top" align="left">1.0508</td>
<td valign="top" align="left">0.44</td>
<td valign="top" align="left">92.46</td>
<td valign="top" align="left">94.86</td>
<td valign="top" align="left">1.1399</td>
<td valign="top" align="left">0.62</td>
<td valign="top" align="left">93.76</td>
<td valign="top" align="left">95.98</td>
<td valign="top" align="left">1.0319</td>
</tr> <tr>
<td valign="top" align="left">WaveNet</td>
<td valign="top" align="left">0.53</td>
<td valign="top" align="left">95.71</td>
<td valign="top" align="left">95.74</td>
<td valign="top" align="left">1.056</td>
<td valign="top" align="left">0.14</td>
<td valign="top" align="left">94.11</td>
<td valign="top" align="left">91.58</td>
<td valign="top" align="left">1.0055</td>
<td valign="top" align="left">0.66</td>
<td valign="top" align="left">94.83</td>
<td valign="top" align="left">91.88</td>
<td valign="top" align="left">1.1067</td>
<td valign="top" align="left">0.67</td>
<td valign="top" align="left">95.24</td>
<td valign="top" align="left">96.51</td>
<td valign="top" align="left">1.1839</td>
</tr> <tr>
<td valign="top" align="left">LSTM</td>
<td valign="top" align="left">0.023</td>
<td valign="top" align="left">96.56</td>
<td valign="top" align="left">95.33</td>
<td valign="top" align="left">0.4334</td>
<td valign="top" align="left">0.019</td>
<td valign="top" align="left">94.45</td>
<td valign="top" align="left">96.44</td>
<td valign="top" align="left">0.4123</td>
<td valign="top" align="left">0.021</td>
<td valign="top" align="left">96.44</td>
<td valign="top" align="left">96.77</td>
<td valign="top" align="left">0.5123</td>
<td valign="top" align="left">0.016</td>
<td valign="top" align="left">95.11</td>
<td valign="top" align="left">97.12</td>
<td valign="top" align="left">0.4976</td>
</tr></tbody>
</table>
</table-wrap>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Comparison of different indicators of different models, from KITTI dataset and Waymo Open Dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1269105-g0009.tif"/>
</fig>
<p>Compared with the other models, our LSTM model achieved markedly lower EPE3D and orientation error on both datasets, indicating that it effectively captures the temporal dependencies in the LiDAR data and improves the accuracy of object detection. The LSTM model is also computationally efficient and can be deployed in real-time systems for autonomous driving and other applications.</p>
<p>Moreover, we observed that the orientation error &#x003B8; (in radians) was numerically larger than the EPE3D (in meters) on both datasets. Because the two metrics have different units, this comparison is only indicative, but it suggests that orientation estimation is more challenging than distance estimation. This is likely because the orientation of an object is determined by multiple features and is more susceptible to noise and occlusion. Nonetheless, our LSTM model addressed these challenges more effectively than the other models.</p></sec></sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusion</title>
<p>In this paper, we proposed Res-FLNet, a novel autonomous driving framework for multimodal robots, incorporating ResNet-50 and LSTM models while ensuring privacy protection. The proposed method aimed to address the challenges in autonomous driving tasks by effectively integrating visual and textual information. We have provided an overview of the method, described the textual representation techniques, and outlined the fusion process for combining visual and textual features. Additionally, we formulated the attention-based multimodal fusion mechanism to combine the strengths of different modalities. Through extensive experiments on various datasets, including KITTI, Waymo Open Dataset, ApolloScape, and BDD100K, we have demonstrated the efficacy of Res-FLNet in enhancing the performance of multimodal robot tasks. The results showed significant improvements in perception, decision-making, and control, showcasing the potential of the proposed method for real-world autonomous driving scenarios.</p>
<p>In retrospect, this paper first identified the problem of effectively utilizing multimodal information for autonomous driving tasks while ensuring data privacy. Res-FLNet addresses this problem by using ResNet-50 for image feature extraction and LSTM for sequential data representation, combined with attention-based multimodal fusion to integrate the two. Although Res-FLNet showed promising results, two limitations should be acknowledged. First, the proposed method requires careful tuning of hyperparameters, which can be time-consuming and computationally intensive. Future research could explore automated hyperparameter tuning techniques to alleviate this issue. Second, while Res-FLNet is designed to protect privacy, it may not be fully immune to adversarial attacks. Further investigations into adversarial robustness and privacy preservation mechanisms are warranted.</p>
<p>In conclusion, this paper presented Res-FLNet as an effective solution for multimodal robot tasks in autonomous driving scenarios. By combining ResNet-50 and LSTM models and employing attention-based multimodal fusion, Res-FLNet demonstrated superior performance compared to existing methods. The contributions of this work lie in providing a comprehensive framework for multimodal data integration, improving autonomous driving capabilities, and ensuring privacy protection in the era of data-driven robotics. The potential significance of Res-FLNet extends to practical applications in autonomous vehicles, where robust and privacy-preserving methods are of paramount importance.</p></sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p></sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>SW: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing-original draft, Writing-review and editing.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>The author declares that no financial support was received for the research, authorship, and/or publication of this article.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Alahi</surname> <given-names>A.</given-names></name> <name><surname>Goel</surname> <given-names>K.</given-names></name> <name><surname>Ramanathan</surname> <given-names>V.</given-names></name> <name><surname>Robicquet</surname> <given-names>A.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name> <name><surname>Savarese</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Social LSTM: human trajectory prediction in crowded spaces,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, 961&#x02013;971. Available online at: <ext-link ext-link-type="uri" xlink:href="https://openaccess.thecvf.com/content_cvpr_2016/html/Alahi_Social_LSTM_Human_CVPR_2016_paper.html">https://openaccess.thecvf.com/content_cvpr_2016/html/Alahi_Social_LSTM_Human_CVPR_2016_paper.html</ext-link></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arnold</surname> <given-names>E.</given-names></name> <name><surname>Al-Jarrah</surname> <given-names>O. Y.</given-names></name> <name><surname>Dianati</surname> <given-names>M.</given-names></name> <name><surname>Fallah</surname> <given-names>S.</given-names></name> <name><surname>Oxtoby</surname> <given-names>D.</given-names></name> <name><surname>Mouzakitis</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>A survey on 3d object detection methods for autonomous driving applications</article-title>. <source>IEEE Transact. Intell. Transport. Syst</source>. <volume>20</volume>, <fpage>3782</fpage>&#x02013;<lpage>3795</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2019.2892405</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Behl</surname> <given-names>A.</given-names></name> <name><surname>Chitta</surname> <given-names>K.</given-names></name> <name><surname>Prakash</surname> <given-names>A.</given-names></name> <name><surname>Ohn-Bar</surname> <given-names>E.</given-names></name> <name><surname>Geiger</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Label efficient visual abstractions for autonomous driving,&#x0201D;</article-title> in <source>2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source> (IEEE), <fpage>2338</fpage>&#x02013;<lpage>2345</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://ieeexplore.ieee.org/abstract/document/9340641">https://ieeexplore.ieee.org/abstract/document/9340641</ext-link></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Courtiol</surname> <given-names>P.</given-names></name> <name><surname>Maussion</surname> <given-names>C.</given-names></name> <name><surname>Moarii</surname> <given-names>M.</given-names></name> <name><surname>Pronier</surname> <given-names>E.</given-names></name> <name><surname>Pilcer</surname> <given-names>S.</given-names></name> <name><surname>Sefta</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Deep learning-based classification of mesothelioma improves prediction of patient outcome</article-title>. <source>Nat. Med</source>. <volume>25</volume>, <fpage>1519</fpage>&#x02013;<lpage>1525</lpage>. <pub-id pub-id-type="doi">10.1038/s41591-019-0583-3</pub-id><pub-id pub-id-type="pmid">31591589</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dai</surname> <given-names>H.</given-names></name> <name><surname>Zeng</surname> <given-names>X.</given-names></name> <name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). <article-title>A scheduling algorithm for autonomous driving tasks on mobile edge computing servers</article-title>. <source>J. Syst. Arch</source>. <volume>94</volume>, <fpage>14</fpage>&#x02013;<lpage>23</lpage>. <pub-id pub-id-type="doi">10.1016/j.sysarc.2019.02.004</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Doomra</surname> <given-names>S.</given-names></name> <name><surname>Kohli</surname> <given-names>N.</given-names></name> <name><surname>Athavale</surname> <given-names>S.</given-names></name></person-group> (<year>2020</year>). <source>Turn signal prediction: a federated learning case study. <italic>arXiv</italic></source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2012.12401">https://arxiv.org/abs/2012.12401</ext-link></citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elnagar</surname> <given-names>A.</given-names></name></person-group> (<year>2001</year>). <article-title>&#x0201C;Prediction of moving objects in dynamic environments using Kalman filters,&#x0201D;</article-title> in <source>Proceedings 2001 IEEE International Symposium on Computational Intelligence in Robotics and Automation (Cat. No. 01EX515)</source> (IEEE), <fpage>414</fpage>&#x02013;<lpage>419</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Girgis</surname> <given-names>R.</given-names></name> <name><surname>Golemo</surname> <given-names>F.</given-names></name> <name><surname>Codevilla</surname> <given-names>F.</given-names></name> <name><surname>Weiss</surname> <given-names>M.</given-names></name> <name><surname>D&#x00027;Souza</surname> <given-names>J. A.</given-names></name> <name><surname>Kahou</surname> <given-names>S. E.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Latent variable sequential set transformers for joint multi-agent motion prediction</article-title>. <source>arXiv</source>.</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>F.</given-names></name> <name><surname>Ye</surname> <given-names>Q.</given-names></name></person-group> (<year>2022</year>). <article-title>A bearing fault diagnosis method based on wavelet packet transform and convolutional neural network optimized by simulated annealing algorithm</article-title>. <source>Sensors</source> <volume>22</volume>, <fpage>1410</fpage>. <pub-id pub-id-type="doi">10.3390/s22041410</pub-id><pub-id pub-id-type="pmid">35214312</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>Y.</given-names></name> <name><surname>Fang</surname> <given-names>S.</given-names></name> <name><surname>Lei</surname> <given-names>Z.</given-names></name> <name><surname>Zhong</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>S.</given-names></name></person-group> (<year>2022</year>). <article-title>Where2comm: communication-efficient collaborative perception via spatial confidence maps</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>35</volume>, <fpage>4874</fpage>&#x02013;<lpage>4886</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>K.</given-names></name> <name><surname>Shi</surname> <given-names>B.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name></person-group> (<year>2022</year>). <article-title>Multi-modal sensor fusion for auto driving perception: a survey</article-title>. <source>arXiv</source>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ivanovic</surname> <given-names>B.</given-names></name> <name><surname>Pavone</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;The trajectron: probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, <fpage>2375</fpage>&#x02013;<lpage>2384</lpage>.</citation>
</ref>
<ref id="B13">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Khan</surname> <given-names>L. U.</given-names></name> <name><surname>Tun</surname> <given-names>Y. K.</given-names></name> <name><surname>Alsenwi</surname> <given-names>M.</given-names></name> <name><surname>Imran</surname> <given-names>M.</given-names></name> <name><surname>Han</surname> <given-names>Z.</given-names></name> <name><surname>Hong</surname> <given-names>C. S.</given-names></name></person-group> (<year>2022</year>). <article-title>A dispersed federated learning framework for 6g-enabled autonomous driving cars</article-title>. <source>IEEE Transact. Netw. Sci. Eng</source>. <pub-id pub-id-type="doi">10.1109/TNSE.2022.3188571</pub-id> Available online at: <ext-link ext-link-type="uri" xlink:href="https://ieeexplore.ieee.org/abstract/document/9831041">https://ieeexplore.ieee.org/abstract/document/9831041</ext-link></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khatab</surname> <given-names>E.</given-names></name> <name><surname>Onsy</surname> <given-names>A.</given-names></name> <name><surname>Varley</surname> <given-names>M.</given-names></name> <name><surname>Abouelfarag</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>Vulnerable objects detection for autonomous driving: a review</article-title>. <source>Integration</source> <volume>78</volume>, <fpage>36</fpage>&#x02013;<lpage>48</lpage>. <pub-id pub-id-type="doi">10.1016/j.vlsi.2021.01.002</pub-id><pub-id pub-id-type="pmid">33513998</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kiran</surname> <given-names>B. R.</given-names></name> <name><surname>Sobh</surname> <given-names>I.</given-names></name> <name><surname>Talpaert</surname> <given-names>V.</given-names></name> <name><surname>Mannion</surname> <given-names>P.</given-names></name> <name><surname>Al Sallab</surname> <given-names>A. A.</given-names></name> <name><surname>Yogamani</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Deep reinforcement learning for autonomous driving: a survey</article-title>. <source>IEEE Transact. Intell. Transport. Syst</source>. <volume>23</volume>, <fpage>4909</fpage>&#x02013;<lpage>4926</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2021.3054625</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kitani</surname> <given-names>K. M.</given-names></name> <name><surname>Ziebart</surname> <given-names>B. D.</given-names></name> <name><surname>Bagnell</surname> <given-names>J. A.</given-names></name> <name><surname>Hebert</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Activity forecasting,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part IV 12</source> (<publisher-name>Springer</publisher-name>), <fpage>201</fpage>&#x02013;<lpage>214</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kooij</surname> <given-names>J. F. P.</given-names></name> <name><surname>Schneider</surname> <given-names>N.</given-names></name> <name><surname>Flohr</surname> <given-names>F.</given-names></name> <name><surname>Gavrila</surname> <given-names>D. M.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Context-based pedestrian path prediction,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13</source> (<publisher-name>Springer</publisher-name>), <fpage>618</fpage>&#x02013;<lpage>633</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ku</surname> <given-names>J.</given-names></name> <name><surname>Mozifian</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>J.</given-names></name> <name><surname>Harakeh</surname> <given-names>A.</given-names></name> <name><surname>Waslander</surname> <given-names>S. L.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Joint 3D proposal generation and object detection from view aggregation,&#x0201D;</article-title> in <source>2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source> (<publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.<pub-id pub-id-type="pmid">33379254</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Varshney</surname> <given-names>P. K.</given-names></name></person-group> (<year>2021</year>). <article-title>Decentralized federated learning via mutual knowledge transfer</article-title>. <source>IEEE Int. Things J</source>. <volume>9</volume>, <fpage>1136</fpage>&#x02013;<lpage>1147</lpage>. <pub-id pub-id-type="doi">10.1109/JIOT.2021.3078543</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liang</surname> <given-names>T.</given-names></name> <name><surname>Xie</surname> <given-names>H.</given-names></name> <name><surname>Yu</surname> <given-names>K.</given-names></name> <name><surname>Xia</surname> <given-names>Z.</given-names></name> <name><surname>Lin</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>BEVFusion: a simple and robust LiDAR-camera fusion framework</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>35</volume>, <fpage>10421</fpage>&#x02013;<lpage>10434</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>B.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Liu</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>C.-Z.</given-names></name></person-group> (<year>2019</year>). <source>Federated imitation learning: a privacy considered imitation learning framework for cloud robotic systems with heterogeneous sensor data. <italic>arXiv</italic></source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1909.00895">https://arxiv.org/abs/1909.00895</ext-link></citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Hua</surname> <given-names>M.</given-names></name> <name><surname>Deng</surname> <given-names>Z.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name> <name><surname>Hu</surname> <given-names>C.</given-names></name> <name><surname>Song</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>A systematic survey of control techniques and applications: from autonomous vehicles to connected and automated vehicles</article-title>. <source>arXiv</source>.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marfoq</surname> <given-names>O.</given-names></name> <name><surname>Xu</surname> <given-names>C.</given-names></name> <name><surname>Neglia</surname> <given-names>G.</given-names></name> <name><surname>Vidal</surname> <given-names>R.</given-names></name></person-group> (<year>2020</year>). <article-title>Throughput-optimal topology design for cross-silo federated learning</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>33</volume>, <fpage>19478</fpage>&#x02013;<lpage>19487</lpage>.</citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Morais</surname> <given-names>R.</given-names></name> <name><surname>Le</surname> <given-names>V.</given-names></name> <name><surname>Tran</surname> <given-names>T.</given-names></name> <name><surname>Saha</surname> <given-names>B.</given-names></name> <name><surname>Mansour</surname> <given-names>M.</given-names></name> <name><surname>Venkatesh</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Learning regularity in skeleton trajectories for anomaly detection in videos,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>11996</fpage>&#x02013;<lpage>12004</lpage>.</citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Najibi</surname> <given-names>M.</given-names></name> <name><surname>Ji</surname> <given-names>J.</given-names></name> <name><surname>Zhou</surname> <given-names>Y.</given-names></name> <name><surname>Qi</surname> <given-names>C. R.</given-names></name> <name><surname>Yan</surname> <given-names>X.</given-names></name> <name><surname>Ettinger</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>&#x0201C;Motion inspired unsupervised perception and prediction in autonomous driving,&#x0201D;</article-title> in <source>European Conference on Computer Vision</source> (<publisher-loc>Springer</publisher-loc>), <fpage>424</fpage>&#x02013;<lpage>443</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ngiam</surname> <given-names>J.</given-names></name> <name><surname>Vasudevan</surname> <given-names>V.</given-names></name> <name><surname>Caine</surname> <given-names>B.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Chiang</surname> <given-names>H.-T. L.</given-names></name> <name><surname>Ling</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Scene transformer: a unified architecture for predicting future trajectories of multiple agents,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ning</surname> <given-names>X.</given-names></name> <name><surname>Tian</surname> <given-names>W.</given-names></name> <name><surname>He</surname> <given-names>F.</given-names></name> <name><surname>Bai</surname> <given-names>X.</given-names></name> <name><surname>Sun</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>W.</given-names></name></person-group> (<year>2023</year>). <article-title>Hyper-sausage coverage function neuron model and learning algorithm for image classification</article-title>. <source>Pattern Recognit</source>. <volume>136</volume>, 109216. <pub-id pub-id-type="doi">10.1016/j.patcog.2022.109216</pub-id><pub-id pub-id-type="pmid">36107957</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ning</surname> <given-names>X.</given-names></name> <name><surname>Tian</surname> <given-names>W.</given-names></name> <name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>W.</given-names></name> <name><surname>Bai</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name></person-group> (<year>2022</year>). <article-title>HCFNN: high-order coverage function neural network for image classification</article-title>. <source>Pattern Recognit</source>. <volume>131</volume>, 108873. <pub-id pub-id-type="doi">10.1016/j.patcog.2022.108873</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Peng</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Ou</surname> <given-names>W.</given-names></name> <name><surname>Han</surname> <given-names>W.</given-names></name> <name><surname>Ma</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>BFLP: an adaptive federated learning framework for internet of vehicles</article-title>. <source>Mobile Inf. Syst</source>. <volume>2021</volume>, <fpage>1</fpage>&#x02013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1155/2021/6633332</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Prakash</surname> <given-names>A.</given-names></name> <name><surname>Chitta</surname> <given-names>K.</given-names></name> <name><surname>Geiger</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Multi-modal fusion transformer for end-to-end autonomous driving,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>7077</fpage>&#x02013;<lpage>7087</lpage>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sadeghian</surname> <given-names>A.</given-names></name> <name><surname>Kosaraju</surname> <given-names>V.</given-names></name> <name><surname>Sadeghian</surname> <given-names>A.</given-names></name> <name><surname>Hirose</surname> <given-names>N.</given-names></name> <name><surname>Rezatofighi</surname> <given-names>H.</given-names></name> <name><surname>Savarese</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>1349</fpage>&#x02013;<lpage>1358</lpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shingi</surname> <given-names>G.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;A federated learning based approach for loan defaults prediction,&#x0201D;</article-title> in <source>2020 International Conference on Data Mining Workshops (ICDMW)</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>362</fpage>&#x02013;<lpage>368</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Sobh</surname> <given-names>I.</given-names></name> <name><surname>Amin</surname> <given-names>L.</given-names></name> <name><surname>Abdelkarim</surname> <given-names>S.</given-names></name> <name><surname>Elmadawy</surname> <given-names>K.</given-names></name> <name><surname>Saeed</surname> <given-names>M.</given-names></name> <name><surname>Abdeltawab</surname> <given-names>O.</given-names></name> <etal/></person-group>. (<year>2018</year>). <source>End-to-End Multi-Modal Sensors Fusion System for Urban Automated Driving</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://openreview.net/forum?id=Byx4Xkqjcm">https://openreview.net/forum?id=Byx4Xkqjcm</ext-link></citation>
</ref>
<ref id="B34">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <source>Attention is all you need. <italic>Adv. Neural Inf. Process. Syst</italic>. 30</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html">https://proceedings.neurips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html</ext-link></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>Y.</given-names></name> <name><surname>Bighashdel</surname> <given-names>A.</given-names></name> <name><surname>Chen</surname> <given-names>G.</given-names></name> <name><surname>Dubbelman</surname> <given-names>G.</given-names></name> <name><surname>Jancura</surname> <given-names>P.</given-names></name></person-group> (<year>2022</year>). <article-title>Continual pedestrian trajectory learning with social generative replay</article-title>. <source>IEEE Robot. Automat. Lett</source>. <volume>8</volume>, <fpage>848</fpage>&#x02013;<lpage>855</lpage>. <pub-id pub-id-type="doi">10.1109/LRA.2022.3231833</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xiao</surname> <given-names>Y.</given-names></name> <name><surname>Codevilla</surname> <given-names>F.</given-names></name> <name><surname>Gurram</surname> <given-names>A.</given-names></name> <name><surname>Urfalioglu</surname> <given-names>O.</given-names></name> <name><surname>L&#x000F3;pez</surname> <given-names>A. M.</given-names></name></person-group> (<year>2020</year>). <article-title>Multimodal end-to-end autonomous driving</article-title>. <source>IEEE Transact. Intell. Transport. Syst</source>. <volume>23</volume>, <fpage>537</fpage>&#x02013;<lpage>547</lpage>. <pub-id pub-id-type="doi">10.1109/TITS.2020.3013234</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>J.</given-names></name> <name><surname>Glicksberg</surname> <given-names>B. S.</given-names></name> <name><surname>Su</surname> <given-names>C.</given-names></name> <name><surname>Walker</surname> <given-names>P.</given-names></name> <name><surname>Bian</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>F.</given-names></name></person-group> (<year>2021</year>). <article-title>Federated learning for healthcare informatics</article-title>. <source>J. Healthc. Inf. Res</source>. <volume>5</volume>, <fpage>1</fpage>&#x02013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1007/s41666-020-00082-4</pub-id><pub-id pub-id-type="pmid">33204939</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zernetsch</surname> <given-names>S.</given-names></name> <name><surname>Kohnen</surname> <given-names>S.</given-names></name> <name><surname>Goldhammer</surname> <given-names>M.</given-names></name> <name><surname>Doll</surname> <given-names>K.</given-names></name> <name><surname>Sick</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Trajectory prediction of cyclists using a physical model and an artificial neural network,&#x0201D;</article-title> in <source>2016 IEEE Intelligent Vehicles Symposium (IV)</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>833</fpage>&#x02013;<lpage>838</lpage>.</citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhai</surname> <given-names>G.</given-names></name> <name><surname>Huang</surname> <given-names>D.</given-names></name> <name><surname>Wu</surname> <given-names>S.-C.</given-names></name> <name><surname>Jung</surname> <given-names>H.</given-names></name> <name><surname>Di</surname> <given-names>Y.</given-names></name> <name><surname>Manhardt</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2023</year>). <article-title>&#x0201C;MonoGraspNet: 6-DoF grasping with a single RGB image,&#x0201D;</article-title> in <source>2023 IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>1708</fpage>&#x02013;<lpage>1714</lpage>.</citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Yang</surname> <given-names>D.</given-names></name> <name><surname>Yurtsever</surname> <given-names>E.</given-names></name> <name><surname>Redmill</surname> <given-names>K. A.</given-names></name> <name><surname>&#x000D6;zg&#x000FC;ner</surname> <given-names>&#x000DC;.</given-names></name></person-group> (<year>2021b</year>). <article-title>&#x0201C;Faraway-frustum: dealing with LiDAR sparsity for 3D object detection using fusion,&#x0201D;</article-title> in <source>2021 IEEE International Intelligent Transportation Systems Conference (ITSC)</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>2646</fpage>&#x02013;<lpage>2652</lpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Hong</surname> <given-names>Y.</given-names></name> <name><surname>Zhou</surname> <given-names>L.</given-names></name> <name><surname>Hao</surname> <given-names>Q.</given-names></name></person-group> (<year>2021c</year>). <article-title>&#x0201C;Distributed dynamic map fusion via federated learning for intelligent networked vehicles,&#x0201D;</article-title> in <source>2021 IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>953</fpage>&#x02013;<lpage>959</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Bosch</surname> <given-names>J.</given-names></name> <name><surname>Olsson</surname> <given-names>H. H.</given-names></name></person-group> (<year>2021a</year>). <article-title>&#x0201C;Real-time end-to-end federated learning: an automotive case study,&#x0201D;</article-title> in <source>2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC)</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>459</fpage>&#x02013;<lpage>468</lpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>B.</given-names></name> <name><surname>Kr&#x000E4;henb&#x000FC;hl</surname> <given-names>P.</given-names></name> <name><surname>Koltun</surname> <given-names>V.</given-names></name></person-group> (<year>2019</year>). <article-title>Does computer vision matter for action?</article-title> <source>Sci. Robot</source>. <volume>4</volume>, eaaw6661. <pub-id pub-id-type="doi">10.1126/scirobotics.aaw6661</pub-id><pub-id pub-id-type="pmid">33137779</pub-id></citation></ref>
</ref-list>
</back>
</article>