<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Sig. Proc.</journal-id>
<journal-title>Frontiers in Signal Processing</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Sig. Proc.</abbrev-journal-title>
<issn pub-type="epub">2673-8198</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1198205</article-id>
<article-id pub-id-type="doi">10.3389/frsip.2023.1198205</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Signal Processing</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>4DEgo: ego-velocity estimation from high-resolution radar data</article-title>
<alt-title alt-title-type="left-running-head">Rai et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frsip.2023.1198205">10.3389/frsip.2023.1198205</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Rai</surname>
<given-names>Prashant Kumar</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1773613/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Strokina</surname>
<given-names>Nataliya</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1276559/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ghabcheloo</surname>
<given-names>Reza</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/224108/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Automation Technology and Mechanical Engineering, Faculty of Engineering and Natural Sciences</institution>, <institution>Tampere University</institution>, <addr-line>Tampere</addr-line>, <country>Finland</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Computing Sciences, Faculty of Information Technology and Communication</institution>, <institution>Tampere University</institution>, <addr-line>Tampere</addr-line>, <country>Finland</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1598892/overview">Shobha Sundar Ram</ext-link>, Indraprastha Institute of Information Technology Delhi, India</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1623455/overview">Faran Awais Butt</ext-link>, University of Management and Technology, Pakistan</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2296647/overview">Akanksha Sneh</ext-link>, Indraprastha Institute of Information Technology Delhi, India</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Prashant Kumar Rai, <email>prashant.rai@tuni.fi</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>27</day>
<month>06</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>3</volume>
<elocation-id>1198205</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>03</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>08</day>
<month>06</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Rai, Strokina and Ghabcheloo.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Rai, Strokina and Ghabcheloo</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Automotive radars allow for perception of the environment in adverse visibility and weather conditions. New high-resolution sensors have demonstrated potential for tasks beyond obstacle detection and velocity adjustment, such as mapping or target tracking. This paper proposes an end-to-end method for ego-velocity estimation based on radar scan registration. Our architecture includes a 3D convolution over all three channels of the heatmap, capturing features associated with motion, and an attention mechanism for selecting significant features for regression. To the best of our knowledge, this is the first work utilizing the full 3D radar heatmap for ego-velocity estimation. We verify the efficacy of our approach using the publicly available ColoRadar dataset and study the effect of architectural choices and distributional shifts on performance.</p>
</abstract>
<kwd-group>
<kwd>ego-motion estimation</kwd>
<kwd>4D automotive radar</kwd>
<kwd>autonomous navigation</kwd>
<kwd>transformers</kwd>
<kwd>attention</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Radar Signal Processing</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Automotive radars have gained significant attention in recent years. New 76&#x2013;81&#xa0;GHz high-resolution sensors (<xref ref-type="bibr" rid="B13">Dickmann et al., 2016</xref>; <xref ref-type="bibr" rid="B15">Engels et al., 2017</xref>) have shown potential for tasks beyond obstacle detection and velocity adjustment. Unlike traditional automotive radars, they can be used for localization (<xref ref-type="bibr" rid="B18">Heller et al., 2021</xref>), SLAM (simultaneous localization and mapping) (<xref ref-type="bibr" rid="B20">Holder et al., 2019</xref>), and ego-motion estimation. The estimation of ego-motion can enable other higher-level tasks such as mapping, target tracking, state estimation for control, and planning (<xref ref-type="bibr" rid="B36">Steiner et al., 2018</xref>). The majority of ego-motion and odometry algorithms rely on onboard sensors such as IMUs, cameras, and lidar. Vision-based methods use a stream of images acquired with single or multiple cameras attached to the robot for relative transformation (rotation and translation) estimation (<xref ref-type="bibr" rid="B43">Yang et al., 2020</xref>). Lidar has been well explored for these tasks in recent years and performs extremely well for odometry/ego-motion estimation. Classical lidar-based odometry methods (<xref ref-type="bibr" rid="B44">Zhang and Singh, 2014</xref>; <xref ref-type="bibr" rid="B33">Shan et al., 2020</xref>; <xref ref-type="bibr" rid="B32">Shan and Englot, 2018</xref>) use ICP (iterative closest point) (<xref ref-type="bibr" rid="B6">Besl and McKay, 1992</xref>) and NDT (normal distributions transform)-based registration (<xref ref-type="bibr" rid="B30">Magnusson et al., 2007</xref>; <xref ref-type="bibr" rid="B45">Zhou et al., 2017</xref>).</p>
<p>Despite such advances, optical sensors like cameras and lidar are unreliable in visually degraded environments and adverse weather. Automotive radars operate on millimeter wavelengths, and their emitted radio signal does not degrade much in the presence of dust, smoke, or adverse weather conditions. Frequency modulation allows a sensor to be operated with multiple intrinsic settings that adjust its range and field of view. Radar data are different from lidar point clouds and camera data and are collected as complex-valued tensors (<xref ref-type="bibr" rid="B15">Engels et al., 2017</xref>), as shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. Because of its complex nature, interpreting these data is not trivial. In this study, we converted the raw data to a so-called &#x201c;heatmap&#x201d; before registration. <xref ref-type="fig" rid="F3">Figure 3</xref> visualizes one radar scan in a three-dimensional radar heatmap, which is data intensity <italic>versus</italic> range&#x2013;azimuth&#x2013;elevation. A bird&#x27;s-eye view of this heatmap is shown in <xref ref-type="fig" rid="F1">Figures 1</xref>, <xref ref-type="fig" rid="F4">4</xref>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Proposed method for estimating the linear and angular velocities from consecutive radar scans (<italic>S</italic>
<sub>
<italic>i</italic>
</sub>, <italic>S</italic>
<sub>
<italic>i</italic>&#x2b;1</sub>). The feature extractor is a 3DCNN, which learns the features from the scans; further features are passed to a transformer encoder, and then, a set of linear and angular velocities is obtained via regression.</p>
</caption>
<graphic xlink:href="frsip-03-1198205-g001.tif"/>
</fig>
<p>A major challenge of radar data is the noise that hinders hand-crafted feature extraction and semantic understanding of the signals. Several post-processing algorithms for radar point clouds have been proposed, such as CFAR (constant false alarm rate) (<xref ref-type="bibr" rid="B31">Rohling, 1983</xref>), which is primarily used in obstacle detection and avoidance. CFAR relies on the identification of high-intensity regions in the heatmap scans and applies a sliding window-based thresholding approach to select these regions. Methods like CFAR are better suited for detecting moving objects but often overlook small static objects due to their low reflectivity. However, when it comes to ego-motion estimation, it is crucial to not suppress the features of static targets. Therefore, this paper used high-resolution heatmaps as input for an end-to-end learning-based approach for ego-velocity estimation. We did this by learning a transformation (angular and linear velocities) between the consecutive pairs of 3D radar heatmaps. This method used a 3D convolutional neural network (3DCNN) (<xref ref-type="bibr" rid="B38">Tran et al., 2014</xref>) to extract the features associated with the objects in radar scans. A decoder network later performed the ego-velocity regression based on the feature matching across the pairs of scans, as illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref> and explained in <xref ref-type="sec" rid="s3-2">Section 3.2</xref>. The major advantage of our method is that we do not select features by hand; our network learns them by using pose trajectory ground truth. The contributions of this paper can be summarized thus:<list list-type="simple">
<list-item>
<p>&#x2022; We use the full 3D radar heatmap scans for scan-to-scan registration in ego-motion estimation. The majority of current state-of-the-art methods seem to use radar point clouds, hand-crafted features, or additional sensors.</p>
</list-item>
<list-item>
<p>&#x2022; We propose an end-to-end ego-velocity estimation architecture, which includes a 3D convolution over all three channels of the heatmap scan to capture features associated with the motion and an attention mechanism for selection of the significant features for regression. Our method achieves 0.037&#xa0;m/s RMSE (root mean squared error) in linear forward speed and 0.048 deg/s in heading angle rate, tested on the publicly available ColoRadar dataset (<xref ref-type="bibr" rid="B26">Kramer et al., 2022</xref>).</p>
</list-item>
<list-item>
<p>&#x2022; We investigate the effect of the ego-velocity regressor architecture through extensive experiments across different environments and speeds. We compare several alternatives for the regressor architecture: no attention, a transformer encoder, self-attention, and channel attention. In two of the selected results, we show that a) the RMSE increases by only 5% in the test environment compared to the training environment, while b) the error increases by 90% for higher-speed test data.</p>
</list-item>
</list>
</p>
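For reference, the sliding-window thresholding behind CFAR can be sketched in a few lines. The parameters below (training/guard cell counts, false-alarm rate) are illustrative assumptions, not the pre-processing used by any cited method:

```python
import numpy as np

def ca_cfar_1d(power, num_train=8, num_guard=2, rate_fa=1e-3):
    """Cell-averaging CFAR over a 1D power profile (e.g., one range line of a
    radar heatmap). Returns a boolean detection mask."""
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    # Threshold scaling factor for the desired false-alarm rate.
    num_cells = 2 * num_train
    alpha = num_cells * (rate_fa ** (-1.0 / num_cells) - 1.0)
    half = num_train + num_guard
    for i in range(half, n - half):
        # Noise level estimated from training cells on both sides,
        # excluding the guard cells around the cell under test.
        left = power[i - half : i - num_guard]
        right = power[i + num_guard + 1 : i + half + 1]
        noise = np.mean(np.concatenate([left, right]))
        detections[i] = power[i] > alpha * noise
    return detections
```

Because the threshold scales with the local noise estimate, weak returns from small static objects can fall below it; this is exactly the information loss that motivates operating on the full heatmap instead.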
<p>The paper is organized thus: <xref ref-type="sec" rid="s2">Section 2</xref> presents the related work; <xref ref-type="sec" rid="s3">Section 3</xref> includes problem formulation and network architecture details; radar data format, heatmap processing, ground truth calculation, and training are introduced in <xref ref-type="sec" rid="s4">Section 4</xref>; evaluation of models and result comparison are discussed in <xref ref-type="sec" rid="s5">Section 5</xref>; <xref ref-type="sec" rid="s6">Section 6</xref> concludes and suggests future work.</p>
</sec>
<sec id="s2">
<title>2 Related work</title>
<p>Radar has long been a sensor of choice for emergency braking and obstacle detection due to its capability of perception in visually degraded environments, such as bad weather (rain, fog, and snowfall), darkness, dust, and smoke. Initial research with automotive radars was conducted in the late 1990s (<xref ref-type="bibr" rid="B10">Clark and Durrant-Whyte, 1998</xref>); in the last several years, significant work has been conducted in radar-based odometry and SLAM (<xref ref-type="bibr" rid="B12">Daniel et al., 2017</xref>; <xref ref-type="bibr" rid="B17">Ghabcheloo and Siddiqui, 2018</xref>). Ego-motion estimation research predominantly focuses on two types of sensors: spinning radars and Doppler automotive radars (SoC radars).</p>
<sec id="s2-1">
<title>2.1 Spinning radar</title>
<p>Spinning radars have been widely used for SLAM and odometry due to their high-resolution image-like data and 360&#xb0; coverage. However, these radars are bulky (about 6&#xa0;kg) and are not energy-efficient. They provide only 2D scans (azimuth and range) with 360&#xb0; spatial coverage. Several spinning radar datasets (<xref ref-type="bibr" rid="B3">Barnes et al., 2020</xref>; <xref ref-type="bibr" rid="B24">Kim et al., 2020</xref>; <xref ref-type="bibr" rid="B34">Sheeny et al., 2020</xref>; <xref ref-type="bibr" rid="B8">Burnett et al., 2022</xref>) are available for benchmarking the state-of-the-art methods. These methods usually fall into two categories: learning-based and non-learning-based. The majority of non-learning-based methods perform descriptor selection from image-like radar scans, followed by registration across consecutive frames. In an ego-motion estimation study, <xref ref-type="bibr" rid="B9">Cen and Newman (2018</xref>) used scan matching with hand-crafted feature points from radar scans. <xref ref-type="bibr" rid="B1">Adolfsson et al. (2021</xref>) proposed a method of selecting an arbitrary number of the highest-intensity returns per azimuth, and, after oriented surface point calculation, registration was performed between the final key frame and the current frame. Some recent learning-based methods have extracted the key points end-to-end by self-supervised learning (e.g. <xref ref-type="bibr" rid="B4">Barnes et al., 2019</xref>). In <xref ref-type="bibr" rid="B7">Burnett et al. (2021</xref>), features were first learned in an unsupervised way, and then the feature extractor was combined with classical probabilistic estimators for ego-motion estimation.</p>
</sec>
<sec id="s2-2">
<title>2.2 SoC radar</title>
<p>SoC (system-on-chip) radars consume less power and are lighter than spinning radars. With the evolution of SoC radars over the past five&#xa0;years, new high-resolution sensors have been introduced. Modern radars provide 4D data (range, azimuth, elevation, and Doppler). With these radars, ego-motion estimation falls into two categories: instantaneous (single scan) and registration-based (multiple scans). Instantaneous ego-motion estimation relies on the Doppler velocity of targets in the scan and is solved through non-linear optimization (e.g. <xref ref-type="bibr" rid="B22">Kellner et al., 2013</xref>). The instantaneous approach cannot estimate 6DoF (three-dimensional linear and angular transformations) ego-motion from only one radar sensor. To solve full ego-motion, we need multiple radar sensors (a minimum of two), as in <xref ref-type="bibr" rid="B23">Kellner et al. (2014</xref>), or an additional IMU (inertial measurement unit) sensor, as in <xref ref-type="bibr" rid="B17">Ghabcheloo and Siddiqui (2018</xref>). Another approach is to solve the ego-motion by registration across consecutive radar scans of a single radar sensor. For example, <xref ref-type="bibr" rid="B2">Almalioglu et al. (2021</xref>) used NDT registration (<xref ref-type="bibr" rid="B30">Magnusson et al., 2007</xref>) and an IMU-based motion model. All these methods operate on so-called radar point clouds, sparse sets of radar points pre-processed from the heatmaps. Pre-processing is performed, for example, by CFAR or simple intensity thresholding. Our method, on the other hand, uses full radar heatmaps and performs ego-velocity regression. We also propose a novel network architecture that has a 3DCNN for feature extraction and attention layers for selecting significant features for ego-velocity regression.</p>
<p>Learning-based methods for ego-motion estimation have emerged in recent years with the evolution of deep neural networks. State-of-the-art research has shown better performance than the classical methods for ego-motion estimation, optical-flow estimation, and SLAM front-ends. MilliEgo (<xref ref-type="bibr" rid="B28">Lu et al., 2020</xref>) is an end-to-end approach for solving radar-based odometry. Its methodology differs from ours in several aspects: 1) milliEgo takes the radar point cloud as an input, which suppresses some useful information from the heatmap; 2) our network architecture is very different in terms of feature extraction and regression; 3) milliEgo has only been evaluated indoors, while we provide evaluation on an indoor-to-outdoor and low-speed-to-high-speed dataset; 4) milliEgo uses three single-chip radar sensors (<xref ref-type="bibr" rid="B27">Li et al., 2022</xref>), while our sensor is a high-resolution TI AWR2243 (four-chip cascade radar); 5) milliEgo uses an additional IMU sensor.</p>
</sec>
</sec>
<sec sec-type="methods" id="s3">
<title>3 Methodology</title>
<p>This section starts with the problem formulation in Section 3.1, where we formalize the ego-motion problem and introduce the loss function for supervised learning; Section 3.2 then explains the network architecture, including the feature extractor and ego-velocity regressor shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<sec id="s3-1">
<title>3.1 Problem formulation</title>
<p>The ego-motion of a frame attached to a moving body is the change in transformation <italic>T</italic> &#x3d; (<italic>R</italic>, <italic>t</italic>), rotation, and translation, respectively, over time with respect to a fixed frame. We used angular and linear 6-DoF twist (<italic>V</italic> &#x3d; [<italic>V</italic>
<sub>
<italic>x</italic>
</sub>, <italic>V</italic>
<sub>
<italic>y</italic>
</sub>, <italic>V</italic>
<sub>
<italic>z</italic>
</sub>], <italic>&#x3c9;</italic> &#x3d; [<italic>&#x3c9;</italic>
<sub>
<italic>x</italic>
</sub>, <italic>&#x3c9;</italic>
<sub>
<italic>y</italic>
</sub>, <italic>&#x3c9;</italic>
<sub>
<italic>z</italic>
</sub>]) to represent ego-motion. We solved this problem by using registration&#x2014;geometrically aligning two radar scans, which are in the form of intensity heatmaps. To solve the registration problem, we trained a model that takes as input two consecutive radar heatmaps (<italic>S</italic>
<sub>
<italic>i</italic>
</sub>, <italic>S</italic>
<sub>
<italic>i</italic>&#x2b;1</sub>) (<xref ref-type="fig" rid="F1">Figure 1</xref>) and outputs the predictions of the linear and angular velocities <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> as follows:<disp-formula id="e1">
<mml:math id="m2">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="script">F</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>;</mml:mo>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(1)</label>
</disp-formula>where <inline-formula id="inf2">
<mml:math id="m3">
<mml:mi mathvariant="script">F</mml:mi>
</mml:math>
</inline-formula> is a neural network with parameters <italic>&#x3b8;</italic>.</p>
<p>Our neural network is composed of an encoder (3DCNN feature extractor) and an ego-velocity regressor network <italic>D</italic> (<xref ref-type="fig" rid="F1">Figure 1</xref>). Details of the network architecture are given in <xref ref-type="sec" rid="s3-2">Section 3.2</xref>. The objective of the training is to find the set of parameters <italic>&#x3b8;</italic> that minimizes the distance between the network output <inline-formula id="inf3">
<mml:math id="m4">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and the ground truth velocities (<italic>V</italic>, <italic>&#x3c9;</italic>) using the following loss:<disp-formula id="e2">
<mml:math id="m5">
<mml:mi>L</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>s</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
</mml:munder>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>,</mml:mo>
</mml:math>
<label>(2)</label>
</disp-formula>where <italic>S</italic> is the dataset and <italic>N</italic> &#x3d; &#x7c;<italic>S</italic>&#x7c; is the number of training data samples. Each sample includes a pair of heatmaps and a ground truth velocity vector. Scalar values <italic>w</italic>
<sub>1</sub> and <italic>w</italic>
<sub>2</sub> are weighting factors to balance the linear and angular portions of the loss.</p>
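As a concrete illustration, the loss of Eq. 2 can be written directly in NumPy. The batch layout and the default weights below are placeholder assumptions:

```python
import numpy as np

def velocity_loss(v_pred, w_pred, v_true, w_true, w1=1.0, w2=1.0):
    """Weighted loss of Eq. 2 over a batch of N samples:
    (1/N) * sum_i [ w1*||V_i - V_hat_i||_2^2 + w2*||omega_i - omega_hat_i||_2^2 ].
    Each input has shape (N, 3)."""
    lin = np.sum((v_true - v_pred) ** 2, axis=-1)  # squared L2, linear part
    ang = np.sum((w_true - w_pred) ** 2, axis=-1)  # squared L2, angular part
    return float(np.mean(w1 * lin + w2 * ang))
```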
</sec>
<sec id="s3-2">
<title>3.2 Network architecture</title>
<p>The network architecture is illustrated in <xref ref-type="fig" rid="F4">Figure 4</xref>. It is composed of two main building blocks: a 3DCNN feature extractor and an ego-velocity regressor block. To estimate the ego-motion for a given pair, we need to perform feature matching between consecutive radar heatmaps. Feature matching is the process of identifying corresponding features between two radar heatmaps. Corresponding features are those that represent the same region of interest (salient area with higher intensity) being tracked in both heatmaps. A CNN is capable of learning features in the form of patches, corners, and edges. A fully connected regressor network can perform descriptor (feature vector) matching by finding the closest match from one heatmap to the other in the pair based on geometry (Euclidean distance in our case) (<xref ref-type="bibr" rid="B40">Wang et al., 2017</xref>; <xref ref-type="bibr" rid="B11">Costante et al., 2016</xref>). By incorporating the transformer encoder in the ego-velocity regressor, we enhance the feature matching process for more accurate ego-motion estimation. Details of the network architecture are explained in <xref ref-type="sec" rid="s3-2-1">3.2.1</xref> and <xref ref-type="sec" rid="s3-2-2">3.2.2</xref>.</p>
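The Euclidean-distance matching idea can be illustrated with a minimal nearest-neighbour sketch. The descriptors here are hypothetical stand-ins; in the actual network, the correspondence is learned implicitly by the regressor rather than computed explicitly:

```python
import numpy as np

def match_descriptors(feat_a, feat_b):
    """Nearest-neighbour matching of feature descriptors between two scans:
    for each row (descriptor) of feat_a, return the index of the closest row
    of feat_b by Euclidean distance."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (np.sum(feat_a ** 2, axis=1)[:, None]
          - 2.0 * feat_a @ feat_b.T
          + np.sum(feat_b ** 2, axis=1)[None, :])
    return np.argmin(d2, axis=1)
```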
<sec id="s3-2-1">
<title>3.2.1 Feature extraction</title>
<p>Our input data are three-dimensional, with intensity along three axes: range (128), azimuth (128), and elevation (32). We thus use a 3DCNN-based feature extractor to handle these data. Here, our feature extractor takes a pair of radar heatmaps in Cartesian form and generates the feature vector for use by the ego-velocity regressor. This network has nine convolutional layers, where the number of convolutional filters varies from 64 to 1024. Filter sizes are (7 &#xd7; 7 &#xd7; 7) and (5 &#xd7; 5 &#xd7; 5) for the first two layers and (3 &#xd7; 3 &#xd7; 3) for the remaining layers. The varying filter sizes help the network learn both large- and small-scale features. We denote a feature vector obtained from the pair of scans by <italic>f</italic>. The dimension of <italic>f</italic> for each batch is (1,2,2,1024), and it is passed to the regressor block for further processing.</p>
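A shortened PyTorch sketch of such a 3DCNN feature extractor is given below. The strides, channel widths, and reduced layer count (five instead of nine) are illustrative assumptions; only the shrinking kernel sizes (7&#xb3;, 5&#xb3;, then 3&#xb3;) follow the description above:

```python
import torch
import torch.nn as nn

class FeatureExtractor3D(nn.Module):
    """Sketch of a 3DCNN feature extractor in the spirit of Section 3.2.1:
    stacked 3D convolutions whose kernels shrink from 7^3 to 3^3 while the
    channel count grows toward 1024. Strides and exact widths here are
    illustrative assumptions, not the authors' configuration."""
    def __init__(self):
        super().__init__()
        chans = [2, 64, 128, 256, 512, 1024]   # heatmap pair stacked as 2 input channels
        kernels = [7, 5, 3, 3, 3]
        layers = []
        for c_in, c_out, k in zip(chans[:-1], chans[1:], kernels):
            layers += [nn.Conv3d(c_in, c_out, k, stride=2, padding=k // 2),
                       nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*layers)

    def forward(self, pair):  # pair: (B, 2, range, azimuth, elevation)
        return self.net(pair)

# A small input keeps the sketch fast; the paper's heatmaps are 128x128x32.
x = torch.randn(1, 2, 32, 32, 32)
f = FeatureExtractor3D()(x)
```

With five stride-2 layers, the 32&#xb3; toy input collapses to a single spatial cell with 1024 channels, mirroring how the real heatmap shrinks to the (1,2,2,1024) feature vector <italic>f</italic>.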
</sec>
<sec id="s3-2-2">
<title>3.2.2 Feature refinement and regression</title>
<p>The ego-velocity regressor block comprises a feature refinement block (referred to as the &#x201c;Transformer Block&#x201d; in <xref ref-type="fig" rid="F4">Figure 4</xref>) and a dual-head fully connected decoder (<bold>FC decoder</bold>). Each decoder head has three layers, and the last layer gives the final output as a vector of three elements. In the feature refinement block, we tested the following attention strategies (<xref ref-type="bibr" rid="B39">Vaswani et al., 2017</xref>):<list list-type="simple">
<list-item>
<p>&#x2022; <bold>3DCNN &#x2b; SA &#x2b; FC</bold>, where &#x201c;SA&#x201d; is self-attention;</p>
</list-item>
<list-item>
<p>&#x2022; <bold>3DCNN &#x2b; CA &#x2b; FC</bold>, where &#x201c;CA&#x201d; is channel attention;</p>
</list-item>
<list-item>
<p>&#x2022; <bold>3DCNN &#x2b; Transformer &#x2b; FC</bold>, where &#x201c;Transformer&#x201d; is a transformer encoder;</p>
</list-item>
<list-item>
<p>&#x2022; <bold>3DCNN &#x2b; FC</bold>, a model without attention.</p>
</list-item>
</list>
</p>
<p>The attention mechanism assigns higher weights to significant features in comparison to other features of the feature vector obtained from the 3DCNN feature extractor. In the following paragraphs, we provide details of the tested attention strategies.</p>
<p>
<bold>3DCNN &#x2b; Transformer &#x2b; FC:</bold> Transformers benefit from multi-head attention and have shown better performance than CNN on vision tasks (<xref ref-type="bibr" rid="B14">Dosovitskiy et al., 2021</xref>). Multi-head attention learns the local and global features from the input feature vector concatenated with the attention mask. We use the transformer encoder layers <italic>TransEnc</italic> with positional encoding <italic>PE</italic>; these select significant and stable features with their local and global context from the input feature vector. Since our input is one instance of an input pair, we follow positional encoding similar to <xref ref-type="bibr" rid="B14">Dosovitskiy et al. (2021</xref>), which is applied only to spatial dimensions. The positional encoding takes the feature vector of an input heatmap pair and generates the positional information for features in the input feature vector. The output of positional encoding <italic>PE</italic> is added element-wise to the feature vector, which is further passed through two transformer encoder layers (<xref ref-type="fig" rid="F4">Figure 4</xref>). Each of the transformer encoder layers has eight multi-head attention units, a layer normalization <italic>LayerNorm</italic>, a max pooling <italic>MaxPool</italic> layer (for aggregating and preserving contextual information associated with the features), and an activation function. The output feature vector from the transformer encoder <italic>f</italic>
<sub>
<italic>out</italic>
</sub> is passed through the FC dual-head velocity regressor.<disp-formula id="e3">
<mml:math id="m6">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">out</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>y</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>N</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>m</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mi>r</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>E</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>c</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mi>E</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(3)</label>
</disp-formula>
</p>
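Equation 3 can be sketched in PyTorch as follows. The token count, the learned positional encoding, and the layer hyperparameters are illustrative assumptions rather than the exact configuration:

```python
import torch
import torch.nn as nn

class RefineBlock(nn.Module):
    """Sketch of Eq. 3: f_out = MaxPool(LayerNorm(TransEnc(PE(f) + f))).
    Token count, embedding width, and head count are illustrative."""
    def __init__(self, dim=1024, heads=8, layers=2, tokens=4):
        super().__init__()
        # Learned positional encoding over the spatial token positions.
        self.pe = nn.Parameter(torch.zeros(1, tokens, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.enc = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f):            # f: (B, tokens, dim)
        x = self.enc(self.pe + f)    # PE(f) + f, then transformer encoder
        x = self.norm(x)             # LayerNorm
        return x.max(dim=1).values   # max pool over tokens -> (B, dim)

f = torch.randn(2, 4, 1024)          # e.g. a 2x2 spatial grid flattened to 4 tokens
out = RefineBlock()(f)
```

The max pooling over tokens aggregates the refined features into a single vector per pair, which the FC dual-head regressor then maps to linear and angular velocities.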
<p>
<bold>3DCNN &#x2b; SA &#x2b; FC:</bold> For the self-attention mechanism, we use the same attention technique as in <xref ref-type="bibr" rid="B28">Lu et al. (2020</xref>). The purpose of self-attention is to focus on stable and geometrically meaningful features rather than noisy and less stable features. Applied to the features obtained from the 3DCNN, this method performs global average pooling <italic>AvgPool</italic> to aggregate the features and outputs an attention mask. The attention mask is then multiplied with the feature input through the element-wise multiplication operator &#x2297;. Denoting the number of channels in the feature vector (corresponding to the elevation dimension in the heatmap) by <italic>c</italic>, a dense layer with rectified linear unit activation by <italic>MLP</italic>, and the resulting attention mask by <italic>S</italic>, <italic>f</italic>
<sub>
<italic>out</italic>
</sub> is computed using the following equations:<disp-formula id="e4">
<mml:math id="m7">
<mml:msup>
<mml:mrow>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>v</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(4)</label>
</disp-formula>
<disp-formula id="e5">
<mml:math id="m8">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">out</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mo>&#x2297;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>.</mml:mo>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
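<p>A minimal sketch of Eqs 4, 5 follows, assuming a single dense layer with ReLU activation for the MLP and features of shape (height, width, c); the function names and shapes are illustrative, not taken from the implementation.</p>

```python
import numpy as np

def self_attention_mask(f, W, b):
    # S^{1x1xc} = MLP(AvgPool(f))  -- Eq. (4)
    # AvgPool is a global average over the spatial dimensions,
    # MLP a dense layer with ReLU activation.
    pooled = f.mean(axis=(0, 1))            # shape (c,)
    return np.maximum(pooled @ W + b, 0.0)  # shape (c,)

def refine(f, S):
    # f_out = S (x) f  -- Eq. (5): broadcast the channel mask over space
    return f * S[None, None, :]
```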
<p>
<bold>3DCNN &#x2b; CA &#x2b; FC:</bold> Channel attention was proposed in <xref ref-type="bibr" rid="B41">Woo et al. (2018</xref>). For a given feature vector, channel attention generates an attention mask across all channels, capturing rich contextual information with the help of max pooling <italic>MaxPool</italic>, while still using <italic>AvgPool</italic> to aggregate spatial information. It is a lightweight attention module used here for 3DCNN feature refinement. <italic>C</italic> is the channel attention mask, and <italic>f</italic>
<sub>
<italic>out</italic>
</sub> is computed as follows:<disp-formula id="e6">
<mml:math id="m9">
<mml:msup>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>v</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>P</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>P</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(6)</label>
</disp-formula>
<italic>&#x3c3;</italic> is the sigmoid function, which keeps the attention mask values between 0 and 1.<disp-formula id="e7">
<mml:math id="m10">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">out</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mo>&#x2297;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo>,</mml:mo>
</mml:math>
<label>(7)</label>
</disp-formula>where &#x2297; is the element-wise multiplication between the attention mask and the feature vector <italic>f</italic>.</p>
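<p>Eqs 6, 7 can be sketched as follows, assuming a shared two-layer MLP for both pooling paths, as in Woo et al. (2018); the function names and shapes are illustrative.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, W1, W2):
    # C^{1x1xc} = sigma(MLP(AvgPool(f)) + MLP(MaxPool(f)))  -- Eq. (6)
    # Both pooling paths share the same two-layer MLP weights.
    mlp = lambda x: np.maximum(x @ W1, 0.0) @ W2
    avg = f.mean(axis=(0, 1))   # global average pooling over space
    mx = f.max(axis=(0, 1))     # global max pooling over space
    return sigmoid(mlp(avg) + mlp(mx))

def refine(f, C):
    # f_out = C (x) f  -- Eq. (7): element-wise channel re-weighting
    return f * C[None, None, :]
```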
<p>
<bold>FC decoder:</bold> The final feature vector <italic>f</italic>
<sub>
<italic>out</italic>
</sub> is passed to the two regressor blocks <italic>D</italic>
<sub>1</sub> and <italic>D</italic>
<sub>2</sub> for linear and angular velocity regression:<disp-formula id="e8">
<mml:math id="m11">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">out</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(8)</label>
</disp-formula>
<disp-formula id="e9">
<mml:math id="m12">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">out</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(9)</label>
</disp-formula>
</p>
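<p>A schematic version of the dual-head decoder in Eqs 8, 9 follows, assuming plain dense layers with ReLU between hidden layers; the layer sizes and names are illustrative, not taken from the implementation.</p>

```python
import numpy as np

def fc_head(x, layers):
    # A stack of dense layers: ReLU between hidden layers, linear output
    for W, b in layers[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = layers[-1]
    return x @ W + b

def decode(f_out, D1, D2):
    # V_hat = D1(f_out), omega_hat = D2(f_out)  -- Eqs. (8), (9)
    # Two independent heads map the shared feature to linear (3) and
    # angular (3) velocity components.
    return fc_head(f_out, D1), fc_head(f_out, D2)
```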
</sec>
</sec>
</sec>
<sec id="s4">
<title>4 Experiments</title>
<p>In this section, we describe the data (4.1), the raw data format and sensor details (4.2), the heatmap generation process (4.3), the ground-truth calculation (4.4), and the model training (4.5).</p>
<sec id="s4-1">
<title>4.1 Data</title>
<p>We evaluated our models on the ColoRadar dataset (<xref ref-type="bibr" rid="B26">Kramer et al., 2022</xref>), which was recorded in seven different indoor and outdoor environments with a hand-carried sensor rig for the tasks of localization, ego-motion estimation, and SLAM. The ColoRadar dataset has a total of 57 sequences. The sensor rig included a high-resolution radar, a low-resolution radar, a 3D lidar, an IMU, and, indoors, a Vicon motion capture system. The radar was mounted front-facing, with axes (x &#x3d; azimuth, y &#x3d; range, z &#x3d; elevation). The authors provided ground-truth poses (position and orientation) in the body (sensor rig) frame, generated by a 3D lidar-IMU-based SLAM package (<xref ref-type="bibr" rid="B19">Hess et al., 2016</xref>), as a timestamped pose trajectory for each sequence. The data are organized in KITTI format (<xref ref-type="bibr" rid="B16">Geiger et al., 2013</xref>), where the sensor readings and the ground-truth poses are stored with their timestamps for each data sequence. We used the high-resolution radar data with the ground-truth poses and the subset of sequences specified in <xref ref-type="table" rid="T1">Table 1</xref> for this experiment.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>ColoRadar data sequences used in our experiments.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Environment name</th>
<th align="center">Speed</th>
<th align="center">Type</th>
<th align="center">Ground truth</th>
<th align="center">Platform</th>
<th align="center">Sequence length (seconds)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Longboard</td>
<td align="center">Fast</td>
<td align="center">Outdoor</td>
<td align="center">Lidar-inertial</td>
<td align="center">Electric skateboard</td>
<td align="center">170 to 350</td>
</tr>
<tr>
<td align="left">Edgar Army</td>
<td align="center">Slow</td>
<td align="center">Mine</td>
<td align="center">Lidar-inertial</td>
<td align="center">Walking</td>
<td align="center">150 to 480</td>
</tr>
<tr>
<td align="left">ARPG Lab, ECR, and Hallway</td>
<td align="center">Slow</td>
<td align="center">Structured room</td>
<td align="center">Lidar-inertial</td>
<td align="center">Walking</td>
<td align="center">100 to 250</td>
</tr>
<tr>
<td align="left">Outdoor</td>
<td align="center">Slow</td>
<td align="center">Outdoor</td>
<td align="center">Lidar-inertial</td>
<td align="center">Walking</td>
<td align="center">100 to 200</td>
</tr>
<tr>
<td align="left">Aspen</td>
<td align="center">Slow</td>
<td align="center">Outdoor night</td>
<td align="center">Lidar-inertial</td>
<td align="center">Walking</td>
<td align="center">100 to 200</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-2">
<title>4.2 Radar sensor and raw data format</title>
<p>Radar data were collected with a high-resolution FMCW (frequency-modulated continuous wave) sensor, the TI-MMWAVE Cascade AWR2243 (<xref ref-type="bibr" rid="B37">Swami et al., 2017</xref>). It has three vertically placed elevation transmitter antennas and nine horizontal azimuth transmitter antennas to cover a three-dimensional field of view, and 16 receiver antennas to receive the signal reflected from objects and landmarks in the environment. The sensor has a field of view of 140&#xb0; in azimuth and 45&#xb0; in elevation, with an angular resolution of 1&#xb0; in azimuth and 15&#xb0; in elevation. As <xref ref-type="fig" rid="F2">Figure 2</xref> illustrates, the transmitters transmit a set of <italic>N</italic> electromagnetic signals, known as chirps, in each frame. Within each chirp, the frequency of the transmitted signal increases linearly with time from the start frequency <italic>f</italic>
<sub>
<italic>c</italic>
</sub> to maximum frequency <italic>f</italic>
<sub>
<italic>c</italic>
</sub> &#x2b; <italic>B</italic>, where <italic>B</italic> is the bandwidth. Each chirp is sampled in time by the time difference <italic>T</italic>
<sub>
<italic>s</italic>
</sub>, the &#x201c;fast time&#x201d; dimension. Receivers receive the reflected signal after a time delay <italic>t</italic>, which is used to calculate the range of the reflector. In each radar frame, the data are stored as a two-dimensional matrix of samples and chirps (the number of chirps in the frame is known as the &#x201c;slow time&#x201d; dimension) for each receiver antenna. The raw data thus form a three-dimensional (samples, chirps, and receivers) complex-valued tensor, also known as &#x201c;ADC (analog-to-digital converted) data&#x201d;. In FMCW radars, the spatial resolution of the sensor is limited by the number of receiver antennas. To achieve better spatial resolution in azimuth and elevation without adding more physical antennas, modern radars use a MIMO (multiple input, multiple output) technique. MIMO creates a virtual receiver array of size (number of transmitters &#xd7; number of receivers) (<xref ref-type="bibr" rid="B15">Engels et al., 2017</xref>) for high-resolution angle estimation, so the output data dimensions become (samples, chirps, number of transmitters &#xd7; number of receivers).</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Raw radar data formation for a given frame; details are explained in <xref ref-type="sec" rid="s4-2">4.2</xref>. Each frame contains transmitted chirps (blue dotted lines, with dots as samples) and received chirps (red lines). For each receiver, an FFT over the received samples along the fast time dimension gives range, and along the slow time dimension gives Doppler; an FFT along the receiver dimension gives the angles.</p>
</caption>
<graphic xlink:href="frsip-03-1198205-g002.tif"/>
</fig>
</sec>
<sec id="s4-3">
<title>4.3 3D heatmap processing</title>
<p>The first step was to perform calibration in phase and frequency to address the mismatch caused by the four radar transceivers on the cascade board. Calibration parameters vary from sensor to sensor and are provided with the dataset. We performed the calibration with the existing ColoRadar development toolkit.</p>
<p>After the calibration, we performed the post-processing using fast Fourier transforms in range, Doppler, and angle dimensions with a velocity compensation algorithm to avoid Doppler ambiguity caused by the movement of the radar in MIMO (<xref ref-type="bibr" rid="B5">Bechter et al., 2017</xref>). In post-processing, the MIMO ADC data are passed to a two-dimensional fast Fourier transform to obtain the range-Doppler heatmap, followed by a phased array angle processing module to obtain the azimuth and elevation. The processed data were organized into discrete 3D bins with two values for each bin (intensity and Doppler velocity). We do not use the Doppler velocity in the input data. The scan dimensions are (elevation &#x3d; 32, azimuth &#x3d; 128, range &#x3d; 128), which are the parameter settings used to collect the dataset (<xref ref-type="bibr" rid="B26">Kramer et al., 2022</xref>). A heatmap scan is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, representing the bird&#x27;s-eye view (top view) of the 3D heatmap shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the heatmap data for all elevation layers with range and azimuth.</p>
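<p>The FFT chain described above can be outlined as follows. This is a simplified, illustrative sketch: it omits calibration, windowing, and the velocity compensation step, and it collapses the phased-array processing that separates azimuth from elevation over the 2D virtual array into a single angle FFT; the function name is ours.</p>

```python
import numpy as np

def adc_to_heatmap(adc):
    # adc: complex ADC tensor of shape (samples, chirps, virtual receivers)
    rng_fft = np.fft.fft(adc, axis=0)                               # fast time -> range
    dop_fft = np.fft.fftshift(np.fft.fft(rng_fft, axis=1), axes=1)  # slow time -> Doppler
    ang_fft = np.fft.fftshift(np.fft.fft(dop_fft, axis=2), axes=2)  # receivers -> angle
    return np.abs(ang_fft)                                          # reflection intensity
```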
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>3D heatmap data: higher reflection intensities indicate regions that are likely to contain landmarks (Cartesian plot, with range, azimuth, and elevation in meters, for ease of visualization).</p>
</caption>
<graphic xlink:href="frsip-03-1198205-g003.tif"/>
</fig>
</sec>
<sec id="s4-4">
<title>4.4 Ground truth calculation</title>
<p>The dataset provided ground-truth poses in the sensor rig frame at a rate of 10 FPS. To perform radar-based ego-motion estimation, the ground truth needed to be in the radar sensor frame. Since the radar has a lower data frequency (5 FPS), we located the ground-truth instances corresponding to the radar timestamps and converted them from the body frame to the radar sensor frame using the static transform provided in the dataset.</p>
<p>We then use the following equation (see (<xref ref-type="bibr" rid="B29">Lynch and Park, 2017</xref>) for more details) to calculate ground-truth twist from consecutive pose transformations:<disp-formula id="e10">
<mml:math id="m13">
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mtable class="matrix">
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mtd>
<mml:mtd columnalign="center">
<mml:msub>
<mml:mrow>
<mml:mi>V</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="center">
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd columnalign="center">
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>T</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x307;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(10)</label>
</disp-formula>where <italic>T</italic>(<italic>i</italic>) is the transformation at time index <italic>i</italic> and [<italic>&#x3c9;</italic>
<sub>
<italic>i</italic>
</sub>] is the skew-symmetric matrix containing the angular velocity. <inline-formula id="inf4">
<mml:math id="m14">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x307;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> is the time derivative of <italic>T</italic>(<italic>i</italic>) and is calculated approximately using <inline-formula id="inf5">
<mml:math id="m15">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>&#x307;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mi>d</mml:mi>
<mml:mi>t</mml:mi>
</mml:math>
</inline-formula>.</p>
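<p>Equation 10 can be evaluated directly from two consecutive homogeneous transformation matrices; the following sketch (function name ours) recovers the linear velocity and extracts the angular velocity vector from the skew-symmetric block.</p>

```python
import numpy as np

def twist_from_poses(T_i, T_inext, dt):
    # Eq. (10): [[omega_i] V_i; 0 1] = T(i)^{-1} Tdot(i),
    # with the finite-difference approximation Tdot(i) = (T(i+1) - T(i)) / dt
    Tdot = (T_inext - T_i) / dt
    M = np.linalg.inv(T_i) @ Tdot
    omega_skew = M[:3, :3]          # [omega_i], skew-symmetric 3x3 block
    V = M[:3, 3]                    # linear velocity
    # Recover the angular velocity vector from the skew-symmetric matrix
    omega = np.array([omega_skew[2, 1], omega_skew[0, 2], omega_skew[1, 0]])
    return V, omega
```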
</sec>
<sec id="s4-5">
<title>4.5 Training</title>
<p>We used a pair of radar heatmaps as a training sample and the corresponding linear and angular velocities as ground truth. While processing the data for training, we kept the samples in temporal order using the timestamps of each sequence. All ground-truth labels were normalized between 0 and 1 for stable network training. We trained four networks (as described in <xref ref-type="sec" rid="s3">Section 3</xref>): 3DCNN &#x2b; FC, 3DCNN &#x2b; SA &#x2b; FC, 3DCNN &#x2b; CA &#x2b; FC, and 3DCNN &#x2b; Transformer &#x2b; FC. These networks were trained on the same dataset with similar hyperparameters, for 50 epochs on an Nvidia RTX 3080 with the Adam optimizer (<xref ref-type="bibr" rid="B25">Kingma and Ba, 2014</xref>) and a learning rate of 10<sup>&#x2013;4</sup>. We chose 8,000 heatmap pairs from four environments, a small subset rather than the whole dataset, and used 80% of the data for training and the rest for validation and testing (10% each). In the feature extractor, we used dropout (<xref ref-type="bibr" rid="B35">Srivastava et al., 2014</xref>) in all layers (rate 0.2 up to the eighth layer, 0.5 in the last layer) to avoid overfitting by randomly dropping units. For stable and faster training, we used batch normalization (<xref ref-type="bibr" rid="B21">Ioffe and Szegedy, 2015</xref>) in each 3DCNN layer. We also used LeakyReLU (<xref ref-type="bibr" rid="B42">Xu et al., 2015</xref>) with slope 0.01 as the activation function in all layers; this activation helps avoid the vanishing gradient problem in architectures with many layers. The number of parameters and layers for our ego-velocity regressor and the models with attention mechanisms are described in <xref ref-type="sec" rid="s3-2-2">3.2.2</xref> and shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
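<p>The label normalization used for training is a standard per-component min-max scaling; the following sketch (function names ours) also shows the inverse mapping needed to recover predicted velocities in physical units.</p>

```python
import numpy as np

def fit_minmax(labels):
    # Per-component minimum and maximum over the training set
    return labels.min(axis=0), labels.max(axis=0)

def normalize(labels, lo, hi):
    # Map each twist component to [0, 1]
    return (labels - lo) / (hi - lo)

def denormalize(norm, lo, hi):
    # Invert the mapping to recover velocities in physical units
    return norm * (hi - lo) + lo
```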
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Proposed method for estimating linear and angular velocities from consecutive radar scans (<italic>S</italic>
<sub>
<italic>i</italic>
</sub>, <italic>S</italic>
<sub>
<italic>i</italic>&#x2b;1</sub>); each radar scan has intensity values distributed in 3D space (el &#x3d; elevation, az &#x3d; azimuth, and r &#x3d; range). The feature extractor is a 3DCNN, which learns features from the radar data; these features are passed to a transformer encoder and then to an ego-velocity regressor, which outputs the linear and angular velocities.</p>
</caption>
<graphic xlink:href="frsip-03-1198205-g004.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="results" id="s5">
<title>5 Results</title>
<p>We used the following training datasets in our experiments:<list list-type="simple">
<list-item>
<p>&#x2022; <bold>Mixed data:</bold> indoor structured (two sequences), indoor unstructured (one sequence), and outdoor (one sequence), low speed</p>
</list-item>
<list-item>
<p>&#x2022; <bold>Indoor data:</bold> indoor structured (two sequences) and indoor unstructured (one sequence)</p>
</list-item>
</list>and the RMSE (root mean squared error) metric, defined by<disp-formula id="e11">
<mml:math id="m16">
<mml:mi>R</mml:mi>
<mml:mi>M</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>E</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msqrt>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="normal">&#x3a3;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">pred</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>X</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mi>t</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:msqrt>
<mml:mo>,</mml:mo>
</mml:math>
<label>(11)</label>
</disp-formula>where <italic>N</italic> is the number of data samples. We calculate the RMSE for each element of the twist separately. In the tables, the linear velocity (<italic>V</italic>
<sub>
<italic>x</italic>
</sub>, <italic>V</italic>
<sub>
<italic>y</italic>
</sub>, <italic>V</italic>
<sub>
<italic>z</italic>
</sub>) errors are in meters per second, and the angular velocity (<italic>&#x3c9;</italic>
<sub>
<italic>x</italic>
</sub>, <italic>&#x3c9;</italic>
<sub>
<italic>y</italic>
</sub>, <italic>&#x3c9;</italic>
<sub>
<italic>z</italic>
</sub>) errors are in degrees per second.</p>
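<p>Equation 11, applied per twist component, amounts to the following one-liner (function name ours).</p>

```python
import numpy as np

def rmse_per_component(pred, gt):
    # Eq. (11), evaluated independently for each of the six twist elements.
    # pred, gt: arrays of shape (N, 6); returns an array of shape (6,)
    return np.sqrt(np.mean((pred - gt) ** 2, axis=0))
```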
<p>We evaluated our models with the following experiments:<list list-type="simple">
<list-item>
<p>&#x2022; <bold>Trained and tested on mixed data:</bold> results reported in <xref ref-type="table" rid="T2">Table 2</xref> and explained in <xref ref-type="sec" rid="s5-1">Section 5.1</xref>
</p>
</list-item>
<list-item>
<p>&#x2022; <bold>Trained and tested on mixed&#x2013;tested stationary and moving separately:</bold> results in <xref ref-type="table" rid="T5">Tables 5</xref> and <xref ref-type="table" rid="T4">4</xref>, and explained in <xref ref-type="sec" rid="s5-2">5.2</xref>
</p>
</list-item>
<list-item>
<p>&#x2022; <bold>Trained on mixed low speed&#x2013;tested on mixed high speed:</bold> results in <xref ref-type="table" rid="T6">Table 6</xref> and explained in <xref ref-type="sec" rid="s5-3-2">5.3.2</xref>
</p>
</list-item>
<list-item>
<p>&#x2022; <bold>Trained on indoor data&#x2013;tested on outdoor data:</bold> results in <xref ref-type="table" rid="T3">Table 3</xref> and explained in <xref ref-type="sec" rid="s5-3-1">5.3.1</xref>
</p>
</list-item>
</list>
</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Average RMSE errors for each test sequence (trained and tested on the mixed dataset). Smallest errors per sequence are marked in bold.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Test sequence</th>
<th align="center">Method</th>
<th align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub>
<bold>, m/s</bold>
</th>
<th align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub>
<bold>, m/s</bold>
</th>
<th align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub>
<bold>, m/s</bold>
</th>
<th align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</th>
<th align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</th>
<th align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">
<bold>Outdoor</bold>
</td>
<td align="center">
<bold>3DCNN &#x2b; Transformer &#x2b; FC</bold>
</td>
<td align="center">
<bold>0.048</bold>
</td>
<td align="center">
<bold>0.037</bold>
</td>
<td align="center">
<bold>0.120</bold>
</td>
<td align="center">
<bold>0.084</bold>
</td>
<td align="center">0.140</td>
<td align="center">
<bold>0.048</bold>
</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; CA &#x2b; FC</bold>
</td>
<td align="center">0.061</td>
<td align="center">0.052</td>
<td align="center">0.167</td>
<td align="center">0.108</td>
<td align="center">0.135</td>
<td align="center">0.082</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; SA &#x2b; FC</bold>
</td>
<td align="center">0.049</td>
<td align="center">0.040</td>
<td align="center">0.144</td>
<td align="center">0.097</td>
<td align="center">
<bold>0.096</bold>
</td>
<td align="center">0.063</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; FC</bold>
</td>
<td align="center">0.055</td>
<td align="center">0.044</td>
<td align="center">0.127</td>
<td align="center">0.097</td>
<td align="center">0.124</td>
<td align="center">0.075</td>
</tr>
<tr>
<td align="center">
<bold>ECR</bold>
</td>
<td align="center">
<bold>3DCNN &#x2b; Transformer &#x2b; FC</bold>
</td>
<td align="center">
<bold>0.048</bold>
</td>
<td align="center">
<bold>0.038</bold>
</td>
<td align="center">0.084</td>
<td align="center">0.161</td>
<td align="center">0.156</td>
<td align="center">
<bold>0.047</bold>
</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; CA &#x2b; FC</bold>
</td>
<td align="center">0.078</td>
<td align="center">0.072</td>
<td align="center">0.149</td>
<td align="center">0.128</td>
<td align="center">0.254</td>
<td align="center">0.118</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; SA &#x2b; FC</bold>
</td>
<td align="center">0.069</td>
<td align="center">0.147</td>
<td align="center">0.221</td>
<td align="center">0.277</td>
<td align="center">0.505</td>
<td align="center">0.125</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; FC</bold>
</td>
<td align="center">0.052</td>
<td align="center">0.040</td>
<td align="center">
<bold>0.082</bold>
</td>
<td align="center">
<bold>0.071</bold>
</td>
<td align="center">
<bold>0.074</bold>
</td>
<td align="center">0.055</td>
</tr>
<tr>
<td align="center">
<bold>Hallway</bold>
</td>
<td align="center">
<bold>3DCNN &#x2b; Transformer &#x2b; FC</bold>
</td>
<td align="center">
<bold>0.064</bold>
</td>
<td align="center">
<bold>0.072</bold>
</td>
<td align="center">
<bold>0.150</bold>
</td>
<td align="center">
<bold>0.122</bold>
</td>
<td align="center">0.225</td>
<td align="center">
<bold>0.098</bold>
</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; CA &#x2b; FC</bold>
</td>
<td align="center">0.080</td>
<td align="center">0.097</td>
<td align="center">0.153</td>
<td align="center">0.185</td>
<td align="center">
<bold>0.162</bold>
</td>
<td align="center">0.141</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; SA &#x2b; FC</bold>
</td>
<td align="center">0.103</td>
<td align="center">0.108</td>
<td align="center">0.308</td>
<td align="center">0.369</td>
<td align="center">0.551</td>
<td align="center">0.197</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; FC</bold>
</td>
<td align="center">0.088</td>
<td align="center">0.100</td>
<td align="center">0.164</td>
<td align="center">0.186</td>
<td align="center">0.163</td>
<td align="center">0.129</td>
</tr>
<tr>
<td align="center">
<bold>ARPG</bold>
</td>
<td align="center">
<bold>3DCNN &#x2b; Transformer &#x2b; FC</bold>
</td>
<td align="center">
<bold>0.054</bold>
</td>
<td align="center">
<bold>0.037</bold>
</td>
<td align="center">
<bold>0.128</bold>
</td>
<td align="center">
<bold>0.113</bold>
</td>
<td align="center">
<bold>0.119</bold>
</td>
<td align="center">
<bold>0.069</bold>
</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; CA &#x2b; FC</bold>
</td>
<td align="center">0.078</td>
<td align="center">0.072</td>
<td align="center">0.149</td>
<td align="center">0.128</td>
<td align="center">0.254</td>
<td align="center">0.118</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; SA &#x2b; FC</bold>
</td>
<td align="center">0.072</td>
<td align="center">0.075</td>
<td align="center">0.188</td>
<td align="center">0.198</td>
<td align="center">0.395</td>
<td align="center">0.121</td>
</tr>
<tr>
<td align="left"/>
<td align="center">
<bold>3DCNN &#x2b; FC</bold>
</td>
<td align="center">0.082</td>
<td align="center">0.087</td>
<td align="center">0.153</td>
<td align="center">0.123</td>
<td align="center">0.230</td>
<td align="center">0.112</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Generalization test (performance of models trained on indoor data and tested on out-of-distribution data: the outdoor night sequence (Aspen)). Smallest errors per velocity component are marked in bold.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center"/>
<th align="center">3DCNN &#x2b; transformer &#x2b; FC</th>
<th align="center">3DCNN &#x2b; CA &#x2b; FC</th>
<th align="center">3DCNN &#x2b; SA &#x2b; FC</th>
<th align="center">3DCNN &#x2b; FC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">
<bold>0.050</bold>
</td>
<td align="center">0.053</td>
<td align="center">0.052</td>
<td align="center">0.052</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">
<bold>0.037</bold>
</td>
<td align="center">0.052</td>
<td align="center">0.040</td>
<td align="center">0.041</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>Z</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">
<bold>0.11</bold>
</td>
<td align="center">0.17</td>
<td align="center">0.137</td>
<td align="center">0.140</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">0.134</td>
<td align="center">
<bold>0.130</bold>
</td>
<td align="center">0.190</td>
<td align="center">0.191</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">0.136</td>
<td align="center">
<bold>0.130</bold>
</td>
<td align="center">0.341</td>
<td align="center">0.350</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">
<bold>0.064</bold>
</td>
<td align="center">0.080</td>
<td align="center">0.10</td>
<td align="center">0.12</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The results of these experiments are presented and discussed in the following paragraphs.</p>
<sec id="s5-1">
<title>5.1 Trained and tested on mixed data</title>
<p>We took four sequences for evaluating the mixed (indoor &#x2b; outdoor) training. Those sequences were selected from the ARPG, ECR, Hallway, and outdoor environments. From each environment, we chose three sequences for training and one for testing. Using the models trained on the mixed dataset, we ran each model on all four test sequences and compared the predicted values for all six ego-velocity components with the calculated ground truth (see <xref ref-type="sec" rid="s4-4">Section 4.4</xref>). In <xref ref-type="table" rid="T2">Table 2</xref>, all four test sequences were collected by a person walking with the sensor rig. The sequence length for the ARPG, Hallway, and outdoor environments is 100&#x2013;120&#xa0;s; the ECR sequence length is 276&#xa0;s. On the outdoor test sequence, significant differences were observed in the performance of the models. From the results in <xref ref-type="table" rid="T2">Table 2</xref>, the model that uses transformer layers in the velocity regressor (3DCNN &#x2b; Transformer &#x2b; FC) yields lower RMSE than the other models. In the other two indoor sequences, Hallway and ARPG, we see similar performance. On the ECR sequence, where the indoor environment was different in structure and contained less training data, the transformer model (3DCNN &#x2b; Transformer &#x2b; FC) performed worse than the other models for the angular velocity components. We used boxplots for visualization, where the values are plotted as distributions of errors in each element of ego-velocity for all models from <xref ref-type="table" rid="T2">Table 2</xref>; <xref ref-type="fig" rid="F5">Figure 5</xref> shows linear ego-velocities, and <xref ref-type="fig" rid="F6">Figure 6</xref> shows angular ego-velocities. In these plots, the mean is shown as a green triangle, the median as a green line, and the lower and upper boundaries show the standard deviation for each ego-velocity element.
We observed that model C (3DCNN &#x2b; SA &#x2b; FC) clearly performs worse than the other models for some velocity components. The main differences occur for the components that do not vary significantly across the selected dataset&#x2014;for example, the linear velocities in the x and z dimensions (<italic>Vx</italic>, <italic>Vz</italic>) and the angular velocities in the x and y dimensions (<italic>&#x3c9;</italic>
<sub>
<italic>x</italic>
</sub>, <italic>&#x3c9;</italic>
<sub>
<italic>y</italic>
</sub>). Model C uses the simplest attention mechanism, a single self-attention layer, which can lead to larger errors, especially when the actual values of the ego-velocity components are small&#x2014;for example, <italic>&#x3c9;</italic>
<sub>
<italic>y</italic>
</sub>, <italic>&#x3c9;</italic>
<sub>
<italic>z</italic>
</sub> &#x3c; 0.5&#x2009;deg/s.</p>
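The per-component RMSE evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual code; it assumes predictions and ground truth are arrays of shape (N, 6) with the components ordered as <italic>Vx</italic>, <italic>Vy</italic>, <italic>Vz</italic>, <italic>&#x3c9;x</italic>, <italic>&#x3c9;y</italic>, <italic>&#x3c9;z</italic>, and the function and variable names are ours.

```python
import numpy as np

# Component order assumed for illustration: three linear, three angular velocities.
COMPONENTS = ["Vx", "Vy", "Vz", "wx", "wy", "wz"]

def per_component_rmse(pred, gt):
    """RMSE of each ego-velocity component over one test sequence.

    pred, gt: array-likes of shape (N, 6) -- N frames, 6 velocity components.
    Returns a dict mapping component name to its RMSE.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Mean squared error per column (component), then square root.
    rmse = np.sqrt(np.mean((pred - gt) ** 2, axis=0))
    return dict(zip(COMPONENTS, rmse))
```

Aggregating these per-sequence dicts over the four test sequences gives the error distributions visualized in the boxplots.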
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>RMSE error distribution for each linear velocity component (trained and tested on the mixed dataset). Models include <bold>(A)</bold> 3DCNN &#x2b; Transformer &#x2b; FC, <bold>(B)</bold> 3DCNN &#x2b; CA &#x2b; FC, <bold>(C)</bold> 3DCNN &#x2b; SA &#x2b; FC, and <bold>(D)</bold> 3DCNN &#x2b; FC. Each box accounts for errors from all four test sequences. The green line is the median, and the green triangle is the mean value.</p>
</caption>
<graphic xlink:href="frsip-03-1198205-g005.tif"/>
</fig>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>RMSE error distribution for each angular velocity component (trained and tested on the mixed dataset). Models include <bold>(A)</bold> 3DCNN &#x2b; Transformer &#x2b; FC, <bold>(B)</bold> 3DCNN &#x2b; CA &#x2b; FC, <bold>(C)</bold> 3DCNN &#x2b; SA &#x2b; FC, and <bold>(D)</bold> 3DCNN &#x2b; FC. Each box accounts for errors from all four test sequences. The green line is the median, and the green triangle is the mean value.</p>
</caption>
<graphic xlink:href="frsip-03-1198205-g006.tif"/>
</fig>
</sec>
<sec id="s5-2">
<title>5.2 Evaluation of static and dynamic parts of the sequence</title>
<p>In <xref ref-type="sec" rid="s5-1">Section 5.1</xref>, we evaluated the sequences over their whole length. However, in each data sequence, the sensor platform was stationary for a short period at the beginning and at the end. To test the accuracy of the models on the static and dynamic parts of the sequences, we evaluated them separately in two cases: 1) only the static part of the sequences, where the sensor platform was not moving, and 2) the part of the sequences where the sensor platform was moving. <xref ref-type="table" rid="T4">Tables 4</xref> and <xref ref-type="table" rid="T5">5</xref> present the results for the static and moving parts, respectively, reported as the mean and standard deviation of the RMSE over all four sequences. The predictions on the static parts contain smaller errors than those on the moving parts for all models. On the moving parts, the self-attention model (3DCNN &#x2b; SA &#x2b; FC) performed better than the other models; on the static parts, however, it predicts values with larger errors than the other models. This experiment helps explain the source of errors for the self-attention model in <xref ref-type="sec" rid="s5-1">Section 5.1</xref>: small velocity values with little variation in the training set.</p>
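The split into static and moving parts can be sketched by thresholding the ground-truth linear speed per frame. This is an illustrative sketch only: the threshold value and all names are our assumptions, not taken from the paper.

```python
import numpy as np

def split_static_moving(gt_velocity, speed_threshold=0.05):
    """Split frame indices of a sequence into static and moving parts.

    gt_velocity: array-like of shape (N, 6) -- ground-truth ego-velocity
    per frame, first three columns being the linear velocity (Vx, Vy, Vz).
    A frame counts as static when its linear speed ||(Vx, Vy, Vz)|| falls
    below the threshold (0.05 m/s here is illustrative, not from the paper).
    Returns (static_idx, moving_idx) as integer index arrays.
    """
    v = np.asarray(gt_velocity, dtype=float)
    speed = np.linalg.norm(v[:, :3], axis=1)  # per-frame linear speed
    static = np.flatnonzero(speed < speed_threshold)
    moving = np.flatnonzero(speed >= speed_threshold)
    return static, moving
```

Evaluating the models on each index set separately yields the two error tables for the stationary and moving portions.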
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Evaluation on only the static part of the sequences: performance of the models trained on indoor data and tested on mixed (indoor &#x2b; outdoor) test sequences. The mean and standard deviation of the RMSE errors are presented. The smallest errors per velocity component are marked in bold.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Values</th>
<th align="center">3DCNN &#x2b; transformer &#x2b; FC</th>
<th align="center">3DCNN &#x2b; CA &#x2b; FC</th>
<th align="center">3DCNN &#x2b; SA &#x2b; FC</th>
<th align="center">3DCNN &#x2b; FC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">
<bold>0.066, 0.0006</bold>
</td>
<td align="center">0.080, 0.0048</td>
<td align="center">0.108, 0.0083</td>
<td align="center">0.520, 0.0240</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">0.120, 0.0073</td>
<td align="center">0.092, 0.0093</td>
<td align="center">
<bold>0.066, 0.0032</bold>
</td>
<td align="center">0.352, 0.0012</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">0.190, 0.0199</td>
<td align="center">
<bold>0.137, 0.0167</bold>
</td>
<td align="center">0.310, 0.1061</td>
<td align="center">0.513, 0.0728</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">0.137, 0.0054</td>
<td align="center">
<bold>0.126, 0.0082</bold>
</td>
<td align="center">0.400, 0.2258</td>
<td align="center">0.351, 0.0071</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">0.230, 0.0355</td>
<td align="center">0.139, 0.0154</td>
<td align="center">0.597, 0.3760</td>
<td align="center">
<bold>0.122, 0.0064</bold>
</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">0.126, 0.0175</td>
<td align="center">
<bold>0.124, 0.0143</bold>
</td>
<td align="center">0.177, 0.0225</td>
<td align="center">0.5416, 0.0018</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Evaluation on only the moving part of the sequences: performance of the models trained on indoor data and tested on mixed (indoor &#x2b; outdoor) test sequences. The mean and standard deviation of the RMSE errors are presented. The smallest errors per velocity component are marked in bold.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Values</th>
<th align="center">3DCNN &#x2b; transformer &#x2b; FC</th>
<th align="center">3DCNN &#x2b; CA &#x2b; FC</th>
<th align="center">3DCNN &#x2b; SA &#x2b; FC</th>
<th align="center">3DCNN &#x2b; FC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">0.156, 0.0363</td>
<td align="center">0.140, 0.0217</td>
<td align="center">
<bold>0.076, 0.0024</bold>
</td>
<td align="center">0.436, 0.1332</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">0.135, 0.0234</td>
<td align="center">0.225, 0.0638</td>
<td align="center">
<bold>0.092, 0.0050</bold>
</td>
<td align="center">0.291, 0.0542</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub>
<bold>, m/s</bold>
</td>
<td align="center">0.301, 0.1338</td>
<td align="center">0.260, 0.0735</td>
<td align="center">
<bold>0.215, 0.0313</bold>
</td>
<td align="center">0.383, 0.0978</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">0.270, 0.0727</td>
<td align="center">0.323, 0.1221</td>
<td align="center">
<bold>0.194, 0.0371</bold>
</td>
<td align="center">0.334, 0.0201</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">
<bold>0.137, 0.0079</bold>
</td>
<td align="center">0.221, 0.0501</td>
<td align="center">0.309, 0.1072</td>
<td align="center">0.375, 0.1029</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub>
<bold>, deg/s</bold>
</td>
<td align="center">0.235, 0.0812</td>
<td align="center">0.296, 0.1077</td>
<td align="center">
<bold>0.166, 0.0154</bold>
</td>
<td align="center">0.469, 0.1451</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s5-3">
<title>5.3 Distribution shift</title>
<p>To evaluate the generalizability of the models, we tested two distribution shifts: a) training and testing in different environments while maintaining the same sensor platform speed and b) testing models trained on low-speed data with a high-speed sequence. We observed that changes in speed pose significant challenges to generalization, whereas a change of environment has only minimal impact.</p>
<sec id="s5-3-1">
<title>5.3.1 Trained on indoor data&#x2013;tested on outdoor data</title>
<p>In the first distribution shift test, we assessed the models&#x2019; ability to generalize to a different environment. In this case, we trained the models using indoor data and tested them using an outdoor sequence. <xref ref-type="table" rid="T3">Table 3</xref> presents the RMSE of all the models. The 3DCNN &#x2b; Transformer &#x2b; FC model demonstrated superior performance for all the velocity components except <italic>&#x3c9;</italic>
<sub>
<italic>x</italic>
</sub> and <italic>&#x3c9;</italic>
<sub>
<italic>y</italic>
</sub>, where 3DCNN &#x2b; CA &#x2b; FC performed slightly better. The errors differ little from those in the mixed evaluation, indicating that the models are not considerably affected by the change of environment and have learned features that transfer across environments. In future work, we intend to investigate the use of these learned features for additional tasks such as mapping.</p>
</sec>
<sec id="s5-3-2">
<title>5.3.2 Trained on mixed low speed&#x2013;tested on mixed high speed</title>
<p>In the second distribution shift test, the performance of the models trained on the mixed dataset was evaluated on a relatively high-speed test sequence. Note that all the mixed dataset sequences used in training were low-speed (recorded while walking), whereas the high-speed test sequence was collected by a person moving on an electric skateboard. <xref ref-type="table" rid="T6">Table 6</xref> presents the results. None of the trained models performed satisfactorily in this experiment, indicating that the models need retraining if they are to be used in scenarios with significantly different speeds or types of motion. Our model learns the ego-motion from the transformation between similar features extracted from a pair of heatmaps. In higher-speed sequences, the platform travels a greater distance between consecutive measurements, causing larger displacements between the features in the heatmap pairs. The dataset we used contains only one environment with a higher platform speed, which is insufficient for training a model. Self-supervised training schemes could be explored for better generalization across different environments and varying speed settings.</p>
<table-wrap id="T6" position="float">
<label>TABLE 6</label>
<caption>
<p>Generalization test: performance of the models trained on low-speed data and tested on out-of-distribution data, a high-speed sequence (Longboard). The smallest errors per velocity component are marked in bold.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Values</th>
<th align="center">3DCNN &#x2b; transformer &#x2b; FC</th>
<th align="center">3DCNN &#x2b; CA &#x2b; FC</th>
<th align="center">3DCNN &#x2b; SA &#x2b; FC</th>
<th align="center">3DCNN &#x2b; FC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub><bold>, m/s</bold>
</td>
<td align="center">
<bold>0.31</bold>
</td>
<td align="center">0.43</td>
<td align="center">0.35</td>
<td align="center">0.530</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub><bold>, m/s</bold>
</td>
<td align="center">
<bold>0.29</bold>
</td>
<td align="center">0.52</td>
<td align="center">
<bold>0.29</bold>
</td>
<td align="center">0.329</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>V</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub><bold>, m/s</bold>
</td>
<td align="center">
<bold>0.067</bold>
</td>
<td align="center">0.17</td>
<td align="center">0.094</td>
<td align="center">0.077</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>x</italic>
</bold>
</sub><bold>, deg/s</bold>
</td>
<td align="center">
<bold>0.197</bold>
</td>
<td align="center">0.257</td>
<td align="center">0.274</td>
<td align="center">0.255</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>y</italic>
</bold>
</sub><bold>, deg/s</bold>
</td>
<td align="center">
<bold>0.728</bold>
</td>
<td align="center">1.359</td>
<td align="center">1.247</td>
<td align="center">1.228</td>
</tr>
<tr>
<td align="center">
<bold>
<italic>&#x3c9;</italic>
</bold>
<sub>
<bold>
<italic>z</italic>
</bold>
</sub><bold>, deg/s</bold>
</td>
<td align="center">0.107</td>
<td align="center">0.112</td>
<td align="center">
<bold>0.105</bold>
</td>
<td align="center">0.110</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec sec-type="conclusion" id="s6">
<title>6 Conclusion</title>
<p>We presented an end-to-end ego-velocity estimation method for high-resolution radar data. We avoided the heavy processing of radar data into point clouds, which is computationally expensive and discards useful information. Our architecture consists of a 3DCNN based on FlowNet that captures motion-related features and an attention mechanism that selects significant features for regression. We tested three attention architectures and compared them with a variant without attention, as explained in <xref ref-type="sec" rid="s5">Section 5</xref>. We trained and evaluated the models on a subset of the publicly available ColoRadar dataset and studied the effect of distribution shift. Although performance does not degrade greatly when transferring models from indoor to outdoor environments, generalizability is rather poor in the varying-speed experiment. Our training and evaluation settings have shown that transformer encoder layers can improve the performance of end-to-end radar-based ego-motion estimation with deep neural networks, and performance could likely improve further with more training data. Future work will explore the applicability of this method to higher-level tasks such as mapping and SLAM.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s8">
<title>Author contributions</title>
<p>Idea and conceptualization: PR and RG; experimental design and evaluation: PR and NS; data visualization and result representation: PR and NS; implementation: PR; draft writing: PR; final manuscript: PR, NS, and RG; supervision: RG and NS; funding acquisition, resources, and project administration: RG. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s9">
<title>Funding</title>
<p>This project was supported by the European Union&#x2019;s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 858101 and by the Academy of Finland (project no. 336357, PROFI 6&#x2014;TAU Imaging Research Platform).</p>
</sec>
<sec sec-type="COI-statement" id="s10">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
<p>Author SR declared that they were an editorial board member of Frontiers at the time of submission. This had no impact on the peer review process and the final decision.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Adolfsson</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Magnusson</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Alhashimi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lilienthal</surname>
<given-names>A. J.</given-names>
</name>
<name>
<surname>Andreasson</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Cfear radarodometry - conservative filtering for efficient and accurate radar odometry</article-title>,&#x201d; in <conf-name>IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>. <pub-id pub-id-type="doi">10.1109/IROS51168.2021.9636253</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Almalioglu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Turan</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>C. X.</given-names>
</name>
<name>
<surname>Trigoni</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Markham</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Milli-rio: Ego-motion estimation with low-cost millimetre-wave radar</article-title>. <source>IEEE Sensors J.</source> <volume>21</volume>, <fpage>3314</fpage>&#x2013;<lpage>3323</lpage>. <pub-id pub-id-type="doi">10.1109/JSEN.2020.3023243</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Barnes</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Gadd</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Murcutt</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Newman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Posner</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>The oxford radar robotcar dataset: A radar extension to the oxford robotcar dataset</article-title>,&#x201d; in <conf-name>IEEE International Conference on Robotics and Automation (ICRA)</conf-name>.</citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Barnes</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Weston</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Posner</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2019</year>). <source>Masking by moving: Learning distraction-free radar odometry from pose information</source>. <publisher-name>ArXiv</publisher-name>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bechter</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Roos</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Waldschmidt</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Compensation of motion-induced phase errors in tdm mimo radars</article-title>. <source>IEEE Microw. Wirel. Components Lett.</source> <volume>27</volume>, <fpage>1164</fpage>&#x2013;<lpage>1166</lpage>. <pub-id pub-id-type="doi">10.1109/LMWC.2017.2751301</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Besl</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>McKay</surname>
<given-names>N. D.</given-names>
</name>
</person-group> (<year>1992</year>). &#x201c;<article-title>A method for registration of 3-d shapes</article-title>,&#x201d; in <conf-name>IEEE Transactions on Pattern Analysis and Machine Intelligence</conf-name>. <pub-id pub-id-type="doi">10.1109/34.121791</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Burnett</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Yoon</surname>
<given-names>D. J.</given-names>
</name>
<name>
<surname>Schoellig</surname>
<given-names>A. P.</given-names>
</name>
<name>
<surname>Barfoot</surname>
<given-names>T. D.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Radar odometry combining probabilistic estimation and unsupervised feature learning</article-title>,&#x201d; in <source>Robotics: Science and systems (RSS)</source>.</citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Burnett</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Yoon</surname>
<given-names>D. J.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>A. Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>Boreas: A multi-season autonomous driving dataset</source>. <comment>
<italic>arxiv preprint</italic> (2022)</comment>.</citation>
</ref>
<ref id="B9">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Cen</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Newman</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Precise ego-motion estimation with millimeter-wave radar under diverse and challenging conditions</article-title>,&#x201d; in <conf-name>IEEE International Conference on Robotics and Automation (ICRA)</conf-name>. <pub-id pub-id-type="doi">10.1109/ICRA.2018.8460687</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Clark</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Durrant-Whyte</surname>
<given-names>H. F.</given-names>
</name>
</person-group> (<year>1998</year>). &#x201c;<article-title>Autonomous land vehicle navigation using millimeter wave radar</article-title>,&#x201d; in <conf-name>IEEE International Conference on Robotics and Automation(ICRA)</conf-name>. <pub-id pub-id-type="doi">10.1109/ROBOT.1998.681411</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Costante</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Mancini</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Valigi</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ciarfuglia</surname>
<given-names>T. A.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Exploring representation learning with cnns for frame-to-frame ego-motion estimation</article-title>. <source>IEEE Robotics Automation Lett.</source> <volume>1</volume>, <fpage>18</fpage>&#x2013;<lpage>25</lpage>. <pub-id pub-id-type="doi">10.1109/LRA.2015.2505717</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Daniel</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Phippen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hoare</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Stove</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Cherniakov</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gashinova</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Low-thz radar, lidar and optical imaging through artificially generated fog</article-title>,&#x201d; in <conf-name>IET International Conference on Radar Systems</conf-name>.</citation>
</ref>
<ref id="B13">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Dickmann</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Klappstein</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hahn</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Appenrodt</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Bloecher</surname>
<given-names>H. L.</given-names>
</name>
<name>
<surname>Werber</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). &#x201c;<article-title>Automotive radar the key technology for autonomous driving: From detection and ranging to environmental understanding</article-title>,&#x201d; in <conf-name>IEEE Radar Conference</conf-name> (<publisher-name>RadarConf</publisher-name>). <pub-id pub-id-type="doi">10.1109/RADAR.2016.7485214</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Dosovitskiy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Beyer</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Kolesnikov</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Weissenborn</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Unterthiner</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). &#x201c;<article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name> (<publisher-name>ICLR</publisher-name>).</citation>
</ref>
<ref id="B15">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Engels</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Heidenreich</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Zoubir</surname>
<given-names>A. M.</given-names>
</name>
<name>
<surname>Jondral</surname>
<given-names>F. K.</given-names>
</name>
<name>
<surname>Wintermantel</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Advances in automotive radar: A framework on computationally efficient high-resolution frequency estimation</article-title>,&#x201d; in <conf-name>IEEE Signal Processing Magazine</conf-name>.</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Geiger</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lenz</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Stiller</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Urtasun</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Vision meets robotics: The kitti dataset</article-title>. <source>Int. J. Robotics Res. (IJRR)</source> <volume>32</volume>, <fpage>1231</fpage>&#x2013;<lpage>1237</lpage>. <pub-id pub-id-type="doi">10.1177/0278364913491297</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ghabcheloo</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Siddiqui</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Complete odometry estimation of a vehicle using single automotive radar and a gyroscope</article-title>,&#x201d; in <conf-name>IEEE Mediterranean Conference on Control and Automation (MED)</conf-name>.</citation>
</ref>
<ref id="B18">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Heller</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Petrov</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Yarovoy</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <source>A novel approach to vehicle pose estimation using automotive radar</source>. <comment>
<italic>arxiv preprint</italic>
</comment>. <pub-id pub-id-type="doi">10.48550/ARXIV.2107.09607</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Hess</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Kohler</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Rapp</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Andor</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Real-time loop closure in 2d lidar slam</article-title>,&#x201d; in <conf-name>IEEE International Conference on Robotics and Automation (ICRA)</conf-name>. <pub-id pub-id-type="doi">10.1109/ICRA.2016.7487258</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Holder</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hellwig</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Winner</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Real-time pose graph slam based on radar</article-title>,&#x201d; in <conf-name>IEEE Intelligent Vehicles Symposium</conf-name>. <pub-id pub-id-type="doi">10.1109/IVS.2019.8813841</pub-id>
<issue>IV</issue>
</citation>
</ref>
<ref id="B21">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ioffe</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Szegedy</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>,&#x201d; in <conf-name>International Conference on Machine Learning</conf-name> (<publisher-name>ICML</publisher-name>).</citation>
</ref>
<ref id="B22">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kellner</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Barjenbruch</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Klappstein</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dickmann</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dietmayer</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2013</year>). &#x201c;<article-title>Instantaneous ego-motion estimation using Doppler radar</article-title>,&#x201d; in <conf-name>IEEE International Conference on Intelligent Transportation Systems (ITSC)</conf-name>. <pub-id pub-id-type="doi">10.1109/ITSC.2013.6728341</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kellner</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Barjenbruch</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Klappstein</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dickmann</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dietmayer</surname>
<given-names>K. C. J.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Instantaneous ego-motion estimation using multiple Doppler radars</article-title>,&#x201d; in <conf-name>IEEE International Conference on Robotics and Automation (ICRA)</conf-name>.</citation>
</ref>
<ref id="B24">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>Y. S.</given-names>
</name>
<name>
<surname>Cho</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Jeong</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Mulran: Multimodal range dataset for urban place recognition</article-title>,&#x201d; in <conf-name>IEEE International Conference on Robotics and Automation (ICRA)</conf-name>.</citation>
</ref>
<ref id="B25">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kingma</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ba</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Adam: A method for stochastic optimization</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kramer</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Harlow</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Heckman</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Coloradar: The direct 3d millimeter wave radar dataset</article-title>. <source>Int. J. Robotics Res.</source> <volume>41</volume>, <fpage>351</fpage>&#x2013;<lpage>360</lpage>. <pub-id pub-id-type="doi">10.1177/02783649211068535</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Cai</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Saputra</surname>
<given-names>M. R. U.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>C. X.</given-names>
</name>
<name>
<surname>Markham</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>Odombeyondvision: An indoor multi-modal multi-platform odometry dataset beyond the visible spectrum</source>. <comment>
<italic>arXiv preprint</italic>
</comment>. <pub-id pub-id-type="doi">10.48550/ARXIV.2206.01589</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lu</surname>
<given-names>C. X.</given-names>
</name>
<name>
<surname>Saputra</surname>
<given-names>M. R. U.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Almalioglu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>de Gusmao</surname>
<given-names>P. P.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>milliego: single-chip mmwave radar aided egomotion estimation via deep sensor fusion</article-title>,&#x201d; in <conf-name>ACM Conference on Embedded Networked Sensor Systems (SenSys)</conf-name>.</citation>
</ref>
<ref id="B29">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lynch</surname>
<given-names>K. M.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>F. C.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Modern robotics: Mechanics, planning, and control</source> <edition>1st edn</edition>. <publisher-loc>USA</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Magnusson</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lilienthal</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Duckett</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Scan registration for autonomous mining vehicles using 3d-ndt</article-title>. <source>J. Field Robotics</source> <volume>24</volume>, <fpage>803</fpage>&#x2013;<lpage>827</lpage>. <pub-id pub-id-type="doi">10.1002/rob.20204</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Rohling</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>1983</year>). <article-title>Radar cfar thresholding in clutter and multiple target situations</article-title>. <source>IEEE Trans. Aerosp. Electron. Syst.</source> <volume>AES-19</volume>, <fpage>608</fpage>&#x2013;<lpage>621</lpage>. <pub-id pub-id-type="doi">10.1109/taes.1983.309350</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Shan</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Englot</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Lego-loam: Lightweight and ground-optimized lidar odometry and mapping on variable terrain</article-title>,&#x201d; in <conf-name>IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>.</citation>
</ref>
<ref id="B33">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Shan</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Englot</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Meyers</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ratti</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Rus</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Lio-sam: Tightly-coupled lidar inertial odometry via smoothing and mapping</article-title>,&#x201d; in <conf-name>IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>.</citation>
</ref>
<ref id="B34">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sheeny</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>De Pellegrin</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Mukherjee</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ahrabian</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wallace</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Radiate: A radar dataset for automotive perception</source>. <comment>
<italic>arXiv preprint arXiv:2010.09076</italic>
</comment>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Srivastava</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Krizhevsky</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sutskever</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Salakhutdinov</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>. <source>J. Mach. Learn. Res.</source> <volume>15</volume>, <fpage>1929</fpage>&#x2013;<lpage>1958</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Steiner</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hammouda</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Waldschmidt</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Ego-motion estimation using distributed single-channel radar sensors</article-title>,&#x201d; in <conf-name>IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM)</conf-name>. <pub-id pub-id-type="doi">10.1109/ICMIM.2018.8443509</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Swami</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Jain</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Goswami</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Chitnis</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Dubey</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Chaudhari</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>High performance automotive radar signal processing on ti&#x2019;s tda3x platform</article-title>,&#x201d; in <conf-name>IEEE Radar Conference (RadarConf)</conf-name>. <pub-id pub-id-type="doi">10.1109/RADAR.2017.7944409</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tran</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Bourdev</surname>
<given-names>L. D.</given-names>
</name>
<name>
<surname>Fergus</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Torresani</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Paluri</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2014</year>). <source>C3D: Generic features for video analysis</source>. <comment>
<italic>arXiv preprint</italic>
</comment>.</citation>
</ref>
<ref id="B39">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Vaswani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shazeer</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Parmar</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Uszkoreit</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Gomez</surname>
<given-names>A. N.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>Attention is all you need</article-title>,&#x201d; in <source>Advances in neural information processing systems</source> (<publisher-name>NeurIPS</publisher-name>).</citation>
</ref>
<ref id="B40">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Clark</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Trigoni</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks</article-title>,&#x201d; in <conf-name>IEEE International Conference on Robotics and Automation (ICRA)</conf-name>. <pub-id pub-id-type="doi">10.1109/ICRA.2017.7989236</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Woo</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kweon</surname>
<given-names>I. S.</given-names>
</name>
</person-group> (<year>2018</year>). <source>Cbam: Convolutional block attention module</source>. <comment>
<italic>arXiv preprint</italic>
</comment>.</citation>
</ref>
<ref id="B42">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2015</year>). <source>Empirical evaluation of rectified activations in convolutional network</source>. <comment>
<italic>arXiv preprint</italic>
</comment>.</citation>
</ref>
<ref id="B43">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>von Stumberg</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Cremers</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry</article-title>,&#x201d; in <conf-name>IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00136</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Loam: Lidar odometry and mapping in real-time</article-title>,&#x201d; in <conf-name>Robotics: Science and Systems Conference</conf-name> (<publisher-name>RSS</publisher-name>).</citation>
</ref>
<ref id="B45">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>A lidar odometry for outdoor mobile robots using ndt based scan matching in gps-denied environments</article-title>,&#x201d; in <conf-name>IEEE International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER)</conf-name>. <pub-id pub-id-type="doi">10.1109/CYBER.2017.8446588</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>