<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="review-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1212070</article-id>
<article-id pub-id-type="doi">10.3389/frobt.2024.1212070</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A survey on 3D object detection in real time for autonomous driving</article-title>
<alt-title alt-title-type="left-running-head">Contreras et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frobt.2024.1212070">10.3389/frobt.2024.1212070</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Contreras</surname>
<given-names>Marcelo</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2293642/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Jain</surname>
<given-names>Aayush</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2293423/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bhatt</surname>
<given-names>Neel P.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2293428/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Banerjee</surname>
<given-names>Arunava</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1858308/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Hashemi</surname>
<given-names>Ehsan</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1875443/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>University of Alberta</institution>, <addr-line>Edmonton</addr-line>, <addr-line>AB</addr-line>, <country>Canada</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Indian Institute of Technology Kharagpur</institution>, <addr-line>Kharagpur</addr-line>, <addr-line>West Bengal</addr-line>, <country>India</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1715004/overview">Shuo Cheng</ext-link>, The University of Tokyo, Japan</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2538090/overview">Weiguo Pan</ext-link>, Beijing Union University, China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1235387/overview">Ciprian Alecsandru</ext-link>, Concordia University, Canada</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Ehsan Hashemi, <email>ehashemi@ualberta.ca</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>06</day>
<month>03</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>11</volume>
<elocation-id>1212070</elocation-id>
<history>
<date date-type="received">
<day>25</day>
<month>04</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>19</day>
<month>02</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2024 Contreras, Jain, Bhatt, Banerjee and Hashemi.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Contreras, Jain, Bhatt, Banerjee and Hashemi</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>This survey reviews advances in 3D object detection approaches for autonomous driving. A brief introduction to 2D object detection is first given, and drawbacks of the existing methodologies are identified for highly dynamic environments. Subsequently, this paper reviews state-of-the-art 3D object detection techniques that utilize monocular and stereo vision for reliable detection in urban settings. Based on the depth inference basis, learning schemes, and internal representation, this work presents a method taxonomy of three classes: model-based and geometrically constrained approaches, end-to-end learning methodologies, and hybrid methods. A dedicated segment highlights the current trend of multi-view detectors as end-to-end methods due to their improved robustness. Detectors of the last two kinds were specifically selected to exploit the autonomous driving context in terms of geometry, scene content, and instance distribution. To assess the effectiveness of each method, 3D object detection datasets for autonomous vehicles are described along with their unique features, e.g., varying weather conditions, multi-modality, and multi-camera perspective, and their respective metrics associated with different difficulty categories. In addition, we include multi-modal visual datasets, i.e., V2X, that may tackle the problems of single-view occlusion. Finally, the current research trends in object detection are summarized, followed by a discussion on possible scope for future research in this domain.</p>
</abstract>
<kwd-group>
<kwd>3D object detection</kwd>
<kwd>autonomous navigation</kwd>
<kwd>visual navigation</kwd>
<kwd>robot perception</kwd>
<kwd>automated driving systems (ADS)</kwd>
<kwd>visual-aided decision</kwd>
</kwd-group>
<contract-sponsor id="cn001">Natural Sciences and Engineering Research Council of Canada<named-content content-type="fundref-id">10.13039/501100000038</named-content>
</contract-sponsor>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Robotic Control Systems</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Automated Driving Systems (ADS) and Advanced Driver-Assistance Systems (ADAS) with robust controls are primarily deployed with the intention to reduce human errors in perception and decision-making while enhancing traffic flow and transportation safety in emergency cases and hand-over scenarios (<xref ref-type="bibr" rid="B4">Bengler et al., 2014</xref>; <xref ref-type="bibr" rid="B113">Wang et al., 2017</xref>; <xref ref-type="bibr" rid="B99">Schwarting et al., 2018</xref>; <xref ref-type="bibr" rid="B18">Chen et al., 2019</xref>). To this end, ADS represent a significant enhancement in quality of life by reducing pollutant emissions through efficiency in construction and driving, travel conformity, and increased productivity that relies on mobility and, consequently, powers regional economies (<xref ref-type="bibr" rid="B31">Greenblatt and Shaheen, 2015</xref>; <xref ref-type="bibr" rid="B121">Williams et al., 2020</xref>; <xref ref-type="bibr" rid="B83">Silva et al., 2022</xref>). On the other hand, they have also raised substantial social concerns about policy-making, industry standards, equality of access for underrepresented social groups (e.g., by country, gender, or income), insurance costs, and workforce reduction with unequal access to the education needed to adapt to newer positions (<xref ref-type="bibr" rid="B7">Bissell et al., 2018</xref>; <xref ref-type="bibr" rid="B100">Shahedi et al., 2023</xref>). Still, social studies on this matter have covered only a narrow slice of the whole problem; researchers have called for a more holistic view, since attention has been largely limited to the first two topics (<xref ref-type="bibr" rid="B7">Bissell et al., 2018</xref>). Besides, information on the light and noise pollution of ADS is sparse, and current emissions reports may change under denser traffic flow after the introduction of AVs (<xref ref-type="bibr" rid="B83">Silva et al., 2022</xref>). 
We encourage a more profound analysis of the implications of broadly adopting AVs as the primary transport means or inside a hybrid scheme with non-autonomous vehicles. One of the most significant worries about this technology is its security and reliability (<xref ref-type="bibr" rid="B84">Othman, 2021</xref>).</p>
<p>ADS can only function safely and effectively if they have access to reliable perception and increased environmental awareness (<xref ref-type="bibr" rid="B39">Hu et al., 2019</xref>; <xref ref-type="bibr" rid="B134">Zhang et al., 2022a</xref>; <xref ref-type="bibr" rid="B35">Hashemi et al., 2022</xref>). In this regard, ADS and their control systems utilize multi-modal sensory data (from stereo or monocular cameras, light detection and ranging (LiDAR), radars, and global navigation satellite systems, GNSS) to 1) obtain semantic information about their surroundings for motion planning, 2) identify various static and dynamic objects on the road, pedestrians, etc., 3) estimate their states (e.g., position, heading, and velocity), and 4) predict trajectories of these objects for safety-critical scenarios (<xref ref-type="bibr" rid="B42">Ji et al., 2018</xref>; <xref ref-type="bibr" rid="B77">Marzbani et al., 2019</xref>; <xref ref-type="bibr" rid="B79">Mohammadbagher et al., 2020</xref>; <xref ref-type="bibr" rid="B5">Bhatt et al., 2022</xref>; <xref ref-type="bibr" rid="B6">Bhatt et al., 2023</xref>). An unreliable identification of street objects and road signs may lead to catastrophic outcomes; thus, the object detection task is of fundamental importance for safe operation, decision making, and controls in autonomous driving (<xref ref-type="bibr" rid="B13">Chen et al., 2021</xref>; <xref ref-type="bibr" rid="B34">Gupta et al., 2021</xref>). 
One of the primary reasons behind the failure of object detection in perceptually degraded conditions (e.g., extreme lighting and weather conditions such as snow, hail, and ice storms) and adversarial ones is the limitation of sensory data, which necessitates multi-modal data fusion (<xref ref-type="bibr" rid="B78">Michaelis et al., 2019</xref>; <xref ref-type="bibr" rid="B37">Hnewa and Radha, 2021</xref>). Moreover, inconsistency in the layout of motorways presents additional complexity for reliable identification of spatial constraints for motion planning in highly dynamic environments; for instance, vehicles in urban areas parked in an arbitrary orientation hinder vehicles from following well-defined driving lanes. Lastly, there always remains a high possibility of occlusion, where objects block each other&#x2019;s view, resulting in either partial or complete concealment of the objects. Despite these challenges due to perceptually-degraded conditions and highly dynamic environments, there has been substantial progress in camera-based object detection and state estimation approaches to enhance perception and situational awareness in autonomous driving (<xref ref-type="bibr" rid="B94">Ranft and Stiller, 2016</xref>; <xref ref-type="bibr" rid="B1">Arnold et al., 2019</xref>; <xref ref-type="bibr" rid="B20">Cui et al., 2019</xref>; <xref ref-type="bibr" rid="B34">Gupta et al., 2021</xref>). In this regard, visual-based 2D or 3D object detection methodologies in the literature fall into 3 main categories: learning-based, geometrical or model-based, and hybrid approaches. Geometrically constrained models broadly exploit common scenery in AVs, e.g., scale inference from the road distance to the camera or triangulation between multiple vehicle detections. The hybrid methods aim to fuse the progress made by end-to-end detectors, which are not necessarily designed for AD applications, with the unique features of the geometrically constrained models. 
Significant research has been conducted with a focus on visual-based 2D object detection (<xref ref-type="bibr" rid="B28">Geiger et al., 2013</xref>; <xref ref-type="bibr" rid="B81">Mukhtar et al., 2015</xref>; <xref ref-type="bibr" rid="B86">Pendleton et al., 2017</xref>; <xref ref-type="bibr" rid="B49">Ku et al., 2018</xref>) for autonomous navigation. For object detection in the case of AVs, a conventional pipeline consists of <italic>segmentation</italic> (such as via voxel clustering (<xref ref-type="bibr" rid="B2">Azim and Aycard, 2014</xref>) and graph-segmentation methods (<xref ref-type="bibr" rid="B112">Wang et al., 2012</xref>)), <italic>feature extraction</italic> using probabilistic feature-based voxels, and <italic>classification</italic> based on various state-of-the-art classifiers, such as YOLOv7 (<xref ref-type="bibr" rid="B111">Wang et al., 2022</xref>), EfficientDet (<xref ref-type="bibr" rid="B108">Tan et al., 2020</xref>), and Swin ViT. Traditional approaches have optimized each of these stages individually (<xref ref-type="bibr" rid="B80">Mousavian et al., 2017</xref>; <xref ref-type="bibr" rid="B26">G&#xe4;hlert et al., 2020</xref>), while recent end-to-end learning frameworks, which derive a region of interest (ROI) for feature extraction, tend to optimize the whole pipeline (<xref ref-type="bibr" rid="B53">Li et al., 2019</xref>; <xref ref-type="bibr" rid="B65">Liu et al., 2019</xref>; <xref ref-type="bibr" rid="B72">Liu et al., 2020</xref>).</p>
<p>To this end, this work presents an overview of visual-based object detection methods in the context of autonomous driving (AD). The performance of state-of-the-art detection models proposed in the literature is evaluated using popular datasets and well-established metrics. This is followed by a thorough review of monocular and stereo camera-based object detection methods. Finally, research gaps and possible directions for future research are identified. The rest of the paper is organized as follows: In <xref ref-type="sec" rid="s2">Section 2</xref>, 2D object detection and its challenges are elaborated on; this is followed by a detailed discussion of the recent progress in 3D object detection in <xref ref-type="sec" rid="s4">Section 4</xref>. Finally, the future trends and the concluding remarks are summarized in <xref ref-type="sec" rid="s5">Section 5</xref> and <xref ref-type="sec" rid="s6">Section 6</xref>, respectively.</p>
</sec>
<sec id="s2">
<title>2 Two-dimensional object detection</title>
<p>Object detection initially started as a classification and instance localization problem in 2D images for automated driving systems, which are equipped with multi-modal sensory measurement units (as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>). Early detection models employed handcrafted features such as Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), or Oriented FAST and Rotated BRIEF (ORB) and passed them through a linear classifier, e.g., a Support Vector Machine (SVM). However, with the advent of deep learning methodologies, better solutions were explored that exploit spatial and semantic information under several variations, including scale, translation, and rotation. These algorithms fall into 2 main categories: two-stage detectors and one-stage detectors. For a broader explanation of several deep learning concepts used in the detection context, the reader is referred to (<xref ref-type="bibr" rid="B44">Jiao et al., 2019</xref>; <xref ref-type="bibr" rid="B136">Zhao et al., 2019</xref>).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>A hybrid electric vehicle at the NODE lab equipped with multi-modal sensors and data fusion systems for perception, motion planning, autonomous navigation, and controls in perceptually-degraded conditions.</p>
</caption>
<graphic xlink:href="frobt-11-1212070-g001.tif"/>
</fig>
<p>
<bold>Two-stage detectors:</bold> Two-stage detectors are composed of a Region Proposal Network (RPN) and a Region of Interest Pooling (RoI-Pool) layer (<xref ref-type="bibr" rid="B23">Du et al., 2020</xref>) and have demonstrated high accuracy on well-known datasets, such as MS COCO (<xref ref-type="bibr" rid="B63">Lin et al., 2014</xref>) and PASCAL VOC (<xref ref-type="bibr" rid="B38">Hoiem et al., 2009</xref>). These detectors are also capable of performing enhanced detail extraction even in small-size regions (<xref ref-type="bibr" rid="B41">Jana and Mohanta, 2022</xref>). Two-stage detectors work by first sending the images to a Convolutional Neural Network (CNN) backbone that extracts a feature map, similar to how human vision focuses on local salient details of images; then, the RPN slides a window over the map to obtain fixed feature regions. Finally, the RoI-Pool layer samples the regions, reduces their dimensionality without a considerable loss of features, and sends them to an SVM for classification and to a regressor for bounding box coordinate prediction (<xref ref-type="bibr" rid="B57">Li et al., 2017</xref>; <xref ref-type="bibr" rid="B126">Xie et al., 2021</xref>). The pioneering work of this approach, R-CNN (<xref ref-type="bibr" rid="B30">Girshick et al., 2014</xref>), employed a selective search algorithm for region proposal, but it extracted the features from proposed image regions and not from proposed feature maps. Its enhanced version, fast R-CNN (<xref ref-type="bibr" rid="B29">Girshick, 2015</xref>), followed the aforementioned order, since applying the region proposals on the feature maps and later classifying them is faster thanks to compression. Later models, such as faster R-CNN (<xref ref-type="bibr" rid="B97">Ren et al., 2016</xref>) or mask R-CNN (<xref ref-type="bibr" rid="B36">He et al., 2017</xref>), aimed to improve the feature extraction section or the classification head in terms of accuracy, context generalization, and timing. 
More recent methods have extended the original R-CNN framework to include components from Vision Transformers (ViT), as these currently tend to hold top positions on the MS COCO benchmark. For instance, <xref ref-type="bibr" rid="B61">Liang et al. (2022a)</xref> proposed an improved sparse R-CNN that exploits the sparsity in the region proposals and feeds that information to attention units, which process the whole image at once and therefore focus on relevant global visual details rather than local ones as convolutional layers do. They demonstrated the benefits of the method on traffic sign detection. In a different manner, <xref ref-type="bibr" rid="B54">Li et al. (2022)</xref> replaced the ResNet50 backbone with a transformer version of EfficientNet known as EfficientFormer, which could account for a &#x2b;3.9 AP score improvement while alleviating intensive hardware usage. Some object detection applications that aim for robust prediction in real traffic scenarios took another path rather than extending backbones or adding slight changes to R-CNN. <xref ref-type="bibr" rid="B24">Du et al. (2022)</xref> introduced an unknown-aware hierarchical object detection method that incorporates <italic>a priori</italic> knowledge to distinguish between known classes and unknown classes that could be part of a higher taxonomy; for example, a bicycle is a two-wheeled vehicle, and vehicle itself is a class. Lastly, a hybrid detector was presented by <xref ref-type="bibr" rid="B47">Khan et al. (2023)</xref>, in which the YOLO detection head was fed with the results of the RoI pooling stage from a faster R-CNN, thus eliminating the ROI proposals and reducing the computation overhead considerably while improving on faster R-CNN results.</p>
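<p>The RoI pooling step described above can be illustrated with a minimal sketch. It assumes a single-channel feature map stored as a nested Python list and a region given in feature-map cells that is at least as large as the output grid; the function name roi_max_pool and its parameters are hypothetical and are not taken from any of the cited detectors, which operate on multi-channel tensors.</p>
<preformat>
```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of interest down to a fixed out_h x out_w grid.

    feature_map: list of rows (single channel, illustrative assumption).
    roi: (x0, y0, x1, y1) in feature-map cells, x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = roi
    region = [row[x0:x1] for row in feature_map[y0:y1]]
    height, width = len(region), len(region[0])
    # Integer bin edges that split the region into roughly equal cells.
    row_edges = [(i * height) // out_h for i in range(out_h + 1)]
    col_edges = [(j * width) // out_w for j in range(out_w + 1)]
    pooled = []
    for i in range(out_h):
        pooled_row = []
        for j in range(out_w):
            cell = [value
                    for row in region[row_edges[i]:row_edges[i + 1]]
                    for value in row[col_edges[j]:col_edges[j + 1]]]
            pooled_row.append(max(cell))  # keep the strongest response per cell
        pooled.append(pooled_row)
    return pooled


# Toy 6 x 6 feature map whose value at (row, col) is row * 6 + col.
toy_map = [[r * 6 + c for c in range(6)] for r in range(6)]
square_roi = roi_max_pool(toy_map, (1, 1, 5, 5))  # 4 x 4 region
wide_roi = roi_max_pool(toy_map, (0, 0, 6, 3))    # 3 x 6 region
```
</preformat>
<p>In this sketch, regions of different sizes are reduced to the same fixed grid, which is what allows a single classification and regression head to consume every proposal.</p>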
<p>
<bold>One-stage detectors:</bold> Two-stage detectors provide enhanced object detection; however, they slow down the overall detection process considerably (<xref ref-type="bibr" rid="B11">Carranza-Garc&#xed;a et al., 2020</xref>). Thus, an alternative approach, which achieves reasonably good detection with enhanced computational efficiency for real-time applications and safety-critical motion planning and decision making tasks, needed to be explored specifically for autonomous driving. As a viable alternative, one-stage detectors have emerged, which introduce a single CNN trained end-to-end (all layers are trained together) to predict the bounding box class and coordinates. One of the pioneers of this strategy was You Only Look Once (YOLO) (<xref ref-type="bibr" rid="B96">Redmon et al., 2016</xref>), which divides the image into a grid and obtains the bounding box from each cell through regression. In the initial phase, the bounding boxes are anchor boxes with predefined sizes that are used to tile the whole image. Depending on the class probability and Intersection over Union (IoU) scores with respect to the ground truth annotation, the regressor refines the anchor boxes through manipulation of their center offsets. Although YOLO paved the way for numerous novel approaches in the domain of one-stage detectors, its major shortcoming is lower localization accuracy for small objects in comparison to two-stage detectors. Thus, YOLO was further modified to reduce the accuracy gap between the two-stage and one-stage methods. The latest version, YOLOv8 (<xref ref-type="bibr" rid="B46">Jocher et al., 2023</xref>), has a pyramidal feature backbone for multi-scale detection and does not use anchor boxes but directly predicts the center of the bounding box. The method also applies mosaic augmentation to improve the training performance. 
Other relevant one-stage detectors that have been used in practice include DCNv2 (<xref ref-type="bibr" rid="B115">Wang et al., 2020</xref>), which uses deformable convolutions (DCN) to adapt to geometric variations not captured by the fixed square kernel common in convolutions, and RetinaNet (<xref ref-type="bibr" rid="B115">Wang et al., 2020</xref>), which proposed a robust loss function to address false negatives due to the imbalance in the dataset between background and labeled classes. <xref ref-type="bibr" rid="B75">Lyu et al. (2022)</xref> introduced RTMDet as a revision of the YOLO detector with extensive modifications to the architecture to account for real-time inference, e.g., replacing convolutions with large-kernel depth-wise convolutions followed by point-wise ones. The model achieves an increase in both speed and mAP score. ViT detectors have also gained prominence among one-stage detectors, especially as these networks usually do not include RoI proposals and instead produce an accurate end-to-end prediction, though they tend to be slower than one-stage convolutional approaches. SwinViT (<xref ref-type="bibr" rid="B71">Liu et al., 2021a</xref>) is capable of extracting features at various scales due to a shifted window scheme that simultaneously limits the self-attention computation, which otherwise grows quadratically with the number of image tokens. <xref ref-type="bibr" rid="B22">Ding et al. (2022)</xref> enhanced SwinViT by introducing dual attention units that process spatial and channel tokens for a better understanding of local and global context, respectively. Regarding autonomous driving applications, <xref ref-type="bibr" rid="B62">Liang et al. (2022b)</xref> proposed incorporating attention units in a new lightweight backbone called GhostNet to considerably reduce model size and increase inference speed. They tested the method along with several augmentation techniques aimed at more robust traffic sign detection under changing light conditions. 
In a similar manner, DetecFormer (<xref ref-type="bibr" rid="B60">Liang et al., 2022c</xref>) was introduced by fusing local and global information in a global context encoder with the same purpose of traffic scene detection.</p>
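<p>As a rough illustration of the anchor tiling and IoU scoring discussed above for one-stage detectors, the following sketch assumes axis-aligned 2D boxes written as (x_min, y_min, x_max, y_max) and a single anchor size per grid cell. The helper names iou, tile_anchors, and shift_center are hypothetical, and the hand-chosen center offset only stands in for what a trained regressor would predict; this is not the implementation of YOLO or of any cited detector.</p>
<preformat>
```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def tile_anchors(image_w, image_h, grid, anchor_w, anchor_h):
    """Place one anchor of predefined size at the center of every grid cell."""
    cell_w, cell_h = image_w / grid, image_h / grid
    anchors = []
    for row in range(grid):
        for col in range(grid):
            cx, cy = (col + 0.5) * cell_w, (row + 0.5) * cell_h
            anchors.append((cx - anchor_w / 2, cy - anchor_h / 2,
                            cx + anchor_w / 2, cy + anchor_h / 2))
    return anchors


def shift_center(box, dx, dy):
    """Refine a box by moving its center; width and height stay unchanged."""
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


ground_truth = (30.0, 30.0, 70.0, 70.0)
anchors = tile_anchors(100, 100, grid=2, anchor_w=40, anchor_h=40)
iou_before = iou(anchors[0], ground_truth)
# Offset chosen by hand here; a detector would regress it from image features.
iou_after = iou(shift_center(anchors[0], 25.0, 25.0), ground_truth)
```
</preformat>
<p>In this toy setting, the first anchor overlaps the ground truth only slightly before refinement and coincides with it after its center is shifted, which mirrors how offset regression raises the IoU of an initial anchor.</p>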
</sec>
<sec id="s3">
<title>3 Datasets and evaluation metrics</title>
<p>With the advent of autonomous navigation in highly dynamic urban environments and under various weather conditions for ADS and connected autonomous driving, the intelligent transportation and machine vision research communities have continuously cultivated large datasets in the context of object detection for autonomous vehicles. The rapid takeoff of these datasets has been a major factor for the emergence of deep learning methods. This section summarizes 8 publicly accessible datasets for 3D object detection: KITTI (<xref ref-type="bibr" rid="B28">Geiger et al., 2013</xref>), nuScenes (<xref ref-type="bibr" rid="B10">Caesar et al., 2020</xref>), Waymo Open (<xref ref-type="bibr" rid="B107">Sun et al., 2020</xref>), Canadian Adverse Driving Conditions (CADC) (<xref ref-type="bibr" rid="B89">Pitropov et al., 2020</xref>), Boreas (<xref ref-type="bibr" rid="B9">Burnett et al., 2023</xref>), DAIR-V2X (<xref ref-type="bibr" rid="B132">Yu et al., 2022</xref>), A9 (<xref ref-type="bibr" rid="B142">Zimmer et al., 2023</xref>), ROPE 3D (<xref ref-type="bibr" rid="B130">Ye et al., 2022</xref>) datasets.</p>
<p>
<bold>Scene coverage:</bold> The KITTI dataset provides 50 scenes captured in Karlsruhe, Germany, across 8 classes, of which only vehicles, pedestrians, and cyclists are taken into account for online evaluation. The height of the 2D bounding boxes, the level of occlusion, and the degree of truncation are the factors that determine the 3 difficulty categories, namely, easy, moderate, and hard. While nuScenes captures 1<italic>k</italic> sequences from Boston and Singapore across 23 classes, only 10 classes are considered for evaluation. In addition, Waymo Open contains 1,150 sequences with 4 classes captured in Phoenix and San Francisco and, similar to KITTI, there are 3 testing categories. As a pioneer, KITTI has had a significant impact in establishing the standard for data collection, protocol, and benchmarking. The nuScenes and Waymo Open datasets both collect data throughout the day in a variety of weather and illumination conditions. Frequently, class imbalance is a problem that affects real data collection. As reported in (<xref ref-type="bibr" rid="B91">Qian et al., 2022</xref>), for the nuScenes and KITTI datasets, 50% of the categories account for 6.9% of the annotations, indicating a long-tail distribution.</p>
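<p>The way the three factors above combine into a difficulty category can be sketched as follows. The numeric thresholds are an assumption taken from the publicly documented KITTI object benchmark (minimum box height in pixels, maximum occlusion level with 0 meaning fully visible, and maximum truncation fraction) and are not stated in this survey; the function name kitti_difficulty is hypothetical.</p>
<preformat>
```python
# (name, min box height in px, max occlusion level, max truncation fraction)
# Assumed values, taken from the public KITTI benchmark description.
KITTI_LEVELS = [
    ("easy", 40, 0, 0.15),
    ("moderate", 25, 1, 0.30),
    ("hard", 25, 2, 0.50),
]


def kitti_difficulty(box_height_px, occlusion_level, truncation):
    """Return the easiest category whose limits the object satisfies, else None."""
    for name, min_height, max_occlusion, max_truncation in KITTI_LEVELS:
        if (box_height_px >= min_height
                and max_occlusion >= occlusion_level
                and max_truncation >= truncation):
            return name
    return None  # object is ignored at every difficulty level


large_visible = kitti_difficulty(60, 0, 0.05)
small_visible = kitti_difficulty(30, 0, 0.05)
small_occluded = kitti_difficulty(30, 2, 0.40)
tiny_object = kitti_difficulty(20, 0, 0.00)
```
</preformat>
<p>Under these assumed thresholds, an object that is too small, too occluded, or too truncated for the hard category is excluded from evaluation altogether.</p>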
<p>Both CADC and Boreas are highly focused on adverse driving conditions and cover a wider spectrum of harsh weather conditions in comparison to the aforementioned datasets. The CADC dataset tracks 12.94 km of driving along 75 driving scenarios over 3 days in the Waterloo region of Canada during March 2018 and February 2019. A key difference in the CADC dataset is that its images were captured in numerous winter weather conditions and specific perceptually-degraded circumstances. Each sequence was recorded under a different snowfall level (e.g., light, medium, and extreme). The driving sequences were recorded with 8 cameras, 4 facing forward and 4 backward. The Boreas dataset involves 350 km of driving over 44 sequences in Toronto, Canada between 2020 and 2021, recorded on 2 repeated routes. The recording conditions range from day to night and from snow to rain, and include cloudy scenes. The dataset has a diversity of seasonal variations over an extended period of time to further generalize the learning process. Thus, it seeks to provide opportunities for robust 3D detection under long-term weather changes for the same scenarios in highly dynamic urban driving conditions. We make a special distinction for V2X (vehicle-to-everything) or V2I (vehicle-to-infrastructure) datasets, which provide multiple road views, e.g., a front car view and an elevated view, because visual multi-modality tackles occlusion and enhances detection robustness. One of the pioneers of this kind is DAIR-V2X, which covers 10 km of city roads, 10 km of highway, 28 intersections, and 38 km<sup>2</sup> of driving regions with diverse weather and lighting variations from a vehicle camera view and a pole-mounted elevated view. Each scene was recorded for 20 s to capture dense traffic flow and the driving of the experiment car through a unique intersection. 
In contrast, A9 provides higher scene complexity as it covers various driving maneuvers, such as left and right turns, overtaking, and U-turns, in different road locations throughout Munich, Germany, though only from the infrastructure view. ROPE 3D went even further, achieving broader generalization by collecting data under different weather conditions, illumination, and traffic density. In addition, its authors took care to obtain a spread distribution over annotations and depth across coarse categories.</p>
<p>
<bold>Dataset size:</bold> The KITTI dataset is one of the most popular datasets for use in autonomous driving and navigation. It contains 200<italic>k</italic> boxes, which are manually annotated in 15<italic>k</italic> frames. Of these, 7,481 samples are used for training and 7,518 for testing. The training data has been further split into 3,712 samples for training and 3,769 for validation. In addition, the 1.4<italic>M</italic> labelled boxes in the nuScenes dataset are from 40<italic>k</italic> frames with 28,130, 6,019, and 6,008 frames used for training, validation, and testing, respectively. Moreover, 112<italic>M</italic> boxes are annotated in the Waymo Open dataset from 200<italic>k</italic> frames with 122,200, 30,407, and 40,077 for training, validation, and testing, respectively. These datasets do not release annotations for the test split, which is instead evaluated internally. The CADC dataset consists of 7,000 instances expanded to 56,000 images from 8 cameras. Its 308,079 labels have been divided into 10 classes, including cars (281,941 labels), trucks (20,411 labels), buses (4,867 labels), bicycles (785 labels), and horses and buggies (75 labels). Each class has a set of attributes that gives further semantic details, for instance, transit bus for the bus class. The Boreas dataset consists of 37 training scenes and 16 test scenes that together contain 326,180 3D box annotations. The dataset contains vehicles, cyclists, pedestrians, and miscellaneous objects (transit buses, trucks, streetcars, and trains). DAIR-V2X contains 71,254 camera frames, with 40% and 60% from infrastructure and vehicle, respectively, and covers 10 classes including pedestrians, cars, buses, and cyclists. The A9 dataset collected 4.8 k images with 57.4 k manually labeled 3D boxes purely from the infrastructure view. Like other datasets, it covers 10 classes for the classification task. Lastly, ROPE 3D holds 50 k images with 1.5 M 3D annotations, a large increase over the last two datasets. 
It also has a slight increase in classification difficulty, as it includes 13 classes with labels such as <italic>unknown-unmovable</italic>, <italic>unknown-movable</italic>, or <italic>traffic cone</italic>.</p>
<p>
<bold>Evaluation metrics:</bold> Similar to 2D object detection, 3D object detection methods employ Average Precision (AP) as the standard benchmark. The standard AP metric is first established, followed by the AP variants that have been adopted. Consider a series of predictions <inline-formula id="inf1">
<mml:math id="m1">
<mml:mfenced open="{" close="}">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:math>
</inline-formula> that are listed in decreasing order of confidence score <italic>s</italic>
<sub>
<italic>i</italic>
</sub>. A prediction <bold>y</bold>
<sub>
<italic>i</italic>
</sub> (bounding box and class) is regarded as a true positive if the ratio between the intersection and the union of the areas covered by the predicted bounding box <italic>B</italic> and its corresponding ground truth, known as the Intersection over Union (IoU), exceeds a predetermined threshold; otherwise, it is regarded as a false positive. The AP score is defined as the area under the precision-recall curve, which typically follows a zigzag pattern. Since the exact area under this curve is difficult to compute, the interpolated <inline-formula id="inf2">
<mml:math id="m2">
<mml:msub>
<mml:mrow>
<mml:mfenced open="" close="|">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> was introduced by PASCAL VOC (<xref ref-type="bibr" rid="B25">Everingham et al., 2010</xref>) as a numerical approximation. It is formulated as the mean precision calculated for <italic>N</italic> levels from a recall subset <italic>R</italic>, given as<disp-formula id="e1">
<mml:math id="m3">
<mml:msub>
<mml:mrow>
<mml:mfenced open="" close="|">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:munder>
</mml:mstyle>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mtext>interpolate&#x2009;</mml:mtext>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(1)</label>
</disp-formula>where <italic>r</italic> takes values from an evenly spaced set of <italic>N</italic> recall levels, e.g., [0, 0.1, 0.2, &#x2026;, 1] for <italic>N</italic> &#x3d; 11. The interpolated precision at recall <italic>r</italic> is the maximum precision over all recall values <italic>r</italic>&#x2032; greater than or equal to <italic>r</italic>, so the interpolated P-R curve is monotonically decreasing. It should be mentioned that the CADC dataset has not published its 3D detection benchmark yet.</p>
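<p>The interpolated AP of Eq. 1 can be illustrated with a minimal sketch. The function name <monospace>interpolated_ap</monospace>, the tolerance <monospace>eps</monospace>, and the operating points below are hypothetical and chosen only for illustration; official benchmark toolkits may differ in details such as how predictions are matched to ground truth.</p>
<preformat>
```python
def interpolated_ap(recalls, precisions, recall_levels, eps=1e-9):
    """Mean interpolated precision over the given recall levels (Eq. 1)."""
    total = 0.0
    for r in recall_levels:
        # Interpolated precision at r: the highest precision observed at any
        # recall r' >= r, or 0 if the detector never reaches recall r.
        reachable = [p for rec, p in zip(recalls, precisions) if rec >= r - eps]
        total += max(reachable, default=0.0)
    return total / len(recall_levels)


# Hypothetical operating points, one per prediction in decreasing confidence.
recalls = [0.2, 0.4, 0.4, 0.6, 0.8]
precisions = [1.0, 1.0, 0.67, 0.75, 0.8]

# Eleven evenly spaced recall levels [0, 0.1, ..., 1].
recall_levels = [i / 10 for i in range(11)]
ap11 = interpolated_ap(recalls, precisions, recall_levels)
print(round(ap11, 4))
```
</preformat>
<p>In this example, the five lowest recall levels take an interpolated precision of 1.0, the next four take 0.8, and the two levels above the highest reached recall contribute 0, so the result is 8.2/11.</p>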
<sec id="s3-1">
<title>3.1 KITTI benchmark</title>
<p>The metric used for detection benchmarking in KITTI is the interpolated AP<sub>11</sub> metric. The scores considered in the leaderboard ranking are the test AP for bird&#x2019;s-eye view (BEV) detection and 3D detection. The evaluation of cars, pedestrians, and cyclists uses different IoU thresholds in the AP calculation: the passenger vehicle class uses 0.7 and the others 0.5 because of the occlusion frequency of each class. The number of recall levels was later changed from 11 levels [0, 1/10, 2/10, &#x2026; , 1] to 40 levels [1/40, 2/40, 3/40, &#x2026; , 1], with recall level 0 removed, as proposed by (<xref ref-type="bibr" rid="B102">Simonelli et al., 2019b</xref>).</p>
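<p>The effect of removing recall level 0 can be seen with a small sketch. The helper <monospace>interpolated_ap</monospace> and the degenerate detector below are hypothetical; the example only assumes the 11-level and 40-level recall sets described above.</p>
<preformat>
```python
def interpolated_ap(recalls, precisions, recall_levels, eps=1e-9):
    """Mean of the best precision reachable at or beyond each recall level."""
    best = [
        max((p for rec, p in zip(recalls, precisions) if rec >= r - eps), default=0.0)
        for r in recall_levels
    ]
    return sum(best) / len(recall_levels)


R11 = [i / 10 for i in range(11)]      # [0, 1/10, ..., 1]
R40 = [i / 40 for i in range(1, 41)]   # [1/40, 2/40, ..., 1]

# Hypothetical degenerate detector: a single correct detection among
# 1000 ground-truth objects (recall 0.001, precision 1.0).
recalls, precisions = [0.001], [1.0]

ap11 = interpolated_ap(recalls, precisions, R11)  # level 0 alone contributes 1/11
ap40 = interpolated_ap(recalls, precisions, R40)  # no level is reached
print(round(ap11, 4), ap40)
```
</preformat>
<p>With 11 levels, this near-useless detector still scores 1/11 because recall level 0 is always reached, whereas the 40-level set assigns it a score of 0.</p>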
</sec>
<sec id="s3-2">
<title>3.2 nuScenes benchmark</title>
<p>The official evaluation statistic for nuScenes is the nuScenes Detection Score (NDS), which combines the mAP with a set of mean true positive errors in translation, size, orientation, attribute, and velocity, and is given by:<disp-formula id="e2">
<mml:math id="m4">
<mml:mtext>NDS</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mn>5</mml:mn>
<mml:mtext>mAP</mml:mtext>
<mml:mo>&#x2b;</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mtext>mTP</mml:mtext>
<mml:mo>&#x2208;</mml:mo>
<mml:mtext>TP</mml:mtext>
</mml:mrow>
</mml:munder>
</mml:mstyle>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>min</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mtext>mTP</mml:mtext>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(2)</label>
</disp-formula>where mAP denotes the mean Average Precision and TP is the set of the 5 mean true positive metrics calculated for each class. The mAP is calculated over <italic>C</italic> classes and <italic>D</italic> distance thresholds of [0.5, 1, 2, 4] meters. When obtaining the AP, and before computing the means, any operating point with precision or recall below 10% is discarded.</p>
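<p>Eq. 2 can be evaluated directly once the mAP and the five mean true positive errors are known. The sketch below uses a hypothetical function name <monospace>nds</monospace> and made-up input values; it only reproduces the arithmetic of the formula and is not the official evaluation code.</p>
<preformat>
```python
def nds(mean_ap, tp_errors):
    """nuScenes Detection Score following Eq. 2.

    mean_ap: mAP in [0, 1].
    tp_errors: mapping from error name to its mean true positive error.
    """
    if len(tp_errors) != 5:
        raise ValueError("Eq. 2 expects exactly five mean true positive errors")
    # Each error is clipped to 1 and converted into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical values for illustration only.
errors = {
    "translation": 0.30,
    "size": 0.25,
    "orientation": 0.40,
    "attribute": 0.20,
    "velocity": 1.20,  # errors above 1 contribute a score of 0
}
score = nds(0.45, errors)
print(round(score, 3))
```
</preformat>
<p>Here the clipped scores sum to 2.85 and the weighted mAP term is 2.25, giving an NDS of 0.51. The mAP carries half of the total weight, and each error metric carries one tenth.</p>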
</sec>
<sec id="s3-3">
<title>3.3 Boreas benchmark</title>
<p>This dataset follows the KITTI evaluation scheme. In the mAP calculation, a detection counts as a true positive at a 70% overlap threshold for passenger vehicles and at 50% for pedestrians.</p>
</sec>
<sec id="s3-4">
<title>3.4 Waymo benchmark</title>
<p>The Waymo Open dataset proposes a heading-weighted version of AP called APH, which incorporates a heading function with respect to recall, analogous to the area under the Precision/Recall (PR) curve. Each true positive is weighted by a heading accuracy computed from <inline-formula id="inf3">
<mml:math id="m5">
<mml:mi>min</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfenced open="|" close="|">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mn>2</mml:mn>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mfenced open="|" close="|">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>/</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
</mml:math>
</inline-formula>. Here, <italic>&#x3b8;</italic> and <italic>&#x3b8;</italic>&#x2a; denote the predicted azimuth angle and the corresponding ground truth, both within [&#x2212;<italic>&#x3c0;</italic>, <italic>&#x3c0;</italic>]. Similar to AP, APH is normalized to the range [0, 1]. To compute precision and recall, Hungarian matching is performed on the predictions whose IoU exceeds a specified threshold: 0.7 for vehicles and 0.5 for pedestrians. If, in the APH calculation, the recall gap between consecutive operating points is above the default value of 0.05, more operating points are added to avoid over-estimation.</p>
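<p>The heading term above can be computed as in the following sketch. The function names are hypothetical. As written, the expression is 0 for a perfect heading and 1 for a heading that is off by &#x3c0;, so weighting a true positive by one minus this value is an assumption of this sketch and not a statement about the official Waymo implementation.</p>
<preformat>
```python
import math


def normalized_heading_error(theta, theta_gt):
    """min(|theta - theta*|, 2*pi - |theta - theta*|) / pi for angles in [-pi, pi]."""
    diff = abs(theta - theta_gt)
    # Take the shorter way around the circle, then normalize to [0, 1].
    return min(diff, 2.0 * math.pi - diff) / math.pi


def heading_weight(theta, theta_gt):
    """Assumed true positive weight: 1 for a perfect heading, 0 when reversed."""
    return 1.0 - normalized_heading_error(theta, theta_gt)


# Two headings close to +pi and -pi are nearly identical directions.
wrap_error = normalized_heading_error(3.0, -3.0)
print(round(wrap_error, 4), heading_weight(0.0, math.pi))
```
</preformat>
<p>The wrap-around case shows why the minimum is needed: the raw difference between 3.0 and &#x2212;3.0 rad is 6.0 rad, but the two headings are less than 0.3 rad apart.</p>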
</sec>
<sec id="s3-5">
<title>3.5 DAIR-V2X benchmark</title>
<p>It employs the same AP score as (<xref ref-type="bibr" rid="B25">Everingham et al., 2010</xref>) and introduces the transmission cost, defined as the average number of bytes sent between infrastructure and vehicle. In V2X detection it is common to have two separate detectors that interact with each other at different stages, through either early or late fusion. Since the two data stations are physically separated, it is highly relevant to minimize the data transmitted between them. Nonetheless, there is a trade-off between efficient transmission and lost information, so it is valuable to compare the AP score and the transmission cost together across different detectors.</p>
</sec>
<sec id="s3-6">
<title>3.6 A9 benchmark</title>
<p>The A9 dataset does not provide an official metric for the 3D detection task.</p>
</sec>
<sec id="s3-7">
<title>3.7 ROPE 3D benchmark</title>
<p>It not only adopts the AP<sub>40</sub> variant from (<xref ref-type="bibr" rid="B103">Simonelli et al., 2019c</xref>), but also computes similarity scores for ground center (ACS), orientation (AOS), ground occupancy area (AAS), and four ground point distance (AGS), and fuses them. With S &#x3d; (ACS &#x2b; AOS &#x2b; AAS &#x2b; AGS)/4, the fused score ROPE<sub>
<italic>score</italic>
</sub> is equal to:<disp-formula id="equ1">
<mml:math id="m6">
<mml:msub>
<mml:mrow>
<mml:mtext>ROPE</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">score</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi>A</mml:mi>
<mml:mi>P</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3c9;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:math>
</disp-formula>where the weights <italic>&#x3c9;</italic>
<sub>1</sub> &#x3d; 8 and <italic>&#x3c9;</italic>
<sub>2</sub> &#x3d; 2, thus giving higher importance to the AP detection metric.</p>
<p>Please refer to (<xref ref-type="bibr" rid="B130">Ye et al., 2022</xref>) for further details on these similarity scores.</p>
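<p>The fused score is a weighted average and can be computed as in the sketch below. The function name <monospace>rope_score</monospace> and the input values are hypothetical; the default weights are the ones stated above.</p>
<preformat>
```python
def rope_score(ap, acs, aos, aas, ags, w1=8.0, w2=2.0):
    """Weighted fusion of AP and the mean similarity S = (ACS + AOS + AAS + AGS) / 4."""
    s = (acs + aos + aas + ags) / 4.0
    return (w1 * ap + w2 * s) / (w1 + w2)


# Hypothetical scores in [0, 1], for illustration only.
fused = rope_score(ap=0.5, acs=0.9, aos=0.8, aas=0.7, ags=0.6)
print(round(fused, 3))
```
</preformat>
<p>With these inputs S is 0.75 and the fused score is (8 &#xd7; 0.5 &#x2b; 2 &#xd7; 0.75)/10 &#x3d; 0.55, so a change in AP moves the result four times as much as the same change in S.</p>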
</sec>
</sec>
<sec id="s4">
<title>4 Three-dimensional object detection</title>
<p>The 3D object detection problem has an added level of complexity compared to 2D detection, since it localizes objects with respect to the camera and identifies their orientation/heading by fitting 3D bounding boxes. The detector input can be monocular or stereo data, and each kind of detector leverages its input in different ways. An overview of 3D object detection methodologies is shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, which groups them into three main categories: model based methods (which leverage geometrical constraints applicable in the autonomous driving context), end-to-end learning, and hybrid methods.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>The structure of existing 3D object detection methodologies (having the same input of monocular or stereo images and output of the 3D detection header): <bold>(A)</bold> Methods using geometrical constraints use ROI features from backbone output or combine them with 2D bounding boxes to fit constraints on loss function or space projection. <bold>(B)</bold> End-to-end learning methods update all layer parameters using backpropagation. This method is categorized depending on utilization of an ROI or feature pyramid network regression with an optimal 2D detection. <bold>(C)</bold> Hybrid methods combine depth estimation from a standalone pretrained network and a change of representation to leverage detailed features for 3D detection. The 3D backbone can be from existing methods for LiDAR, BEV or Voxel points.</p>
</caption>
<graphic xlink:href="frobt-11-1212070-g002.tif"/>
</fig>
<p>Additionally, two classification diagrams are provided in <xref ref-type="fig" rid="F3">Figures 3</xref>, <xref ref-type="fig" rid="F4">4</xref> for monocular and stereo vision, respectively.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Taxonomy of monocular 3D object detection frameworks: <italic>i</italic>) Geometric methods consider spatial relationships between several objects and perspective consistency; <italic>ii</italic>) End-to-end learning frameworks are categorized based on their utilization of internal features; and <italic>iii</italic>) Hybrid methods are classified by 3D representation and its augmentation with other techniques such as segmentation or 2D detection.</p>
</caption>
<graphic xlink:href="frobt-11-1212070-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Taxonomy of stereo 3D object detection approaches. Non-geometrical methods are widely used for stereo vision based 3D object detection, since previously trained depth estimators or end-to-end depth cost volumes achieve better results than geometric methods applied to stereo cameras. For the remaining categories, the inner classification remains the same as for monocular 3D object detection frameworks.</p>
</caption>
<graphic xlink:href="frobt-11-1212070-g004.tif"/>
</fig>
<sec id="s4-1">
<title>4.1 Model based approaches</title>
<p>This subsection discusses model based approaches, which mainly combine monocular 3D detectors with geometric constraints in street view to estimate depth accurately or to regress boxes directly. Stereo detectors mostly rely on end-to-end learned depth networks or more complex 3D representations instead of leveraging geometric information. The constraints are formulated explicitly in a custom layer or loss function, e.g., epipolar or projection model constraints. They can also be integrated in the form of geometric projections or transformations. To feed the geometric formulation stage, the detector backbone passes its features from a set of ROIs or fuses those with 2D object detection predictions, as shown in the first block of <xref ref-type="fig" rid="F2">Figure 2</xref>. Its output is delivered to a detection header for the 3D box prediction. The numerical results regarding computational speed and AP for different classes and categories are reported in <xref ref-type="table" rid="T1">Table 1</xref> for a detailed comparison.</p>
<p>In (<xref ref-type="bibr" rid="B98">Roddick et al., 2018</xref>), the authors proposed OFT-Net (Orthographic feature transformation), which is a network that projects multi-scale features from a ResNet-18 backbone into a 3D orthographic space using the camera&#x2019;s intrinsic parameters. The new projection gives a better representation of the 3D space than the pinhole projection since it is robust to appearance and scale distortions due to poor depth inference. The rest of the pipeline follows the standard procedure of classification, bounding box regression, and NMS. The authors of (<xref ref-type="bibr" rid="B82">Naiden et al., 2019</xref>) extended the Faster R-CNN head to predict 3D bounding box dimensions and angles in addition to the 2D detection task. A least-squares problem is formulated from 3D geometric constraints. 
The system of equations considers projection matrix constraints and enforces the 3D box edges to fit inside the 2D box sides. A closed-form solution to this problem is given in (<xref ref-type="bibr" rid="B82">Naiden et al., 2019</xref>) where the box translation vectors are determined. The 3D parameters and 2D initial estimation are then passed to a ShiftNet network for refinement through a newly proposed Volume Displacement Loss (VDL) that aims to find the translation which optimizes the IoU between two 3D predicted boxes while fixing depth and angle. A 3D anchor preprocessing scheme and a custom layer called the Ground-Aware Convolution (GAC) module are proposed in (<xref ref-type="bibr" rid="B69">Liu et al., 2021b</xref>) to infer depth via perspective projection from the camera to the ground, since the ground level is a metric reference and is known <italic>a priori</italic> from the car dimensions. Their premise is that ground-awareness gives enough cues for depth estimation, which is then leveraged for 3D detection. The preprocessing stage back-projects the anchors to 3D space and rejects those far from ground level. Afterward, the filtered anchors feed the GAC module, which estimates vertical offsets from the ground and then computes depth priors with perspective geometry. Its output is a depth feature map included in a 3D regression head. In (<xref ref-type="bibr" rid="B59">Lian et al., 2022a</xref>) the effect of four geometric manipulation augmentations on 3D detection training was explored. The authors studied the inaccuracy of depth estimation under different object positions and sizes. To make 3D detectors more robust to these geometrical distortions, they proposed random cropping, random scaling, moving camera, and copy-paste augmentations. The first two are widely used in the 2D detection community. The third technique utilizes depth to translate the instance pixels to the augmented image. 
The copy-paste augmentation samples an object, takes it out of its context, and pastes it into another region, while ensuring geometrical consistency for the object in terms of position and angle with respect to the background scene. Including these augmentations in the training stage enhanced the accuracy of previous detectors. Further, the use of local perspective distortion of 3D objects to infer depth and global yaw angle without camera parameters has been explored in (<xref ref-type="bibr" rid="B140">Zhu et al., 2023</xref>). The authors introduced the concepts of keyedges and keyedges-ratio to parameterize the vertical edges of the 3D bounding box locally. The keyedges-ratio is used directly to regress depth and yaw. From each box, they obtain 4 keyedges and subsequently 4 depth predictions, which they fuse using an uncertainty-based operation. The local-perspective results are then merged with the global perspective effects used in other monocular 3D detectors such as MonoFlex (<xref ref-type="bibr" rid="B32">Gu et al., 2022</xref>). Hence, both image perspectives enrich the extracted visual content for detection.</p>
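<p>The perspective relation between the camera and the ground that underlies such depth priors can be sketched for the simplest case. The snippet assumes a pinhole camera mounted at a known height with its optical axis parallel to a flat road; the function name <monospace>ground_depth</monospace> and all numeric values are illustrative assumptions, and the code is not the GAC module itself.</p>
<preformat>
```python
def ground_depth(v, fy, cy, cam_height):
    """Depth (in meters) of a ground point imaged at pixel row v.

    Assumes a flat road and a level pinhole camera: a ground point lies
    cam_height below the optical center, so v = fy * cam_height / z + cy.
    Returns None for rows at or above the horizon (v not below cy).
    """
    dv = v - cy
    if not dv > 0:
        return None
    return fy * cam_height / dv


# Illustrative intrinsics and mounting height (hypothetical values).
fy, cy, cam_height = 720.0, 180.0, 1.5

near = ground_depth(300.0, fy, cy, cam_height)  # row far below the horizon
far = ground_depth(240.0, fy, cy, cam_height)   # row closer to the horizon
print(near, far)
```
</preformat>
<p>Rows closer to the horizon map to larger depths, which is why the contact point of an object with the ground gives a strong cue for its distance under these assumptions.</p>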
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Comparison of geometrically constrained model based methods. Best results are highlighted in <bold>bold</bold> font. The AP scores for the car category were calculated with an IoU (Intersection over Union) threshold of 70%, as required for submission to the official KITTI evaluation.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="3" align="center">Method</th>
<th rowspan="3" align="center">Source</th>
<th rowspan="3" align="center">FPS</th>
<th rowspan="3" align="center">Camera input</th>
<th colspan="3" align="center">KITTI dataset validation set (AP<sub>3<italic>D</italic>
</sub>/AP<sub>
<italic>BEV</italic>
</sub>)</th>
</tr>
<tr>
<th colspan="3" align="center">Cars</th>
</tr>
<tr>
<th align="center">Easy</th>
<th align="center">Moderate</th>
<th align="center">Hard</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">OFT-Net</td>
<td align="center">
<xref ref-type="bibr" rid="B98">Roddick et al. (2018)</xref>
</td>
<td align="center">-</td>
<td align="center">Mono</td>
<td align="center">4.07/11.06</td>
<td align="center">3.27/8.79</td>
<td align="center">3.29/8.91</td>
</tr>
<tr>
<td align="center">GS3D</td>
<td align="center">
<xref ref-type="bibr" rid="B50">Li et al. (2019b)</xref>
</td>
<td align="center">0.23</td>
<td align="center">Mono</td>
<td align="center">13.46/-</td>
<td align="center">10.97/-</td>
<td align="center">10.38/-</td>
</tr>
<tr>
<td align="center">MonoPair</td>
<td align="center">
<xref ref-type="bibr" rid="B17">Chen et al. (2020a)</xref>
</td>
<td align="center">17.54</td>
<td align="center">Mono</td>
<td align="center">16.28/24.12</td>
<td align="center">12.30/18.17</td>
<td align="center">10.42/15.76</td>
</tr>
<tr>
<td align="center">ShiftNet</td>
<td align="center">
<xref ref-type="bibr" rid="B82">Naiden et al. (2019)</xref>
</td>
<td align="center">3.86</td>
<td align="center">Mono</td>
<td align="center">13.84/18.61</td>
<td align="center">11.29/14.71</td>
<td align="center">11.08/13.57</td>
</tr>
<tr>
<td align="center">VisualDet3D</td>
<td align="center">
<xref ref-type="bibr" rid="B69">Liu et al. (2021b)</xref>
</td>
<td align="center">20</td>
<td align="center">Mono</td>
<td align="center">23.63/-</td>
<td align="center">16.16/-</td>
<td align="center">12.06/-</td>
</tr>
<tr>
<td align="center">MonoFlex</td>
<td align="center">
<xref ref-type="bibr" rid="B32">Gu et al. (2022)</xref>
</td>
<td align="center">33.33</td>
<td align="center">Mono</td>
<td align="center">21.75/29.60</td>
<td align="center">14.94/20.68</td>
<td align="center">13.07/17.81</td>
</tr>
<tr>
<td align="center">CenterNet &#x2b; GeoAug</td>
<td align="center">
<xref ref-type="bibr" rid="B59">Lian et al. (2022a)</xref>
</td>
<td align="center">33.3</td>
<td align="center">Mono</td>
<td align="center">24.53/-</td>
<td align="center">17.23/-</td>
<td align="center">14.32/-</td>
</tr>
<tr>
<td align="center">MoNet3D</td>
<td align="center">
<xref ref-type="bibr" rid="B137">Zhou et al. (2020)</xref>
</td>
<td align="center">27.85</td>
<td align="center">Mono</td>
<td align="center">22.73/27.48</td>
<td align="center">16.73/21.80</td>
<td align="center">15.55/17.86</td>
</tr>
<tr>
<td align="center">MonoGround</td>
<td align="center">
<xref ref-type="bibr" rid="B92">Qin and Li (2022)</xref>
</td>
<td align="center">33.3</td>
<td align="center">Mono</td>
<td align="center">25.24/32.68</td>
<td align="center">18.69/24.79</td>
<td align="center">15.58/20.56</td>
</tr>
<tr>
<td align="center">MonoEdge</td>
<td align="center">
<xref ref-type="bibr" rid="B140">Zhu et al. (2023a)</xref>
</td>
<td align="center">-</td>
<td align="center">Mono</td>
<td align="center">
<bold>25.66/33.71</bold>
</td>
<td align="center">
<bold>18.89/25.35</bold>
</td>
<td align="center">
<bold>16.10/22.18</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4-2">
<title>4.2 End-to-end learning based methods</title>
<p>End-to-end learning refers to the capability of updating all the parameters in a network with a single loss function, such that backpropagation takes place from the head back to the network backbone. As a consequence, the learned representations of depth, geometry, or 3D space overall get embedded in all network layers, which reduces computation time and enhances prediction accuracy. A comparison between the state-of-the-art end-to-end learning methods is presented in <xref ref-type="table" rid="T2">Table 2</xref>. The work in (<xref ref-type="bibr" rid="B3">Bao et al., 2020</xref>) falls under a category where an end-to-end trainable monocular detector is designed to work effectively even without learning dense depth maps. The working principle behind this method involves projecting the grid coordinates from the 2D box to 3D space, followed by developing an object-aware voting model. Such voting models use appearance attention and the distribution of geometric projections to find proposals for the 3D centroid, thereby facilitating object localization. Another end-to-end approach in (<xref ref-type="bibr" rid="B72">Liu et al., 2020</xref>) predicts the 3D bounding boxes by combining single key-point estimates and regressed 3D variables. The advantage of this method is that it also uses a multi-step disentangling approach, resulting in improved training convergence and detection accuracy. In (<xref ref-type="bibr" rid="B138">Zhou et al., 2021</xref>), the camera pose is captured to propose a detector free from extrinsic perturbation. This framework is capable of predicting the extrinsic parameters of the camera by detecting changes in the horizon as well as through the use of the vanishing point. Further, a converter is designed to enable the 3D detector to work independently of any extrinsic parameter variations.</p>
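<p>Combining an image key-point with a regressed depth to obtain a 3D location can be illustrated with the standard pinhole back-projection. This is a minimal sketch under assumed intrinsics; the function names and values are hypothetical, and the code is not the implementation of any of the cited detectors.</p>
<preformat>
```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a given depth to camera coordinates (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def project(x, y, z, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates back to pixel coordinates."""
    return (fx * x / z + cx, fy * y / z + cy)


# Illustrative intrinsics, key-point, and regressed depth (hypothetical values).
fx, fy, cx, cy = 800.0, 800.0, 640.0, 180.0
center = backproject(720.0, 200.0, 20.0, fx, fy, cx, cy)
pixel = project(center[0], center[1], center[2], fx, fy, cx, cy)
print(center, pixel)
```
</preformat>
<p>The round trip through <monospace>project</monospace> recovers the original key-point, which is a quick consistency check when wiring a regressed depth into a 3D box center.</p>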
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>End-to-end learning methods comparison. Best results are highlighted in <bold>bold</bold> font. The superscript <sup>&#x22c6;</sup> marks test set scores (reported where only those scores were available). The AP scores for the car category were calculated with an IoU (Intersection over Union) threshold of 70%, and those for the pedestrian and cyclist categories with 50%, as required for submission to the official KITTI evaluation.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="3" align="center">Method</th>
<th rowspan="3" align="center">Source</th>
<th rowspan="3" align="center">FPS</th>
<th rowspan="3" align="center">Camera input</th>
<th colspan="9" align="center">KITTI dataset validation set (AP<sub>3<italic>D</italic>
</sub>/AP<sub>
<italic>BEV</italic>
</sub>)</th>
</tr>
<tr>
<th colspan="3" align="center">Cars</th>
<th colspan="3" align="center">Pedestrians</th>
<th colspan="3" align="center">Cyclists</th>
</tr>
<tr>
<th align="center">Easy</th>
<th align="center">Moderate</th>
<th align="center">Hard</th>
<th align="center">Easy</th>
<th align="center">Moderate</th>
<th align="center">Hard</th>
<th align="center">Easy</th>
<th align="center">Moderate</th>
<th align="center">Hard</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">FQNet</td>
<td align="center">
<xref ref-type="bibr" rid="B65">Liu et al. (2019)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">5.98/-</td>
<td align="center">5.50/-</td>
<td align="center">4.75/-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">MLF-3D</td>
<td align="center">
<xref ref-type="bibr" rid="B128">Xu and Chen (2018)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">10.53/-</td>
<td align="center">5.69/-</td>
<td align="center">5.39/-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">DST3D</td>
<td align="center">
<xref ref-type="bibr" rid="B123">Wu et al. (2022)</xref>
</td>
<td align="center">12.5</td>
<td align="center">Monocular</td>
<td align="center">13.46/17.33</td>
<td align="center">11.28/14.83</td>
<td align="center">11.06/14.18</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">OACV</td>
<td align="center">
<xref ref-type="bibr" rid="B3">Bao et al. (2020)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">13.65/20.65</td>
<td align="center">11.47/16.35</td>
<td align="center">10.70/14.21</td>
<td align="center">11.5/13.10</td>
<td align="center">10.93/12.33</td>
<td align="center">10.04/11.70</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">SMOKE</td>
<td align="center">
<xref ref-type="bibr" rid="B72">Liu et al. (2020)</xref>
</td>
<td align="center">33.33</td>
<td align="center">Monocular</td>
<td align="center">14.76/19.99</td>
<td align="center">12.85/15.61</td>
<td align="center">11.50/15.28</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">MonoEF</td>
<td align="center">
<xref ref-type="bibr" rid="B138">Zhou et al. (2021)</xref>
</td>
<td align="center">33.3</td>
<td align="center">Monocular</td>
<td align="center">21.29/29.03<sup>&#x22c6;</sup>
</td>
<td align="center">13.87/19.7<sup>&#x22c6;</sup>
</td>
<td align="center">11.71/17.26</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">MonoRUn</td>
<td align="center">
<xref ref-type="bibr" rid="B12">Chen et al. (2021b)</xref>
</td>
<td align="center">14.29</td>
<td align="center">Monocular</td>
<td align="center">20.02/-</td>
<td align="center">14.65/-</td>
<td align="center">12.61/-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">MonoDIS</td>
<td align="center"/>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">18.05/24.26</td>
<td align="center">14.98/18.43</td>
<td align="center">13.42/16.95</td>
<td align="center">10.79/11.04</td>
<td align="center">10.39/10.94</td>
<td align="center">9.22/10.59</td>
<td align="center">5.27/5.52</td>
<td align="center">4.55/4.66</td>
<td align="center">4.55/4.55</td>
</tr>
<tr>
<td align="center">TLNet</td>
<td align="center">
<xref ref-type="bibr" rid="B93">Qin et al. (2019)</xref>
</td>
<td align="center">-</td>
<td align="center">Stereo</td>
<td align="center">18.15/29.22</td>
<td align="center">14.26/21.88</td>
<td align="center">13.72/18.83</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">OPA-3D</td>
<td align="center">
<xref ref-type="bibr" rid="B105">Su et al. (2023)</xref>
</td>
<td align="center">25</td>
<td align="center">Monocular</td>
<td align="center">24.97/33.80</td>
<td align="center">19.40/25.51</td>
<td align="center">16.59/22.13</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">MonoDETR &#x2b; ADD</td>
<td align="center">
<xref ref-type="bibr" rid="B133">Zhang et al. (2023)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">25.30/34.14</td>
<td align="center">16.64/23.49</td>
<td align="center">14.90/21.24</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">M3D-RPN</td>
<td align="center">
<xref ref-type="bibr" rid="B8">Brazil and Liu (2019)</xref>
</td>
<td align="center">6.21</td>
<td align="center">Monocular</td>
<td align="center">20.27/25.94</td>
<td align="center">17.06/21.18</td>
<td align="center">15.21/17.90</td>
<td align="center">-</td>
<td align="center">11.28/11.60</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">10.01/10.13</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">MonoDTR</td>
<td align="center">
<xref ref-type="bibr" rid="B40">Huang et al. (2022)</xref>
</td>
<td align="center">27</td>
<td align="center">Monocular</td>
<td align="center">24.52/33.33</td>
<td align="center">18.57/25.35</td>
<td align="center">15.51/21.68</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">M3DSSD</td>
<td align="center">
<xref ref-type="bibr" rid="B74">Luo et al. (2021)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">26.95/44.42</td>
<td align="center">18.68/29.69</td>
<td align="center">15.82/24.60</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">MonoCon</td>
<td align="center"/>
<td align="center">38.7</td>
<td align="center">Monocular</td>
<td align="center">26.33/34.65</td>
<td align="center">19.01/25.39</td>
<td align="center">15.98/21.93</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">D4LCN</td>
<td align="center">
<xref ref-type="bibr" rid="B21">Ding et al. (2020)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">26.97/-</td>
<td align="center">21.71/-</td>
<td align="center">18.22/-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">DD3D</td>
<td align="center">
<xref ref-type="bibr" rid="B85">Park et al. (2021)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">23.22/30.98<sup>&#x22c6;</sup>
</td>
<td align="center">16.34/22.56<sup>&#x22c6;</sup>
</td>
<td align="center">14.20/20.03<sup>&#x22c6;</sup>
</td>
<td align="center">13.91/15.90<sup>&#x22c6;</sup>
</td>
<td align="center">9.30/10.85<sup>&#x22c6;</sup>
</td>
<td align="center">8.05/8.05<sup>&#x22c6;</sup>
</td>
<td align="center">2.39/3.20<sup>&#x22c6;</sup>
</td>
<td align="center">1.52/1.99<sup>&#x22c6;</sup>
</td>
<td align="center">1.31/1.79<sup>&#x22c6;</sup>
</td>
</tr>
<tr>
<td align="center">ProGen-Net</td>
<td align="center">
<xref ref-type="bibr" rid="B110">ul Haq et al. (2022)</xref>
</td>
<td align="center">7.63</td>
<td align="center">Monocular</td>
<td align="center">31.3/37.1</td>
<td align="center">25.9/32.7</td>
<td align="center">20.7/23.5</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">YOLOStereo3D</td>
<td align="center">
<xref ref-type="bibr" rid="B68">Liu et al. (2021c)</xref>
</td>
<td align="center">12.5</td>
<td align="center">Stereo</td>
<td align="center">65.68/-</td>
<td align="center">41.25/-</td>
<td align="center">30.42/-</td>
<td align="center">28.49/-</td>
<td align="center">19.75/-</td>
<td align="center">16.48/-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">Stereo RCNN</td>
<td align="center">
<xref ref-type="bibr" rid="B53">Li et al. (2019a)</xref>
</td>
<td align="center">3.57</td>
<td align="center">Stereo</td>
<td align="center">54.11/68.50</td>
<td align="center">36.69/48.30</td>
<td align="center">31.07/41.47</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">IDA-3D</td>
<td align="center">
<xref ref-type="bibr" rid="B87">Peng et al. (2020)</xref>
</td>
<td align="center">-</td>
<td align="center">Stereo</td>
<td align="center">54.97/70.68</td>
<td align="center">37.45/50.21</td>
<td align="center">32.23/42.93<sup>&#x22c6;</sup>
</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">PseudoLiDAR-E2E</td>
<td align="center">
<xref ref-type="bibr" rid="B90">Qian et al. (2020)</xref>
</td>
<td align="center">2.04</td>
<td align="center">Stereo</td>
<td align="center">71.1/82.7</td>
<td align="center">51.7/65.7</td>
<td align="center">46.7/58.4</td>
<td align="center">
<bold>32.3/35.7</bold>
</td>
<td align="center">
<bold>24.9/27.8</bold>
</td>
<td align="center">
<bold>21.5/23.4</bold>
</td>
<td align="center">
<bold>38.4/42.8</bold>
</td>
<td align="center">
<bold>24.1/26.2</bold>
</td>
<td align="center">
<bold>22.7/24.5</bold>
</td>
</tr>
<tr>
<td align="center">DSGN</td>
<td align="center">
<xref ref-type="bibr" rid="B16">Chen et al. (2020b)</xref>
</td>
<td align="center">1.47</td>
<td align="center">Stereo</td>
<td align="center">72.32/83.24</td>
<td align="center">54.27/63.91</td>
<td align="center">47.71/57.83</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">3DOP</td>
<td align="center">
<xref ref-type="bibr" rid="B14">Chen et al. (2017)</xref>
</td>
<td align="center">0.83</td>
<td align="center">Stereo</td>
<td align="center">
<bold>90.43/-</bold>
</td>
<td align="center">
<bold>68.90/-</bold>
</td>
<td align="center">
<bold>62.22/-</bold>
</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>MonoRUn, a well-known end-to-end framework, uses an uncertainty-aware reconstruction network to regress pixel-wise 3D object coordinates; during training, the predicted 3D coordinates are projected back onto the image plane (<xref ref-type="bibr" rid="B12">Chen et al., 2021</xref>). An approach that applies a disentangling transformation to the detection losses and generates a self-supervised confidence score that does not need class labels is proposed in (<xref ref-type="bibr" rid="B102">Simonelli et al., 2019</xref>). The framework introduced in (<xref ref-type="bibr" rid="B133">Zhang et al., 2023</xref>) makes the detection process depth-aware and represents 3D object candidates as set queries. A depth-based attention encoder produces a non-local depth embedding of the input image, and a depth-guided decoder then performs inter-query and query-scene depth feature interactions, yielding an adaptive estimate for each object query. An object detection algorithm that leverages the geometric relationship between the 2D and 3D perspectives, allowing 3D boxes to use convolutional features produced in image space, is proposed in (<xref ref-type="bibr" rid="B8">Brazil and Liu, 2019</xref>). The same work designs depth-aware convolutional layers that enable location-specific feature development, which in turn improves 3D scene understanding.</p>
<p>An end-to-end depth-aware transformer network for monocular 3D object detection, consisting of a feature enhancement stage and a transformer model, is proposed in (<xref ref-type="bibr" rid="B40">Huang et al., 2022</xref>). The authors of (<xref ref-type="bibr" rid="B74">Luo et al., 2021</xref>) introduce an approach that first performs shape alignment and then center alignment; combined with an attention block that extracts depth features, this improves the overall performance of the algorithm. Learned auxiliary monocular contexts are utilized in (<xref ref-type="bibr" rid="B67">Liu et al., 2022</xref>), whose design has three components: a feature backbone based on a Deep Neural Network (DNN), regression head branches that learn the parameters used in 3D bounding box prediction, and regression head branches that learn auxiliary contexts. A single-stage detector that benefits from depth pre-training, transfers information efficiently between the estimated depth and detection, and scales with the amount of unlabeled pre-training data is proposed in (<xref ref-type="bibr" rid="B85">Park et al., 2021</xref>). An anchor-based approach that regresses the dimensions and the orientation so that a 3D proposal can be constructed is introduced in (<xref ref-type="bibr" rid="B65">Liu et al., 2019</xref>).</p>
<p>End-to-end approaches provide efficient 3D object detection not only with monocular cameras but also with stereo cameras. In (<xref ref-type="bibr" rid="B93">Qin et al., 2019</xref>), 3D anchors are employed to establish object-level correspondences between the stereo images, which enables DNNs to learn effectively and, in turn, detect the object of interest in 3D space. A one-stage detector that incorporates the inference structure and knowledge of real-time detectors and adds a lightweight stereo matching module is proposed in (<xref ref-type="bibr" rid="B68">Liu et al., 2021c</xref>). Stereo R-CNN (<xref ref-type="bibr" rid="B53">Li et al., 2019</xref>) detects and associates objects in the left and right images simultaneously. Extra branches predict the object dimensions and sparse key-points, which are combined with the 2D left-right boxes to obtain a coarse 3D bounding box. Finally, the accurate 3D bounding box is recovered through a region-based photometric alignment using the left and right RoIs. In (<xref ref-type="bibr" rid="B87">Peng et al., 2020</xref>), only RGB images and annotated 3D bounding boxes are used as training data. Because depth estimation is the decisive factor, an Instance-Depth Aware module is introduced to predict the depth of the centre of the bounding box. A framework that is based on differentiable Change of Representation modules and trains the entire pseudo-LiDAR (PL) pipeline end-to-end is proposed in (<xref ref-type="bibr" rid="B90">Qian et al., 2020</xref>). Focusing on which representation is most suitable for 3D scene prediction, the Deep Stereo Geometry Network (DSGN) was proposed in (<xref ref-type="bibr" rid="B16">Chen et al., 2020b</xref>). 
This approach detects 3D objects on a differentiable volumetric representation that encodes the 3D geometric structure in regular 3D space. Another method, discussed in (<xref ref-type="bibr" rid="B14">Chen et al., 2017</xref>), detects objects by minimizing an energy function that encodes object size priors, object placement on the ground plane, and several depth-informed features, in combination with a CNN. We make a special mention of the multi-view end-to-end object detectors because they show a significant boost in robustness against adversarial attacks and poor depth representations (<xref ref-type="bibr" rid="B124">Xie et al., 2023</xref>; <xref ref-type="bibr" rid="B141">Zhu et al., 2023</xref>). <xref ref-type="bibr" rid="B43">Jiang et al. (2023)</xref> state that polar coordinates are a more natural representation of the 3D world in bird&#x2019;s eye view; they therefore propose a cross-attention-based polar detection head in which the projection models and the grid structure are re-parametrized to use polar coordinates. As input, the model uses six camera views that sweep the car&#x2019;s polar view. <xref ref-type="bibr" rid="B56">Li et al. (2023a)</xref> extensively studied the deficiencies of the depth modules in current multi-view 3D object detectors and introduced BEVDepth, which is trained with a supervision module based on LiDAR point clouds to correct the predicted depth distribution of each view. This yields more accurate depth predictions, avoids depth overfitting, and improves BEV semantic inference. In (<xref ref-type="bibr" rid="B55">Li et al., 2023b</xref>), the authors rejected ViT-based detectors because of their internal quadratic operations, i.e., cross attention. They instead proposed a fully convolutional multi-view detector that reports an AP similar to that of (<xref ref-type="bibr" rid="B56">Li et al., 2023a</xref>) with a threefold increase in inference speed. 
They implemented a purely convolutional depth estimation module, fusion module, and BEV encoder, thereby obtaining a linear computational cost. <xref ref-type="bibr" rid="B127">Xiong et al. (2023)</xref> explored prioritizing local features in the camera view over global ones, because learning the view transformation with global features was trickier owing to inaccuracies in the extrinsic parameters. Their network, called CAPE, employs a feature-guided key position embedding for local features and a query position encoder for global ones, and later fuses both in a single encoder.</p>
</sec>
<sec id="s4-3">
<title>4.3 Hybrid approaches</title>
<p>Before end-to-end learning methods could reach performance comparable to LiDAR detectors, new 3D representations such as Pseudo-LiDAR or BEV were proposed in the literature and applied in practice to obtain finer feature extraction and reduce the performance gap. The detection head was likewise inspired by LiDAR 3D detection or other frameworks; these methods are summarized in <xref ref-type="table" rid="T3">Table 3</xref>. In that sense, hybrid methods reuse previously proposed model-based or end-to-end learning methods as depth estimators and then introduce an internal change of representation so that 3D detectors originally designed for other frameworks, such as LiDAR detection, can be applied. Because the depth networks are already trained and achieve reasonable performance, researchers only need to focus on defining the appropriate representation and its conversion for detection. One such hybrid approach is proposed in (<xref ref-type="bibr" rid="B114">Wang et al., 2021</xref>), where the authors use a lightweight strategy to obtain learned coordinate representations and introduce a confidence-aware loss that enhances localization during prediction. Such hybrid approaches have been reported, both in the literature and in practice, to tackle the localization problem effectively and efficiently. In a similar hybrid approach (<xref ref-type="bibr" rid="B95">Reading et al., 2021</xref>), the predicted depth distribution is used to project the feature information into 3D space. The output detections are then obtained through a bird&#x2019;s-eye-view projection followed by a single-stage detector. 
In another approach (<xref ref-type="bibr" rid="B116">Wang et al., 2020</xref>), the data distribution is analysed and the interactions between background and foreground are examined, after which the ForeSeE method estimates the foreground and background depths separately. In (<xref ref-type="bibr" rid="B135">Zhang et al., 2022b</xref>), DNNs exploit a pair-wise distance to measure the similarity of dimensions, so that the model can use inter-object information to estimate dimensions more effectively.</p>
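<p>The feature lifting step described above for (<xref ref-type="bibr" rid="B95">Reading et al., 2021</xref>) can be sketched as an outer product between per-pixel image features and a per-pixel categorical depth distribution. The snippet below is a simplified illustration only: the function name <italic>lift_features</italic>, the softmax over depth bins, and the toy tensor sizes are assumptions and not the authors&#x2019; implementation.</p>
<preformat>
```python
import numpy as np


def lift_features(features, depth_logits):
    """Lift image features into a frustum grid (illustrative sketch).

    features:     (C, H, W) image feature map.
    depth_logits: (D, H, W) unnormalised scores over D depth bins per pixel.
    Returns a (C, D, H, W) array: each pixel's feature vector weighted by
    the probability that the pixel lies in each depth bin.
    """
    # Softmax over the depth-bin axis (shifted for numerical stability).
    shifted = depth_logits - depth_logits.max(axis=0, keepdims=True)
    probs = np.exp(shifted)
    probs = probs / probs.sum(axis=0, keepdims=True)
    # Per-pixel outer product of features (C) and depth probabilities (D).
    return np.einsum("chw,dhw->cdhw", features, probs)


# Hypothetical toy sizes: C=4 channels, D=5 depth bins, a 2x3 image.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 2, 3))
depth_logits = rng.normal(size=(5, 2, 3))
frustum = lift_features(features, depth_logits)
print(frustum.shape)
```
</preformat>
<p>Summing the result over the depth axis recovers the original features, because the depth probabilities of each pixel sum to one. A subsequent step, not shown here, would resample the frustum into a voxel grid and collapse it to a bird&#x2019;s-eye-view map for the detector.</p>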
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Hybrid methods comparison table. Best results are highlighted in <bold>bold</bold> font. The superscript <sup>&#x22c6;</sup> marks test set scores (reported where only those scores were available). The AP scores for the car category were calculated with an IoU (Intersection over Union) threshold of 70%, while a threshold of 50% was used for the pedestrian and cyclist categories, as required for submission to the official KITTI evaluation.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="3" align="center">Method</th>
<th rowspan="3" align="center">Source</th>
<th rowspan="3" align="center">FPS</th>
<th rowspan="3" align="center">Camera input</th>
<th colspan="9" align="center">KITTI dataset validation set (AP<sub>3<italic>D</italic>
</sub>/AP<sub>
<italic>BEV</italic>
</sub>)</th>
</tr>
<tr>
<th colspan="3" align="center">Cars</th>
<th colspan="3" align="center">Pedestrians</th>
<th colspan="3" align="center">Cyclists</th>
</tr>
<tr>
<th align="center">Easy</th>
<th align="center">Moderate</th>
<th align="center">Hard</th>
<th align="center">Easy</th>
<th align="center">Moderate</th>
<th align="center">Hard</th>
<th align="center">Easy</th>
<th align="center">Moderate</th>
<th align="center">Hard</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">PCT</td>
<td align="center">
<xref ref-type="bibr" rid="B114">Wang et al. (2021a)</xref>
</td>
<td align="center">22</td>
<td align="center">Monocular</td>
<td align="center">21.00/29.65</td>
<td align="center">13.37/19.03</td>
<td align="center">11.31/15.92</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">CaDDN</td>
<td align="center">
<xref ref-type="bibr" rid="B95">Reading et al. (2021)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">19.17/-<sup>&#x22c6;</sup>
</td>
<td align="center">13.41/-<sup>&#x22c6;</sup>
</td>
<td align="center">11.46/-<sup>&#x22c6;</sup>
</td>
<td align="center">12.87/-<sup>&#x22c6;</sup>
</td>
<td align="center">8.14/-<sup>&#x22c6;</sup>
</td>
<td align="center">6.76/-<sup>&#x22c6;</sup>
</td>
<td align="center">7/-<sup>&#x22c6;</sup>
</td>
<td align="center">3.41/-<sup>&#x22c6;</sup>
</td>
<td align="center">3.3/-<sup>&#x22c6;</sup>
</td>
</tr>
<tr>
<td align="center">ForeSeE-PL</td>
<td align="center">
<xref ref-type="bibr" rid="B116">Wang et al. (2020b)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">15.0/23.4</td>
<td align="center">12.5/17.4</td>
<td align="center">12.0/15.9</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">GUPNet &#x2b; DimEmb</td>
<td align="center">
<xref ref-type="bibr" rid="B135">Zhang et al. (2022b)</xref>
</td>
<td align="center">32.15</td>
<td align="center">Monocular</td>
<td align="center">23.62/32.82<sup>&#x22c6;</sup>
</td>
<td align="center">16.10/21.98<sup>&#x22c6;</sup>
</td>
<td align="center">13.41/18.70<sup>&#x22c6;</sup>
</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">MonoJSG</td>
<td align="center">
<xref ref-type="bibr" rid="B58">Lian et al. (2022b)</xref>
</td>
<td align="center">23.81</td>
<td align="center">Monocular</td>
<td align="center">24.69/32.59<sup>&#x22c6;</sup>
</td>
<td align="center">16.14/21.26<sup>&#x22c6;</sup>
</td>
<td align="center">13.64/18.18<sup>&#x22c6;</sup>
</td>
<td align="center">11.02/-<sup>&#x22c6;</sup>
</td>
<td align="center">7.49/-<sup>&#x22c6;</sup>
</td>
<td align="center">6.41/-<sup>&#x22c6;</sup>
</td>
<td align="center">5.45/-<sup>&#x22c6;</sup>
</td>
<td align="center">3.21/-<sup>&#x22c6;</sup>
</td>
<td align="center">2.57/-<sup>&#x22c6;</sup>
</td>
</tr>
<tr>
<td align="center">GUPNet</td>
<td align="center">
<xref ref-type="bibr" rid="B73">Lu et al. (2021)</xref>
</td>
<td align="center">29.4</td>
<td align="center">Monocular</td>
<td align="center">22.76/31.07</td>
<td align="center">16.64/22.94</td>
<td align="center">13.72/19.75</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">SGM3D</td>
<td align="center">
<xref ref-type="bibr" rid="B139">Zhou et al. (2022)</xref>
</td>
<td align="center">33</td>
<td align="center">Stereo/Monocular</td>
<td align="center">25.96/34.10</td>
<td align="center">17.81/23.62</td>
<td align="center">15.11/20.49</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">FGMF-AC</td>
<td align="center">
<xref ref-type="bibr" rid="B64">Liu et al. (2022b)</xref>
</td>
<td align="center">6.25</td>
<td align="center">Monocular</td>
<td align="center">29.67/37.70</td>
<td align="center">22.96/26.99</td>
<td align="center">18.97/24.29</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">Pseudo-Mono</td>
<td align="center">
<xref ref-type="bibr" rid="B109">Tao et al. (2023)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">27.41/35.84<sup>&#x22c6;</sup>
</td>
<td align="center">18.57/23.67<sup>&#x22c6;</sup>
</td>
<td align="center">16.16/20.19<sup>&#x22c6;</sup>
</td>
<td align="center">29.26/36.11</td>
<td align="center">22.15/28.04</td>
<td align="center">19.27/23.90</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">AM3D</td>
<td align="center">
<xref ref-type="bibr" rid="B76">Ma et al. (2019)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">32.23/-</td>
<td align="center">21.09/-</td>
<td align="center">17.26/-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">Mono PseudoLiDAR</td>
<td align="center">
<xref ref-type="bibr" rid="B118">Wang et al. (2022b)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">32.4/42.5</td>
<td align="center">21.4/29.1</td>
<td align="center">17.3/24.7</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">PseudoLiDAR</td>
<td align="center">
<xref ref-type="bibr" rid="B117">Wang et al. (2019)</xref>
</td>
<td align="center">1</td>
<td align="center">Stereo</td>
<td align="center">59.4/72.8</td>
<td align="center">39.8/51.8</td>
<td align="center">33.5/44.0</td>
<td align="center">33.8/41.3</td>
<td align="center">27.4/34.9</td>
<td align="center">24.0/30.1</td>
<td align="center">41.3/47.6</td>
<td align="center">25.2/29.9</td>
<td align="center">24.9/27.0</td>
</tr>
<tr>
<td align="center">SIDE</td>
<td align="center">
<xref ref-type="bibr" rid="B88">Peng et al. (2022)</xref>
</td>
<td align="center">3.85</td>
<td align="center">Stereo</td>
<td align="center">61.22/72.75</td>
<td align="center">44.46/53.71</td>
<td align="center">37.15/46.16</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">SAS3D</td>
<td align="center">
<xref ref-type="bibr" rid="B27">Gao et al. (2023)</xref>
</td>
<td align="center">35.71</td>
<td align="center">Stereo</td>
<td align="center">65.26/77.48</td>
<td align="center">47.07/58.41</td>
<td align="center">39.62/49.95</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">BirdGAN</td>
<td align="center">
<xref ref-type="bibr" rid="B104">Srivastava et al. (2019)</xref>
</td>
<td align="center">-</td>
<td align="center">Monocular</td>
<td align="center">58.26/-</td>
<td align="center">42.48/-</td>
<td align="center">40.72/-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">PLUMENet-S</td>
<td align="center">
<xref ref-type="bibr" rid="B119">Wang et al. (2021b)</xref>
</td>
<td align="center">12.5</td>
<td align="center">Stereo</td>
<td align="center">-/74.4</td>
<td align="center">-/61.7</td>
<td align="center">-/55.8</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">ZoomNet</td>
<td align="center">
<xref ref-type="bibr" rid="B129">Xu et al. (2020)</xref>
</td>
<td align="center">-</td>
<td align="center">Stereo</td>
<td align="center">62.96/78.68</td>
<td align="center">50.47/66.19</td>
<td align="center">43.63/57.60</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">PseudoLiDAR &#x2b; Geo</td>
<td align="center">
<xref ref-type="bibr" rid="B52">Li et al. (2022b)</xref>
</td>
<td align="center">28.57</td>
<td align="center">Stereo</td>
<td align="center">68.23/78.77</td>
<td align="center">48.34/59.01</td>
<td align="center">44.84/55.51</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">PseudoLiDAR&#x2b;&#x2b;</td>
<td align="center">
<xref ref-type="bibr" rid="B131">You et al. (2019)</xref>
</td>
<td align="center">11.1</td>
<td align="center">Stereo</td>
<td align="center">67.9/82.0</td>
<td align="center">50.1/64.0</td>
<td align="center">45.3/57.3</td>
<td align="center">53.6/63.7</td>
<td align="center">44.4/53.8</td>
<td align="center">38.1/46.8</td>
<td align="center">60.8/65.7</td>
<td align="center">40.8/45.8</td>
<td align="center">38.0/42.8</td>
</tr>
<tr>
<td align="center">CDN-DSGN</td>
<td align="center">
<xref ref-type="bibr" rid="B16">Chen et al. (2020b)</xref>
</td>
<td align="center">-</td>
<td align="center">Stereo</td>
<td align="center">74.5/83.3<sup>&#x22c6;</sup>
</td>
<td align="center">54.2/66.2<sup>&#x22c6;</sup>
</td>
<td align="center">46.4/57.7<sup>&#x22c6;</sup>
</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">Disp R-CNN</td>
<td align="center">
<xref ref-type="bibr" rid="B106">Sun et al. (2020b)</xref>
</td>
<td align="center">2.59</td>
<td align="center">Stereo</td>
<td align="center">70.18/83.29</td>
<td align="center">54.72/66.18</td>
<td align="center">46.99/57.60</td>
<td align="center">
<bold>43.87/50.70</bold>
</td>
<td align="center">36.26/38.33</td>
<td align="center">
<bold>29.81/33.50</bold>
</td>
<td align="center">
<bold>55.98/61.60</bold>
</td>
<td align="center">33.46/36.89</td>
<td align="center">
<bold>29.51/35.07</bold>
</td>
</tr>
<tr>
<td align="center">CG-Stereo</td>
<td align="center">
<xref ref-type="bibr" rid="B51">Li et al. (2020)</xref>
</td>
<td align="center">1.76</td>
<td align="center">Stereo</td>
<td align="center">76.17/87.31</td>
<td align="center">57.82/68.69</td>
<td align="center">54.63/65.80</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">LIGA-Stereo</td>
<td align="center">
<xref ref-type="bibr" rid="B33">Guo et al. (2021)</xref>
</td>
<td align="center">2.86</td>
<td align="center">Stereo</td>
<td align="center">
<bold>84.92/89.35</bold>
</td>
<td align="center">67.06/77.26</td>
<td align="center">
<bold>63.80/69.05</bold>
</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">-</td>
</tr>
<tr>
<td align="center">DSGN&#x2b;&#x2b;</td>
<td align="center">
<xref ref-type="bibr" rid="B15">Chen et al. (2022a)</xref>
</td>
<td align="center">5.62</td>
<td align="center">Stereo</td>
<td align="center">-</td>
<td align="center">
<bold>69.12/78.93</bold>
</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">
<bold>42.44/50.06</bold>
</td>
<td align="center">-</td>
<td align="center">-</td>
<td align="center">
<bold>42.48/45.77</bold>
</td>
<td align="center">-</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To benefit from the advantages of DNNs as well as from geometric constraints imposed at the pixel level, the object depth estimation problem is re-formulated as a refinement problem in (<xref ref-type="bibr" rid="B58">Lian et al., 2022b</xref>). To reduce the feature degradation caused by depth estimation errors, virtual image features are created in (<xref ref-type="bibr" rid="B19">Chen et al., 2022</xref>) using a disparity-wise dynamic convolution whose dynamic kernels are taken from the disparity feature map. A separate module that converts the input data from the 2D image plane to a 3D point cloud space, for a better input representation, is explored in (<xref ref-type="bibr" rid="B76">Ma et al., 2019</xref>). A PointNet backbone then performs 3D detection to determine the positions, dimensions, and orientations of the objects in 3D space. The same work also investigates a multi-modal feature fusion module that embeds complementary RGB cues into the generated point cloud representation to improve its discriminative capacity. In (<xref ref-type="bibr" rid="B120">Weng and Kitani, 2019</xref>), a two-stage 3D detection pipeline first identifies 2D object proposals in the input image and extracts a point cloud frustum from the pseudo-LiDAR for each proposal; an oriented 3D bounding box is then detected for each frustum, and ways of dealing with the noise in the pseudo-LiDAR are also discussed. Employing neural networks to convert 2D images to 3D representations, and then using existing networks that operate directly on 3D data for 3D object recognition and localization, is discussed in (<xref ref-type="bibr" rid="B104">Srivastava et al., 2019</xref>).</p>
<p>The use of stereo cameras with learning-based techniques to obtain a hybrid approach has been discussed in detail in the literature. One such approach (<xref ref-type="bibr" rid="B117">Wang et al., 2019</xref>) converts image-based depth maps to pseudo-LiDAR representations, which essentially mimic the LiDAR signal, taking into account the inner workings of convolutional neural networks. With this representation, the LiDAR-based detection techniques that are already available can be exploited. A stereo-image-based anchor-free 3D detection approach is investigated in (<xref ref-type="bibr" rid="B88">Peng et al., 2022</xref>), in which instance-level depth information is explored by constructing the cost volume from the RoIs of each object. Because the local cost volume carries little information, match reweighting and structure-aware attention are applied to make the depth information more concentrated. The method also proposes a shape-aware non-uniform sampling approach to make use of the relevant information from the object&#x2019;s exterior region. A framework called ZoomNet is introduced in (<xref ref-type="bibr" rid="B129">Xu et al., 2020</xref>) for stereo imagery-based 3D detection; it leverages a standard 2D object detection model and adaptive zooming to generate pairs of left-right bounding boxes. It also proposes a 3D fitting score to assess the 3D detection quality and learns part locations to increase resistance to occlusion.</p>
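<p>As a minimal sketch of the depth-map-to-pseudo-LiDAR conversion described above, the following Python snippet back-projects every pixel of a depth map into a 3D point with the pinhole camera model. The function name <italic>depth_to_pseudo_lidar</italic>, the toy depth map, and the intrinsics (fx, fy, cx, cy) are illustrative assumptions, not values or code taken from the cited works.</p>
<preformat>
```python
import numpy as np


def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Assumes a pinhole camera: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Points are expressed in the camera frame (x right, y down, z forward).
    """
    h, w = depth.shape
    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels without a valid (positive) depth estimate.
    return points[points[:, 2] > 0]


# Hypothetical 2x2 depth map (metres) and made-up intrinsics.
depth = np.array([[10.0, 0.0], [20.0, 5.0]])
cloud = depth_to_pseudo_lidar(depth, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
print(cloud.shape)
```
</preformat>
<p>With a stereo pair, the depth map itself is usually obtained from the estimated disparity d as z = fx &#xd7; b / d, where b is the stereo baseline. The resulting point cloud can then be passed to a detector designed for LiDAR input, typically after a transformation into the LiDAR coordinate frame.</p>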
<p>A lightweight pseudo-LiDAR 3D detection system is proposed in (<xref ref-type="bibr" rid="B52">Li et al., 2022</xref>) that achieves responsiveness and accuracy by using Binary Neural Networks (BNNs) to increase the completeness of objects and their representation in 3D space. A one-stage stereo-based 3D detection pipeline that simultaneously detects 3D objects and estimates depth, closing the gap between semantic and depth information, is discussed in (<xref ref-type="bibr" rid="B16">Chen et al., 2020b</xref>). The lack of disparity annotation is tackled in (<xref ref-type="bibr" rid="B106">Sun et al., 2020</xref>) by using a statistical shape model to produce dense disparity pseudo-ground-truth without LiDAR point clouds, which broadens applicability. To increase the efficiency of learning semantic features from indirect 3D supervision, a second 2D detection head was attached in (<xref ref-type="bibr" rid="B33">Guo et al., 2021</xref>), which enhanced the overall geometric and semantic representation. Finally, depth-wise plane sweeping, a dual-view stereo volume, and stereo-LiDAR Copy-Paste have been explored in (<xref ref-type="bibr" rid="B15">Chen et al., 2022</xref>) to lift 2D and 3D information to the stereo volume; the latter is a multi-modal data editing technique that maintains cross-modal alignment and increases data efficiency.</p>
</sec>
</sec>
<sec id="s5">
<title>5 Trends in reliable three-dimensional object detection</title>
<p>From the results and analysis presented in this work, it can be projected that the object detection community is moving towards hybrid methods on stereo vision that leverage the Pseudo-LiDAR representation and infer depth through dedicated networks, or that combine these approaches with geometric constraints. Current LiDAR-based state estimation and detection approaches have provided a wide array of hybrid techniques and have also provided the background needed to propose new detectors without manipulating LiDAR data. Among the methods with the best AP<sub>3<italic>D</italic></sub>/AP<sub><italic>BEV</italic></sub> results, all run at 30 FPS or more (real-time), and <italic>SAS</italic>3<italic>D</italic> (39.62/49.95) is the most prominent framework of the hybrid branch. It outperforms its counterparts in the geometric branch, MonoGround (15.58/20.56), and in the end-to-end branch, MonoCon (15.98/21.93). Interestingly, end-to-end methods do not generally perform better than the model-based approaches in terms of inference time. This suggests that end-to-end approaches still need further improvement for real-time applications, and more specifically for safety-critical scenarios in highly dynamic settings. In particular, end-to-end methods that rely on a two-branch structure tend to have higher AP but slower inference, as is also the case in 2D detection. Conversely, model-based detectors obtain rapid results because they derive depth from scene geometry rather than from dense feature maps, but this is also their weakest point, since the process is more sensitive to depth artifacts. This branch can be used for real-time applications, provided that its AP results are taken into account. Lastly, hybrid methods, especially those based on the Pseudo-LiDAR representation, depend heavily on the preceding depth estimator for both detection quality and speed; in other words, the depth estimator is their major bottleneck.</p>
<p>Several critical detection aspects are yet to be tackled, including the failure of CNNs to capture the finest texture details while focusing only on local visual information, and the limitations in their extraction capabilities. As a possible remedy to such issues, vision transformers (ViTs) have been proposed as a deep learning structure that obtains global visual information and preserves long-range spatial structures through its embedding mechanism.</p>
<p>Several transformer-based frameworks have already been presented in this survey, including MonoDTR (<xref ref-type="bibr" rid="B40">Huang et al., 2022</xref>), DST3D (<xref ref-type="bibr" rid="B123">Wu et al., 2022</xref>), and MonoDETR (<xref ref-type="bibr" rid="B133">Zhang et al., 2023</xref>). These frameworks have been used for precise depth inference or serve as feature backbones. Key features of such methods are their end-to-end learning capabilities and their scalability. It is also concerning that only a few approaches have explicitly addressed the occlusion problem, considering that most street scenarios suffer from varying degrees of occlusion. The KITTI dataset accounts for occlusion in mAP calculation through its Easy, Moderate, and Hard categories. As a possible solution to this issue, the authors of (<xref ref-type="bibr" rid="B64">Liu et al., 2022</xref>; <xref ref-type="bibr" rid="B105">Su et al., 2023</xref>) have exploited an anti-occlusion loss function that fuses depth and semantic information and defines a confidence occlusion parameter inside the loss. However, further investigation and analysis must be carried out to effectively tackle the issue of occlusion.</p>
<p>An alternative approach, undertaken by ZoomNet (<xref ref-type="bibr" rid="B129">Xu et al., 2020</xref>), is to design a part-aware mechanism that extracts features from non-occluded parts of the vehicle, such as wheels or license plates. Those parts can guide the pose prediction learning flow even for occluded instances. Robust detector design must also contemplate adversarial attacks, which represent a potentially high safety threat to pedestrians and other drivers. A prior survey reported extensively on how three types of attacks affect 3D detection. Its authors applied disturbances to class labels, object position, and orientation, and injected patch noise into 2D bounding boxes, dynamically resizing it according to the target size of the object. The findings of that work showed that depth-estimation-free approaches are more sensitive to adversarial attacks, that the BEV representation only provides robustness against class perturbation, and that temporal integration or multi-view input could be integrated into current networks to mitigate adversarial attacks even further.</p>
<p>It has also been observed that sensor fusion pipelines have gained considerable attention after attaining good positions in the KITTI testing score ranking. These methods can handle occlusion since they fuse features from multiple inputs. For instance, if a stereo camera setup suffers from occlusion, LiDAR features may be enough to complement visual aspects and predict the bounding box. In (<xref ref-type="bibr" rid="B48">Kim et al., 2023</xref>), the authors narrowed the gap between the image feature representation and the LiDAR point cloud by fusing them in a voxel feature volume to infer the 3D structure of the scene. The authors of (<xref ref-type="bibr" rid="B122">Wu et al., 2023</xref>) suggested reducing the redundancy of virtual point clouds and increasing depth accuracy by fusing RGB and LiDAR data in a new operator called VirConv (Virtual Sparse Convolution), which is based on a transformed refinement scheme. Furthermore, to rely primarily on visual information, multi-sensor fusion can be performed within a collaborative or networked 3D object detection pipeline, subject to the bandwidth and network-scheme considerations discussed in the previous section. Altogether, the aspects discussed in this section are poised to be an integral part of research interests in the coming years to accomplish reliable 3D object detection in autonomous driving.</p>
</sec>
<sec sec-type="conclusion" id="s6">
<title>6 Conclusion</title>
<p>This survey presented an in-depth discussion of 3D object detection for autonomous driving using stereo and monocular cameras. First, 2D object detection techniques were covered and their challenges in urban settings were highlighted, which in turn motivates real-time 3D object detection techniques. A classification into model-based and learning-based approaches was then presented to reflect the different feature extraction and 3D structure learning strategies. This was followed by a detailed comparison of the real-time capabilities of each approach, through its inference time (FPS) and KITTI dataset validation results. Furthermore, a taxonomy based on depth inference foundation, learning schemes, and internal representation was discussed, with three classes: geometrically limited, end-to-end learning, and hybrid methods. Assessment indicators have also been included to emphasise the benefits and shortcomings of each category of these techniques. To summarize, this paper aimed to provide a comprehensive survey and quantitative comparison of state-of-the-art 3D object detection methodologies, and identified research gaps and potential future directions in visual-based 3D object detection approaches for autonomous driving. On top of the identified research trends and challenges, the authors encourage a detailed focus on the social implications of AI usage, including policy making that addresses security and job concerns, the eco-friendliness of AVs, economic impact, and the availability of this resource to underrepresented social groups and countries.</p>
</sec>
</body>
<back>
<sec id="s7">
<title>Author contributions</title>
<p>EH conceptualized the work, and was solely responsible for the funding acquisition and supervision. MC, AJ, NB, and AB formulated the methodology and performed the literature survey. AB and NP performed project administration. MC, AJ, and NP performed the investigations. MC, AJ, and NP wrote the original draft. EH, AB, NP, MC, and AJ were all responsible for reviewing and editing this work. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by the Natural Sciences and Engineering Research Council of Canada RGPIN-2020-05097, and in part by the New Frontiers in Research Fund Grant NFRFE-2022-00795.</p>
</sec>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Arnold</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Al-Jarrah</surname>
<given-names>O. Y.</given-names>
</name>
<name>
<surname>Dianati</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Fallah</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Oxtoby</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Mouzakitis</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A survey on 3d object detection methods for autonomous driving applications</article-title>. <source>IEEE Trans. Intelligent Transp. Syst.</source> <volume>20</volume> (<issue>10</issue>), <fpage>3782</fpage>&#x2013;<lpage>3795</lpage>. <pub-id pub-id-type="doi">10.1109/tits.2019.2892405</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Azim</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Aycard</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Layer-based supervised classification of moving objects in outdoor dynamic environment using 3d laser scanner</article-title>,&#x201d; in <source>2014 IEEE intelligent vehicles symposium proceedings</source> (<publisher-name>IEEE</publisher-name>), <fpage>1408</fpage>&#x2013;<lpage>1414</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bao</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Kong</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Object-aware centroid voting for monocular 3d object detection</article-title>,&#x201d; in <source>2020 IEEE/RSJ international conference on intelligent robots and systems (IROS)</source> (<publisher-name>IEEE</publisher-name>), <fpage>2197</fpage>&#x2013;<lpage>2204</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bengler</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Dietmayer</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Farber</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Maurer</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Stiller</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Winner</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Three decades of driver assistance systems: review and future perspectives</article-title>. <source>IEEE Intell. Transp. Syst. Mag.</source> <volume>6</volume> (<issue>4</issue>), <fpage>6</fpage>&#x2013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1109/mits.2014.2336271</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bhatt</surname>
<given-names>N. P.</given-names>
</name>
<name>
<surname>Khajepour</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hashemi</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>MPC-PF: social interaction aware trajectory prediction of dynamic objects for autonomous driving using potential fields</article-title>,&#x201d; in <source>2022 IEEE/RSJ international conference on intelligent robots and systems (IROS)</source>, <fpage>9837</fpage>&#x2013;<lpage>9844</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bhatt</surname>
<given-names>N. P.</given-names>
</name>
<name>
<surname>Khajepour</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hashemi</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>MPC-PF: socially and spatially aware object trajectory prediction for autonomous driving systems using potential fields</article-title>. <source>IEEE Trans. Intelligent Transp. Syst.</source> <volume>24</volume>, <fpage>5351</fpage>&#x2013;<lpage>5361</lpage>. <pub-id pub-id-type="doi">10.1109/tits.2023.3243004</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bissell</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Birtchnell</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Elliott</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>E. L.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Autonomous automobilities: the social impacts of driverless vehicles</article-title>. <source>Curr. Sociol.</source> <volume>68</volume> (<issue>1</issue>), <fpage>116</fpage>&#x2013;<lpage>134</lpage>. <pub-id pub-id-type="doi">10.1177/0011392118816743</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Brazil</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>M3d-rpn: monocular 3d region proposal network for object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF international conference on computer vision</source>, <fpage>9287</fpage>&#x2013;<lpage>9296</lpage>.</citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Burnett</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Yoon</surname>
<given-names>D. J.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>A. Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <source>A multi-season autonomous driving dataset</source>.</citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Caesar</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Bankiti</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Lang</surname>
<given-names>A. H.</given-names>
</name>
<name>
<surname>Vora</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liong</surname>
<given-names>V. E.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Q.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>nuscenes: a multimodal dataset for autonomous driving</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>11618</fpage>&#x2013;<lpage>11628</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Carranza-Garc&#xed;a</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Torres-Mateo</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lara-Ben&#xed;tez</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Garc&#xed;a-Guti&#xe9;rrez</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data</article-title>. <source>Remote Sens.</source> <volume>13</volume> (<issue>1</issue>), <fpage>89</fpage>. <pub-id pub-id-type="doi">10.3390/rs13010089</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Tian</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2021b</year>). &#x201c;<article-title>Monorun: monocular 3d object detection by reconstruction and uncertainty propagation</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>10374</fpage>&#x2013;<lpage>10383</lpage>.</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2021a</year>). <article-title>Deep neural network based vehicle and pedestrian detection for autonomous driving: a survey</article-title>. <source>IEEE Trans. Intelligent Transp. Syst.</source> <volume>22</volume> (<issue>6</issue>), <fpage>3234</fpage>&#x2013;<lpage>3246</lpage>. <pub-id pub-id-type="doi">10.1109/tits.2020.2993926</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Kundu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Fidler</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Urtasun</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>3d object proposals using stereo imagery for accurate object class detection</article-title>. <source>IEEE Trans. pattern analysis Mach. Intell.</source> <volume>40</volume> (<issue>5</issue>), <fpage>1259</fpage>&#x2013;<lpage>1272</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2017.2706685</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Jia</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2022a</year>). <source>Dsgn&#x2b;&#x2b;: exploiting visual-spatial relation for stereo-based 3d detectors</source>. <publisher-name>IEEE Transactions on Pattern Analysis and Machine Intelligence</publisher-name>.</citation>
</ref>
<ref id="B16">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Jia</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020b</year>). &#x201c;<article-title>Dsgn: deep stereo geometry network for 3d object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>12533</fpage>&#x2013;<lpage>12542</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Tai</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020a</year>). &#x201c;<article-title>Monopair: monocular 3d object detection using pairwise spatial relationships</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>12093</fpage>&#x2013;<lpage>12102</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Robust vehicle driver assistance control for handover scenarios considering driving performances</article-title>. <source>IEEE Trans. Syst. Man, Cybern. Syst.</source> <volume>51</volume>, <fpage>4160</fpage>&#x2013;<lpage>4170</lpage>. <pub-id pub-id-type="doi">10.1109/tsmc.2019.2931484</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.-N.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2022b</year>). &#x201c;<article-title>Pseudo-stereo for monocular 3d object detection in autonomous driving</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>887</fpage>&#x2013;<lpage>897</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Heng</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Yeo</surname>
<given-names>Y. C.</given-names>
</name>
<name>
<surname>Geiger</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pollefeys</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Sattler</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Real-time dense mapping for self-driving vehicles using fisheye cameras</article-title>,&#x201d; in <source>2019 international conference on Robotics and automation (ICRA)</source> (<publisher-name>IEEE</publisher-name>), <fpage>6087</fpage>&#x2013;<lpage>6093</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ding</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Huo</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yi</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Z.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>Learning depth-guided convolutions for monocular 3d object detection</article-title>,&#x201d; in <source>2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source>, <fpage>11669</fpage>&#x2013;<lpage>11678</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ding</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Codella</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2022</year>). <source>DAVIT: dual attention vision transformers</source>. <publisher-name>Cornell University</publisher-name>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2204.03645">http://arxiv.org/abs/2204.03645</ext-link>.</comment>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Du</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Overview of two-stage object detection algorithms</article-title>. <source>J. Phys.</source> <volume>1544</volume> (<issue>1</issue>), <fpage>012033</fpage>. <pub-id pub-id-type="doi">10.1088/1742-6596/1544/1/012033</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Du</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Gozum</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2022</year>). <source>Unknown-aware object detection: learning what You don&#x2019;t know from videos in the wild</source>. <publisher-name>Cornell University</publisher-name>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2203.03800">http://arxiv.org/abs/2203.03800</ext-link>.</comment>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Everingham</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Van Gool</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>C. K.</given-names>
</name>
<name>
<surname>Winn</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>The pascal visual object classes (voc) challenge</article-title>. <source>Int. J. Comput. Vis.</source> <volume>88</volume> (<issue>2</issue>), <fpage>303</fpage>&#x2013;<lpage>338</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-009-0275-4</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>G&#xe4;hlert</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wan</surname>
<given-names>J.-J.</given-names>
</name>
<name>
<surname>Jourdan</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Finkbeiner</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Franke</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Denzler</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Single-shot 3d detection of vehicles from monocular rgb images via geometry constrained keypoints in real-time</source>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gao</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Pang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Real-time stereo 3d car detection with shape-aware non-uniform sampling</article-title>. <source>IEEE Trans. Intelligent Transp. Syst.</source> <volume>24</volume>, <fpage>4027</fpage>&#x2013;<lpage>4037</lpage>. <pub-id pub-id-type="doi">10.1109/tits.2022.3220422</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Geiger</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lenz</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Stiller</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Urtasun</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Vision meets robotics: the kitti dataset</article-title>. <source>Int. J. Robotics Res.</source> <volume>32</volume> (<issue>11</issue>), <fpage>1231</fpage>&#x2013;<lpage>1237</lpage>. <pub-id pub-id-type="doi">10.1177/0278364913491297</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Fast r-cnn</article-title>,&#x201d; in <source>Proceedings of the IEEE international conference on computer vision</source>, <fpage>1440</fpage>&#x2013;<lpage>1448</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Donahue</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Darrell</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Malik</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). <source>Rich feature hierarchies for accurate object detection and semantic segmentation</source>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Greenblatt</surname>
<given-names>J. B.</given-names>
</name>
<name>
<surname>Shaheen</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Automated vehicles, On-Demand mobility, and environmental impacts</article-title>. <source>Curr. Sustainable/Renewable Energy Rep.</source> <volume>2</volume> (<issue>3</issue>), <fpage>74</fpage>&#x2013;<lpage>81</lpage>. <pub-id pub-id-type="doi">10.1007/s40518-015-0038-5</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Gu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xiang</surname>
<given-names>Z.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). &#x201c;<article-title>Homography loss for monocular 3d object detection</article-title>,&#x201d; in <source>2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source>, <fpage>1070</fpage>&#x2013;<lpage>1079</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Liga-stereo: learning lidar geometry aware representations for stereo-based 3d detector</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF international conference on computer vision</source>, <fpage>3153</fpage>&#x2013;<lpage>3163</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gupta</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Anpalagan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Guan</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Khwaja</surname>
<given-names>A. S.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Deep learning for object detection and scene perception in self-driving cars: survey, challenges, and open issues</article-title>. <source>Array</source> <volume>10</volume>, <fpage>100057</fpage>. <pub-id pub-id-type="doi">10.1016/j.array.2021.100057</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hashemi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Qin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Khajepour</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Slip-aware driver assistance path tracking and stability control</article-title>. <source>Control Eng. Pract.</source> <volume>118</volume>, <fpage>104958</fpage>. <pub-id pub-id-type="doi">10.1016/j.conengprac.2021.104958</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Gkioxari</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Mask r-cnn</article-title>,&#x201d; in <source>Proceedings of the IEEE international conference on computer vision</source>, <fpage>2961</fpage>&#x2013;<lpage>2969</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hnewa</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Radha</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Object detection under rainy conditions for autonomous vehicles: a review of state-of-the-art and emerging techniques</article-title>. <source>IEEE Signal Process. Mag.</source> <volume>38</volume> (<issue>1</issue>), <fpage>53</fpage>&#x2013;<lpage>67</lpage>. <pub-id pub-id-type="doi">10.1109/msp.2020.2984801</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hoiem</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Divvala</surname>
<given-names>S. K.</given-names>
</name>
<name>
<surname>Hays</surname>
<given-names>J. H.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Pascal voc 2008 challenge</article-title>. <source>World Lit. Today</source> <volume>24</volume>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Taghavifar</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Qin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Na</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Rise-based integrated motion control of autonomous ground vehicles with asymptotic prescribed performance</article-title>. <source>IEEE Trans. Syst. Man, Cybern. Syst.</source> <volume>51</volume>, <fpage>5336</fpage>&#x2013;<lpage>5348</lpage>. <pub-id pub-id-type="doi">10.1109/tsmc.2019.2950468</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>K.-C.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>T.-H.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>H.-T.</given-names>
</name>
<name>
<surname>Hsu</surname>
<given-names>W. H.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Monodtr: monocular 3d object detection with depth-aware transformer</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>4012</fpage>&#x2013;<lpage>4021</lpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jana</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Mohanta</surname>
<given-names>P. P.</given-names>
</name>
</person-group> (<year>2022</year>). <source>Recent trends in 2d object detection and applications in video event recognition</source>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ji</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Na</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Lv</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Shared steering torque control for lane change assistance: a stochastic game-theoretic approach</article-title>. <source>IEEE Trans. Industrial Electron.</source> <volume>66</volume> (<issue>4</issue>), <fpage>3093</fpage>&#x2013;<lpage>3105</lpage>. <pub-id pub-id-type="doi">10.1109/tie.2018.2844784</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Miao</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>W.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>PolarFormer: multi-camera 3D object detection with polar transformer</article-title>. <source>Proc. AAAI Conf. Artif. Intell.</source> <volume>37</volume> (<issue>1</issue>), <fpage>1042</fpage>&#x2013;<lpage>1050</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v37i1.25185</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jiao</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>Z.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>A survey of deep learning-based object detection</article-title>. <source>IEEE Access</source> <volume>7</volume>, <fpage>128837</fpage>&#x2013;<lpage>128868</lpage>. <pub-id pub-id-type="doi">10.1109/access.2019.2939201</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jocher</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Chaurasia</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Qiu</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2023</year>). <source>YOLO by ultralytics</source>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://github.com/ultralytics/ultralytics">https://github.com/ultralytics/ultralytics</ext-link>.</comment>
</citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Khan</surname>
<given-names>S. A.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>H. J.</given-names>
</name>
<name>
<surname>Lim</surname>
<given-names>H. S.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Enhancing object detection in self-driving cars using a hybrid approach</article-title>. <source>Electronics</source> <volume>12</volume> (<issue>13</issue>), <fpage>2768</fpage>. <pub-id pub-id-type="doi">10.3390/electronics12132768</pub-id>
</citation>
</ref>
<ref id="B48">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kum</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>J. W.</given-names>
</name>
</person-group> (<year>2023</year>). <source>3d dual-fusion: dual-domain dual-query camera-lidar fusion for 3d object detection</source>.</citation>
</ref>
<ref id="B49">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ku</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Mozifian</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Harakeh</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Waslander</surname>
<given-names>S. L.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Joint 3d proposal generation and object detection from view aggregation</article-title>,&#x201d; in <source>2018 IEEE/RSJ international conference on intelligent robots and systems (IROS)</source> (<publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>8</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Ouyang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Sheng</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019b</year>). &#x201c;<article-title>Gs3d: an efficient 3d object detection framework for autonomous driving</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>1019</fpage>&#x2013;<lpage>1028</lpage>.</citation>
</ref>
<ref id="B51">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Ku</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Waslander</surname>
<given-names>S. L.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Confidence guided stereo 3d object detection with split depth estimation</article-title>,&#x201d; in <source>2020 IEEE/RSJ international conference on intelligent robots and systems (IROS)</source> (<publisher-name>IEEE</publisher-name>), <fpage>5776</fpage>&#x2013;<lpage>5783</lpage>.</citation>
</ref>
<ref id="B52">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2022b</year>). &#x201c;<article-title>Real-time pseudo-lidar 3d object detection with geometric constraints</article-title>,&#x201d; in <source>2022 IEEE 25th international conference on intelligent transportation systems (ITSC)</source> (<publisher-name>IEEE</publisher-name>), <fpage>3298</fpage>&#x2013;<lpage>3303</lpage>.</citation>
</ref>
<ref id="B53">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019a</year>). &#x201c;<article-title>Stereo r-cnn based 3d object detection for autonomous driving</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>7644</fpage>&#x2013;<lpage>7652</lpage>.</citation>
</ref>
<ref id="B54">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Geng</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Evangelidis</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Tulyakov</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2022a</year>). <source>EfficientFormer: vision transformers at MobileNet speed</source>. <publisher-name>Cornell University</publisher-name>.</citation>
</ref>
<ref id="B55">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Mengying</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yeo</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Chai</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2023b</year>). <article-title>Towards efficient 3D object detection in bird&#x2019;s-eye-space for autonomous driving: a convolutional-only approach</article-title>. <source>26th IEEE Int. Conf. Intelligent Transp. Syst. (ITSC 2023)</source>, <fpage>9</fpage>. <pub-id pub-id-type="doi">10.1109/ITSC57777.2023.10422223</pub-id>
</citation>
</ref>
<ref id="B56">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2023a</year>). &#x201c;<article-title>BEVDepth: acquisition of reliable depth for multi-view 3D object detection</article-title>,&#x201d; in <source>Proceedings of the AAAI conference on artificial intelligence</source>, <volume>37</volume>, <fpage>1477</fpage>&#x2013;<lpage>1485</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v37i2.25233</pub-id>
</citation>
</ref>
<ref id="B57">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Light-head r-cnn: in defense of two-stage object detector</source>. <comment>arXiv preprint arXiv:1711.07264</comment>.</citation>
</ref>
<ref id="B58">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lian</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2022b</year>). &#x201c;<article-title>Monojsg: joint semantic and geometric cost volume for monocular 3d object detection</article-title>,&#x201d; in <source>2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source>, <fpage>1060</fpage>&#x2013;<lpage>1069</lpage>.</citation>
</ref>
<ref id="B59">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lian</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2022a</year>). &#x201c;<article-title>Exploring geometric consistency for monocular 3d object detection</article-title>,&#x201d; in <source>2022 IEEE/CVF conference on computer vision and pattern recognition</source> (<publisher-loc>New Orleans, United States</publisher-loc>: <publisher-name>CVPR</publisher-name>), <fpage>1675</fpage>&#x2013;<lpage>1684</lpage>.</citation>
</ref>
<ref id="B60">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Bao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2022c</year>). <article-title>DetectFormer: category-assisted transformer for traffic scene object detection</article-title>. <source>Sensors</source> <volume>22</volume> (<issue>13</issue>), <fpage>4833</fpage>. <pub-id pub-id-type="doi">10.3390/s22134833</pub-id>
</citation>
</ref>
<ref id="B61">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Bao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2022a</year>). <article-title>Traffic sign detection via improved sparse R-CNN for autonomous vehicles</article-title>. <source>J. Adv. Transp.</source> <volume>2022</volume>, <fpage>1</fpage>&#x2013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1155/2022/3825532</pub-id>
</citation>
</ref>
<ref id="B62">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Bao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2022b</year>). <article-title>ALODAD: an anchor-free lightweight object detector for autonomous driving</article-title>. <source>IEEE Access</source> <volume>10</volume>, <fpage>40701</fpage>&#x2013;<lpage>40714</lpage>. <pub-id pub-id-type="doi">10.1109/access.2022.3166923</pub-id>
</citation>
</ref>
<ref id="B63">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>T.-Y.</given-names>
</name>
<name>
<surname>Maire</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Belongie</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Hays</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Perona</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ramanan</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2014</year>). &#x201c;<article-title>Microsoft coco: common objects in context</article-title>,&#x201d; in <source>European conference on computer vision</source> (<publisher-name>Springer</publisher-name>), <fpage>740</fpage>&#x2013;<lpage>755</lpage>.</citation>
</ref>
<ref id="B64">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2022b</year>). <article-title>Fine-grained multilevel fusion for anti-occlusion monocular 3d object detection</article-title>. <source>IEEE Trans. Image Process.</source> <volume>31</volume>, <fpage>4050</fpage>&#x2013;<lpage>4061</lpage>. <pub-id pub-id-type="doi">10.1109/tip.2022.3180210</pub-id>
</citation>
</ref>
<ref id="B65">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Tian</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Deep fitting degree scoring network for monocular 3d object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>1057</fpage>&#x2013;<lpage>1066</lpage>.</citation>
</ref>
<ref id="B67">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Xue</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Learning auxiliary monocular contexts helps monocular 3d object detection</article-title>. <source>Proc. AAAI Conf. Artif. Intell.</source> <volume>36</volume> (<issue>2</issue>), <fpage>1810</fpage>&#x2013;<lpage>1818</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v36i2.20074</pub-id>
</citation>
</ref>
<ref id="B68">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021c</year>). &#x201c;<article-title>Yolostereo3d: a step back to 2d for efficient stereo 3d detection</article-title>,&#x201d; in <source>2021 IEEE international conference on robotics and automation (ICRA)</source> (<publisher-name>IEEE</publisher-name>), <fpage>13018</fpage>&#x2013;<lpage>13024</lpage>.</citation>
</ref>
<ref id="B69">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yixuan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021b</year>). <article-title>Ground-aware monocular 3d object detection for autonomous driving</article-title>. <source>IEEE Robotics Automation Lett.</source> <volume>6</volume> (<issue>2</issue>), <fpage>919</fpage>&#x2013;<lpage>926</lpage>. <pub-id pub-id-type="doi">10.1109/lra.2021.3052442</pub-id>
</citation>
</ref>
<ref id="B71">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<etal/>
</person-group> (<year>2021a</year>). <source>Swin transformer: hierarchical vision transformer using shifted windows</source>. <publisher-name>Cornell University</publisher-name>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2103.14030">http://arxiv.org/abs/2103.14030</ext-link>.</comment>
</citation>
</ref>
<ref id="B72">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>T&#xf3;th</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Smoke: single-stage monocular 3d object detection via keypoint estimation</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops</source>, <fpage>996</fpage>&#x2013;<lpage>997</lpage>.</citation>
</ref>
<ref id="B73">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>Q.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). &#x201c;<article-title>Geometry uncertainty projection network for monocular 3d object detection</article-title>,&#x201d; in <source>2021 IEEE/CVF international conference on computer vision (ICCV)</source>, <fpage>3091</fpage>&#x2013;<lpage>3101</lpage>.</citation>
</ref>
<ref id="B74">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Luo</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Dai</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Shao</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>M3dssd: monocular 3d single stage object detector</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>6145</fpage>&#x2013;<lpage>6154</lpage>.</citation>
</ref>
<ref id="B75">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lyu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>RTMDET: an empirical study of designing real-time object detectors</source>. <publisher-name>Cornell University</publisher-name>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2212.07784">https://arxiv.org/abs/2212.07784</ext-link>.</comment>
</citation>
</ref>
<ref id="B76">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ouyang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Accurate monocular 3d object detection via color-embedded 3d reconstruction for autonomous driving</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF international conference on computer vision</source>, <fpage>6851</fpage>&#x2013;<lpage>6860</lpage>.</citation>
</ref>
<ref id="B77">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marzbani</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Khayyam</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>To</surname>
<given-names>C. N.</given-names>
</name>
<name>
<surname>Quoc</surname>
<given-names>&#x110;. V.</given-names>
</name>
<name>
<surname>Jazar</surname>
<given-names>R. N.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Autonomous vehicles: autodriver algorithm and vehicle dynamics</article-title>. <source>IEEE Trans. Veh. Technol.</source> <volume>68</volume> (<issue>4</issue>), <fpage>3201</fpage>&#x2013;<lpage>3211</lpage>. <pub-id pub-id-type="doi">10.1109/tvt.2019.2895297</pub-id>
</citation>
</ref>
<ref id="B78">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Michaelis</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Mitzkus</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Geirhos</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Rusak</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Bringmann</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Ecker</surname>
<given-names>A. S.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Benchmarking robustness in object detection: autonomous driving when winter is coming</article-title>. <source>CoRR</source> <volume>abs/1907.07484</volume>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1907.07484">http://arxiv.org/abs/1907.07484</ext-link>.</comment> <pub-id pub-id-type="doi">10.48550/arXiv.1907.07484</pub-id>
</citation>
</ref>
<ref id="B79">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mohammadbagher</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Bhatt</surname>
<given-names>N. P.</given-names>
</name>
<name>
<surname>Hashemi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Fidan</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Khajepour</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Real-time pedestrian localization and state estimation using moving horizon estimation</article-title>,&#x201d; in <source>23rd intelligent transportation systems conference (ITSC)</source>.</citation>
</ref>
<ref id="B80">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mousavian</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Anguelov</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Flynn</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kosecka</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). <source>3d bounding box estimation using deep learning and geometry</source>.</citation>
</ref>
<ref id="B81">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mukhtar</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Xia</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>T. B.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Vehicle detection techniques for collision avoidance systems: a review</article-title>. <source>IEEE Trans. Intelligent Transp. Syst.</source> <volume>16</volume> (<issue>5</issue>), <fpage>2318</fpage>&#x2013;<lpage>2338</lpage>. <pub-id pub-id-type="doi">10.1109/tits.2015.2409109</pub-id>
</citation>
</ref>
<ref id="B82">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Naiden</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Paunescu</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Jeon</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Leordeanu</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Shift r-cnn: deep monocular 3d object detection with closed-form geometric constraints</article-title>,&#x201d; in <source>2019 IEEE international conference on image processing (ICIP)</source>, <fpage>61</fpage>&#x2013;<lpage>65</lpage>.</citation>
</ref>
<ref id="B84">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Othman</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Public acceptance and perception of autonomous vehicles: a comprehensive review</article-title>. <source>AI Ethics</source> <volume>1</volume> (<issue>3</issue>), <fpage>355</fpage>&#x2013;<lpage>387</lpage>. <pub-id pub-id-type="doi">10.1007/s43681-021-00041-8</pub-id>
</citation>
</ref>
<ref id="B85">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Park</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ambrus</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Guizilini</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Gaidon</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Is pseudo-lidar needed for monocular 3d object detection?</article-title> <source>Proc. IEEE/CVF Int. Conf. Comput. Vis.</source>, <fpage>3142</fpage>&#x2013;<lpage>3152</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00313</pub-id>
</citation>
</ref>
<ref id="B86">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pendleton</surname>
<given-names>S. D.</given-names>
</name>
<name>
<surname>Andersen</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Meghjani</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Eng</surname>
<given-names>Y. H.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Perception, planning, control, and coordination for autonomous vehicles</article-title>. <source>Machines</source> <volume>5</volume> (<issue>1</issue>), <fpage>6</fpage>. <pub-id pub-id-type="doi">10.3390/machines5010006</pub-id>
</citation>
</ref>
<ref id="B87">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Ida-3d: instance-depth-aware 3d object detection from stereo vision for autonomous driving</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>13012</fpage>&#x2013;<lpage>13021</lpage>.</citation>
</ref>
<ref id="B88">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Peng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Side: center-based stereo 3d detector with structure-aware instance depth estimation</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF winter conference on applications of computer vision</source>, <fpage>119</fpage>&#x2013;<lpage>128</lpage>.</citation>
</ref>
<ref id="B89">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pitropov</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Garcia</surname>
<given-names>D. E.</given-names>
</name>
<name>
<surname>Rebello</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Smart</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Czarnecki</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Canadian adverse driving conditions dataset</article-title>. <source>Int. J. Robotics Res.</source> <volume>40</volume> (<issue>4-5</issue>), <fpage>681</fpage>&#x2013;<lpage>690</lpage>. <pub-id pub-id-type="doi">10.1177/0278364920979368</pub-id>
</citation>
</ref>
<ref id="B90">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Qian</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Garg</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>You</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Belongie</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Hariharan</surname>
<given-names>B.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>End-to-end pseudo-lidar for image-based 3d object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>5881</fpage>&#x2013;<lpage>5890</lpage>.</citation>
</ref>
<ref id="B91">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qian</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Lai</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>3d object detection for autonomous driving: a survey</article-title>. <source>Pattern Recognit.</source> <volume>130</volume>, <fpage>108796</fpage>. <pub-id pub-id-type="doi">10.1016/j.patcog.2022.108796</pub-id>
</citation>
</ref>
<ref id="B92">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Qin</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Monoground: detecting monocular 3d objects from the ground</article-title>,&#x201d; in <source>2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source>, <fpage>3783</fpage>&#x2013;<lpage>3792</lpage>.</citation>
</ref>
<ref id="B93">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Qin</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Triangulation learning network: from monocular to stereo 3d object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>7615</fpage>&#x2013;<lpage>7623</lpage>.</citation>
</ref>
<ref id="B94">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ranft</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Stiller</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>The role of machine vision for intelligent vehicles</article-title>. <source>IEEE Trans. Intelligent Veh.</source> <volume>1</volume> (<issue>1</issue>), <fpage>8</fpage>&#x2013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1109/tiv.2016.2551553</pub-id>
</citation>
</ref>
<ref id="B95">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Reading</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Harakeh</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Chae</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Waslander</surname>
<given-names>S. L.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Categorical depth distribution network for monocular 3d object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>8555</fpage>&#x2013;<lpage>8564</lpage>.</citation>
</ref>
<ref id="B96">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Redmon</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Divvala</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Farhadi</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2016</year>). <source>You only look once: unified, real-time object detection</source>.</citation>
</ref>
<ref id="B97">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). <source>Faster r-cnn: towards real-time object detection with region proposal networks</source>.</citation>
</ref>
<ref id="B98">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Roddick</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kendall</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Cipolla</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2018</year>). <source>Orthographic feature transform for monocular 3d object detection</source>. <comment>arXiv preprint arXiv:1811.08188</comment>.</citation>
</ref>
<ref id="B99">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schwarting</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Alonso-Mora</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Rus</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Planning and decision-making for autonomous vehicles</article-title>. <source>Annu. Rev. Control Robot. Auton. Syst.</source> <volume>1</volume>, <fpage>187</fpage>&#x2013;<lpage>210</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-control-060117-105157</pub-id>
</citation>
</ref>
<ref id="B100">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shahedi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Dadashpour</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Rezaei</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Barriers to the sustainable adoption of autonomous vehicles in developing countries: a multi-criteria decision-making approach</article-title>. <source>Heliyon</source> <volume>9</volume> (<issue>5</issue>), <fpage>e15975</fpage>. <pub-id pub-id-type="doi">10.1016/j.heliyon.2023.e15975</pub-id>
</citation>
</ref>
<ref id="B83">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Silva</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Cordera</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Gonz&#xe1;lez-Gonz&#xe1;lez</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Nogu&#xe9;s</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Environmental impacts of autonomous vehicles: a review of the scientific literature</article-title>. <source>Sci. Total Environ.</source> <volume>830</volume>, <fpage>154615</fpage>. <pub-id pub-id-type="doi">10.1016/j.scitotenv.2022.154615</pub-id>
</citation>
</ref>
<ref id="B102">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Simonelli</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bul&#xf2;</surname>
<given-names>S. R.</given-names>
</name>
<name>
<surname>Porzi</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>L&#xf3;pez-Antequera</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kontschieder</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Disentangling monocular 3d object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF international conference on computer vision</source>, <fpage>1991</fpage>&#x2013;<lpage>1999</lpage>.</citation>
</ref>
<ref id="B103">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simonelli</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bul&#xf2;</surname>
<given-names>S. R.</given-names>
</name>
<name>
<surname>Porzi</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>L&#xf3;pez-Antequera</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Kontschieder</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2019c</year>). <article-title>Disentangling monocular 3d object detection</article-title>. <source>CoRR</source> <volume>1905</volume>, <fpage>12365</fpage>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1905.12365">http://arxiv.org/abs/1905.12365</ext-link>
</comment>. <pub-id pub-id-type="doi">10.1109/TPAMI.2020.3025077</pub-id>
</citation>
</ref>
<ref id="B104">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Srivastava</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jurie</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Sharma</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Learning 2d to 3d lifting for object detection in 3d for autonomous vehicles</article-title>,&#x201d; in <source>2019 IEEE/RSJ international conference on intelligent robots and systems (IROS)</source> (<publisher-name>IEEE</publisher-name>), <fpage>4504</fpage>&#x2013;<lpage>4511</lpage>.</citation>
</ref>
<ref id="B105">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Su</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Di</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Manhardt</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Rambach</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Busam</surname>
<given-names>B.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Opa-3d: occlusion-aware pixel-wise aggregation for monocular 3d object detection</article-title>. <source>IEEE Robotics Automation Lett.</source> <volume>8</volume> (<issue>3</issue>), <fpage>1327</fpage>&#x2013;<lpage>1334</lpage>. <pub-id pub-id-type="doi">10.1109/lra.2023.3238137</pub-id>
</citation>
</ref>
<ref id="B106">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> (<year>2020b</year>). &#x201c;<article-title>Disp r-cnn: stereo 3d object detection via shape prior guided instance disparity estimation</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>10545</fpage>&#x2013;<lpage>10554</lpage>.</citation>
</ref>
<ref id="B107">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Kretzschmar</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Dotiwalla</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chouard</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Patnaik</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Tsui</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2020a</year>). &#x201c;<article-title>Scalability in perception for autonomous driving: Waymo open dataset</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>2446</fpage>&#x2013;<lpage>2454</lpage>.</citation>
</ref>
<ref id="B108">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tan</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Pang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Le</surname>
<given-names>Q. V.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Efficientdet: scalable and efficient object detection</source>.</citation>
</ref>
<ref id="B109">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tao</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Pseudo-mono for monocular 3d object detection in autonomous driving</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>33</volume>, <fpage>3962</fpage>&#x2013;<lpage>3975</lpage>. <pub-id pub-id-type="doi">10.1109/tcsvt.2023.3237579</pub-id>
</citation>
</ref>
<ref id="B110">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>ul Haq</surname>
<given-names>Q. M.</given-names>
</name>
<name>
<surname>Haq</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Ruan</surname>
<given-names>S.-J.</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>P.-J.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>D.-Q.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>3d object detection based on proposal generation network utilizing monocular images</article-title>. <source>IEEE Consum. Electron. Mag.</source> <volume>11</volume> (<issue>5</issue>), <fpage>47</fpage>&#x2013;<lpage>53</lpage>. <pub-id pub-id-type="doi">10.1109/mce.2021.3059565</pub-id>
</citation>
</ref>
<ref id="B111">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>C.-Y.</given-names>
</name>
<name>
<surname>Bochkovskiy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Liao</surname>
<given-names>H.-Y. M.</given-names>
</name>
</person-group> (<year>2022a</year>). <source>Yolov7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors</source>.</citation>
</ref>
<ref id="B112">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>D. Z.</given-names>
</name>
<name>
<surname>Posner</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Newman</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>What could move? finding cars, pedestrians and bicyclists in 3d laser data</article-title>,&#x201d; in <source>2012 IEEE international conference on robotics and automation</source> (<publisher-name>IEEE</publisher-name>), <fpage>4038</fpage>&#x2013;<lpage>4044</lpage>.</citation>
</ref>
<ref id="B113">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Schnelle</surname>
<given-names>S. C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>A gain-scheduling driver assistance trajectory-following algorithm considering different driver steering characteristics</article-title>. <source>IEEE Trans. Intelligent Transp. Syst.</source> <volume>18</volume> (<issue>5</issue>), <fpage>1097</fpage>&#x2013;<lpage>1108</lpage>. <pub-id pub-id-type="doi">10.1109/tits.2016.2598792</pub-id>
</citation>
</ref>
<ref id="B114">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2021a</year>). <article-title>Progressive coordinate transforms for monocular 3d object detection</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>34</volume>, <fpage>13364</fpage>&#x2013;<lpage>13377</lpage>.</citation>
</ref>
<ref id="B115">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Shivanna</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>D. Z.</given-names>
</name>
<name>
<surname>Jain</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2020a</year>). <article-title>Improved deep and cross network for feature cross learning in web-scale learning to rank systems</article-title>. <source>CoRR</source> <volume>2008</volume>, <fpage>13535</fpage>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2008.13535">https://arxiv.org/abs/2008.13535</ext-link>
</comment>. <pub-id pub-id-type="doi">10.1145/3442381.3450078</pub-id>
</citation>
</ref>
<ref id="B116">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Kong</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2020b</year>). <article-title>Task-aware monocular depth estimation for 3d object detection</article-title>. <source>Proc. AAAI Conf. Artif. Intell.</source> <volume>34</volume> (<issue>07</issue>), <fpage>1785</fpage>&#x2013;<lpage>1797</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v34i07.6908</pub-id>
</citation>
</ref>
<ref id="B117">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chao</surname>
<given-names>W.-L.</given-names>
</name>
<name>
<surname>Garg</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hariharan</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Campbell</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Weinberger</surname>
<given-names>K. Q.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>8445</fpage>&#x2013;<lpage>8453</lpage>.</citation>
</ref>
<ref id="B118">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zuo</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2022b</year>). &#x201c;<article-title>Monocular 3d object detection based on pseudo-lidar point cloud for autonomous vehicles</article-title>,&#x201d; in <source>2022 41st Chinese control conference (CCC)</source>, <fpage>5469</fpage>&#x2013;<lpage>5474</lpage>.</citation>
</ref>
<ref id="B119">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Urtasun</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2021b</year>). &#x201c;<article-title>Plumenet: efficient 3d object detection from stereo images</article-title>,&#x201d; in <source>2021 IEEE/RSJ international conference on intelligent robots and systems (IROS)</source> (<publisher-name>IEEE</publisher-name>), <fpage>3383</fpage>&#x2013;<lpage>3390</lpage>.</citation>
</ref>
<ref id="B120">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Weng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Kitani</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Monocular 3d object detection with pseudo-lidar point cloud</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF international conference on computer vision workshops</source>.</citation>
</ref>
<ref id="B121">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Williams</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Das</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Fisher</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Assessing the sustainability implications of autonomous vehicles: recommendations for research community practice</article-title>. <source>Sustainability</source> <volume>12</volume> (<issue>5</issue>), <fpage>1902</fpage>. <pub-id pub-id-type="doi">10.3390/su12051902</pub-id>
</citation>
</ref>
<ref id="B122">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2023</year>). <source>Virtual sparse convolution for multimodal 3d object detection</source>.</citation>
</ref>
<ref id="B123">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Dst3d: dla-swin transformer for single-stage monocular 3d object detection</article-title>,&#x201d; in <source>2022 IEEE intelligent vehicles symposium (IV)</source>, <fpage>411</fpage>&#x2013;<lpage>418</lpage>.</citation>
</ref>
<ref id="B124">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xie</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2023</year>). <source>On the adversarial robustness of camera-based 3D object detection</source>. <publisher-name>Cornell University</publisher-name>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2301.10766">http://arxiv.org/abs/2301.10766</ext-link>.</comment>
</citation>
</ref>
<ref id="B126">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xie</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Oriented r-cnn for object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF international conference on computer vision</source>, <fpage>3520</fpage>&#x2013;<lpage>3529</lpage>.</citation>
</ref>
<ref id="B127">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xiong</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Gong</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wan</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>E.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). &#x201c;<article-title>CAPE: camera view position embedding for multi-view 3D object detection</article-title>,&#x201d; in <source>2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source>. <pub-id pub-id-type="doi">10.1109/cvpr52729.2023.02066</pub-id>
</citation>
</ref>
<ref id="B128">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Multi-level fusion based 3d object detection from monocular images</article-title>,&#x201d; in <source>2018 IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>2345</fpage>&#x2013;<lpage>2353</lpage>.</citation>
</ref>
<ref id="B129">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Zoomnet: Part-aware adaptive zooming neural network for 3d object detection</article-title>. <source>Proc. AAAI Conf. Artif. Intell.</source> <volume>34</volume> (<issue>07</issue>), <fpage>12556</fpage>&#x2013;<lpage>12564</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v34i07.6945</pub-id>
</citation>
</ref>
<ref id="B130">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ye</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>Rope3D: the roadside perception dataset for autonomous driving and monocular 3D object detection task</source>. <publisher-name>Cornell University</publisher-name>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2203.13608">http://arxiv.org/abs/2203.13608</ext-link>.</comment>
</citation>
</ref>
<ref id="B131">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>You</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chao</surname>
<given-names>W.-L.</given-names>
</name>
<name>
<surname>Garg</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Pleiss</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Hariharan</surname>
<given-names>B.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <source>Pseudo-lidar&#x2b;&#x2b;: accurate depth for 3d object detection in autonomous driving</source>. <comment>
<italic>arXiv preprint arXiv:1906.06310</italic>
</comment>.</citation>
</ref>
<ref id="B132">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Shu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Huo</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>DAIR-V2X: a large-scale dataset for vehicle-infrastructure cooperative 3D object detection</source>. <publisher-name>Cornell University</publisher-name>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2204.05575">http://arxiv.org/abs/2204.05575</ext-link>.</comment>
</citation>
</ref>
<ref id="B133">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Qiu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Qiao</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Monodetr: depth-guided transformer for monocular 3d object detection</article-title>. <source>ICCV</source>, <fpage>2022</fpage>. <pub-id pub-id-type="doi">10.1109/ICCV51070.2023.00840</pub-id>
</citation>
</ref>
<ref id="B134">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Qin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Gao</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hashemi</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2022a</year>). <article-title>Mile: multi-objective integrated model predictive adaptive cruise control for intelligent vehicle</article-title>. <source>IEEE Trans. Industrial Inf.</source> <volume>19</volume>, <fpage>8539</fpage>&#x2013;<lpage>8548</lpage>. <pub-id pub-id-type="doi">10.1109/tii.2022.3220842</pub-id>
</citation>
</ref>
<ref id="B135">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2022b</year>). &#x201c;<article-title>Dimension embeddings for monocular 3d object detection</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>1589</fpage>&#x2013;<lpage>1598</lpage>.</citation>
</ref>
<ref id="B136">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Z.-Q.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>S.-T.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). <source>Object detection with deep learning: a review</source>.</citation>
</ref>
<ref id="B137">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Long</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Monet3d: towards accurate monocular 3d object localization in real time</article-title>,&#x201d; in <source>International conference on machine learning</source> (<publisher-loc>Held virtual</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>11503</fpage>&#x2013;<lpage>11512</lpage>.</citation>
</ref>
<ref id="B138">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Monocular 3d object detection: an extrinsic parameter free approach</article-title>,&#x201d; in <source>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>, <fpage>7556</fpage>&#x2013;<lpage>7566</lpage>.</citation>
</ref>
<ref id="B139">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zou</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Sgm3d: stereo guided monocular 3d object detection</article-title>. <source>IEEE Robotics Automation Lett.</source> <volume>7</volume> (<issue>4</issue>), <fpage>10478</fpage>&#x2013;<lpage>10485</lpage>. <pub-id pub-id-type="doi">10.1109/lra.2022.3191849</pub-id>
</citation>
</ref>
<ref id="B140">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ge</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2023a</year>). &#x201c;<article-title>Monoedge: monocular 3d object detection using local perspectives</article-title>,&#x201d; in <source>2023 IEEE/CVF winter conference on applications of computer vision (WACV)</source>, <fpage>643</fpage>&#x2013;<lpage>652</lpage>.</citation>
</ref>
<ref id="B141">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Hai</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>W.</given-names>
</name>
<etal/>
</person-group> (<year>2023b</year>). &#x201c;<article-title>Understanding the robustness of 3D object detection with bird&#x2019;s-eye-view representations in autonomous driving</article-title>,&#x201d; in <source>2023 IEEE/CVF conference on computer vision and pattern recognition (CVPR)</source>, <fpage>6</fpage>. <pub-id pub-id-type="doi">10.1109/cvpr52729.2023.02069</pub-id>
</citation>
</ref>
<ref id="B142">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zimmer</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Cre&#xdf;</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>H. T.</given-names>
</name>
<name>
<surname>Knoll</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2023</year>). <source>A9 intersection dataset: all you need for urban 3D camera-LiDAR roadside perception</source>. <publisher-name>Cornell University</publisher-name>. <comment>[Online]. Available: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2306.09266">http://arxiv.org/abs/2306.09266</ext-link>.</comment>
</citation>
</ref>
</ref-list>
</back>
</article>