AUTHOR=Wang Hua, Wang Wenshuai, Li Wenhao, Liu Hong
TITLE=Dense captioning and multidimensional evaluations for indoor robotic scenes
JOURNAL=Frontiers in Neurorobotics
VOLUME=17
YEAR=2023
URL=https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2023.1280501
DOI=10.3389/fnbot.2023.1280501
ISSN=1662-5218
ABSTRACT=The burgeoning field of intelligent technologies has amplified the need for human-computer interaction. Scene understanding, a critical aspect of this interaction, involves generating advanced semantic descriptions based on scene content, a challenging yet vital task. Dense captioning methods for robot scenes can be categorized into 2D image-based methods and 3D point cloud-based methods. While 2D methods are simpler to implement, they lack comprehensive spatial information. Conversely, 3D methods offer a deeper understanding of complex scenes but are computationally intensive. This paper introduces RGBD2Cap, a novel scene semantic description method based on RGBD images, and evaluates its results along several dimensions, i.e., automatic evaluation, manual evaluation, and simulation testing. RGBD2Cap employs an RGB and Depth multimodal fusion module for multi-level feature extraction, Faster-RCNN for indoor target detection, and a Top-Down Attention LSTM for semantic description generation. The experimental data are derived from the ScanRefer indoor scene dataset, with RGB and depth images rendered from ScanNet's 3D scenes serving as the model's input. The method outperforms the DenseCap network on several metrics, including BLEU, CIDEr, and METEOR. Ablation experiments confirm the effectiveness of the RGBD module, and the method's reliability is validated in the AI2-THOR embodied intelligence experimental environment.
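
The abstract outlines a pipeline of RGB-Depth feature fusion, object detection, and a Top-Down Attention LSTM decoder. The sketch below is only an illustrative assumption of how such a pipeline can be wired together in PyTorch: the layer sizes, the addition-based fusion, the pooled feature-map "regions" standing in for Faster-RCNN detections, and the vocabulary size are all hypothetical and are not taken from the paper.

```python
# Minimal sketch of an RGB-D captioning pipeline in the spirit of RGBD2Cap.
# All module names, sizes, and the fusion scheme are illustrative assumptions;
# the paper's actual architecture (Faster-RCNN detector, exact fusion module,
# Top-Down Attention LSTM configuration) is not reproduced here.
import torch
import torch.nn as nn


class RGBDFusion(nn.Module):
    """Two small conv branches whose feature maps are fused by element-wise addition."""
    def __init__(self, channels=64):
        super().__init__()
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, rgb, depth):
        # Fuse the two modalities by adding their feature maps.
        return self.rgb_branch(rgb) + self.depth_branch(depth)


class TopDownAttentionLSTM(nn.Module):
    """One decoding step: attend over region features, then predict the next word."""
    def __init__(self, feat_dim=64, embed_dim=64, hidden_dim=128, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)
        self.lang_lstm = nn.LSTMCell(hidden_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, prev_word, states):
        # regions: (B, R, feat_dim) region features (produced by a detector in the paper).
        (h1, c1), (h2, c2) = states
        mean_feat = regions.mean(dim=1)
        x1 = torch.cat([h2, mean_feat, self.embed(prev_word)], dim=1)
        h1, c1 = self.att_lstm(x1, (h1, c1))
        # Soft attention over regions, conditioned on the attention LSTM state.
        h1_tiled = h1.unsqueeze(1).expand(-1, regions.size(1), -1)
        alpha = torch.softmax(self.attn(torch.cat([h1_tiled, regions], dim=2)), dim=1)
        attended = (alpha * regions).sum(dim=1)
        h2, c2 = self.lang_lstm(torch.cat([h1, attended], dim=1), (h2, c2))
        return self.out(h2), ((h1, c1), (h2, c2))


# Toy forward pass: a 4-sample batch of RGB-D inputs and one decoding step.
rgb = torch.randn(4, 3, 128, 128)
depth = torch.randn(4, 1, 128, 128)
fused = RGBDFusion()(rgb, depth)                    # (4, 64, 32, 32)
regions = fused.flatten(2).transpose(1, 2)[:, :36]  # stand-in "region" features (4, 36, 64)
decoder = TopDownAttentionLSTM()
states = tuple((torch.zeros(4, 128), torch.zeros(4, 128)) for _ in range(2))
logits, states = decoder(regions, torch.zeros(4, dtype=torch.long), states)
print(logits.shape)  # torch.Size([4, 1000])
```

In an actual system, the decoding step above would be run in a loop, feeding the argmax (or sampled) word back in as `prev_word` until an end-of-sentence token is produced, and the resulting captions would then be scored with BLEU, CIDEr, and METEOR as the abstract describes.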