<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2024.1388174</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Alignment of a 360&#x000B0; image with posed color images for locally accurate texturing of 3D mesh</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" equal-contrib="yes">
<name><surname>Khanal</surname> <given-names>Bishwash</given-names></name>
<xref ref-type="corresp" rid="c002"><sup>&#x0002A;</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2662935/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" equal-contrib="yes">
<name><surname>Om</surname> <given-names>Madhav</given-names></name>
<xref ref-type="author-notes" rid="fn001"><sup>&#x02020;</sup></xref>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author" corresp="yes" equal-contrib="yes">
<name><surname>Rijal</surname> <given-names>Sanjay</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<xref ref-type="author-notes" rid="fn001"><sup>&#x02020;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2660606/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Ojha</surname> <given-names>Vaghawan Prasad</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/2660859/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/project-administration/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/supervision/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff><institution>E.K. Solutions Pvt. Ltd.</institution>, <addr-line>Lalitpur</addr-line>, <country>Nepal</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Jon Sporring, University of Copenhagen, Denmark</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Maria Paula Queluz, University of Lisbon, Portugal</p>
<p>Dieter Fritsch, University of Stuttgart, Germany</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Sanjay Rijal <email>sanjay.rijal&#x00040;ekbana.info</email></corresp>
<corresp id="c002">Bishwash Khanal <email>bishwash.khanal&#x00040;ekbana.info</email></corresp>
<fn fn-type="equal" id="fn001"><p>&#x02020;These authors have contributed equally to this work</p></fn></author-notes>
<pub-date pub-type="epub">
<day>19</day>
<month>09</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>6</volume>
<elocation-id>1388174</elocation-id>
<history>
<date date-type="received">
<day>19</day>
<month>02</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>27</day>
<month>08</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2024 Khanal, Om, Rijal and Ojha.</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Khanal, Om, Rijal and Ojha</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>With the popularity of 3D content like virtual tours, the challenges of 3D data registration have become increasingly significant. The registration of heterogeneous data obtained from 2D and 3D sensors is required to create photo-realistic 3D models. However, the alignment of 2D images with 3D models introduces a significant challenge due to their inherent differences. This article introduces a rigorous mathematical approach to align a 360&#x000B0; image with its corresponding 3D model generated from images with known camera poses followed by texture projection on the model. We use Scale-Invariant Feature Transform (SIFT) feature descriptors enhanced with a homography-based metric to establish correspondences between the faces of a cubemap and the posed images. To achieve optimal alignment, we use a non-linear least squares optimization technique with a custom objective function. Subsequently, the outcomes of the alignment process are evaluated through texturing using a customized raytracing algorithm. The resulting projections are compared against the original textures, with a comprehensive assessment of the alignment&#x00027;s fidelity and precision.</p></abstract>
<kwd-group>
<kwd>cubemap projection</kwd>
<kwd>least squares optimization</kwd>
<kwd>raytracing</kwd>
<kwd>texturing</kwd>
<kwd>360 image alignment</kwd>
</kwd-group>
<counts>
<fig-count count="14"/>
<table-count count="3"/>
<equation-count count="16"/>
<ref-count count="51"/>
<page-count count="17"/>
<word-count count="8039"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Computer Vision</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1 Introduction</title>
<p>Recent advances in data capture technologies have enabled 3D sensors to capture indoor environments and create real-world 3D models. Combining 3D sensors such as light detection and ranging (LiDAR) and time-of-flight (ToF) with 2D image-capturing sensors has become instrumental in creating photorealistic 3D models. This integration of multi-modal data has become pervasive across diverse domains such as urban scenes (Mastin et al., <xref ref-type="bibr" rid="B27">2009</xref>; Mishra, <xref ref-type="bibr" rid="B31">2012</xref>), medical imaging (Markelj et al., <xref ref-type="bibr" rid="B25">2010</xref>), autonomous driving (Wang et al., <xref ref-type="bibr" rid="B49">2021</xref>), emergency evacuation (Sansoni et al., <xref ref-type="bibr" rid="B41">2009</xref>), and post-event (natural hazard) building assessments (Liu et al., <xref ref-type="bibr" rid="B23">2020</xref>). Such registration is particularly relevant in indoor environments, where it is often achieved from several images (Stamos, <xref ref-type="bibr" rid="B46">2010</xref>) using structure-from-motion (SfM) and manually specified correspondences with similarity transformations for pose estimation. Indoor environment modeling has also opened doors for industrial applications such as virtual tours (Metareal Inc., <xref ref-type="bibr" rid="B28">2023</xref>; Chang et al., <xref ref-type="bibr" rid="B9">2017</xref>) and indoor localization (Arth et al., <xref ref-type="bibr" rid="B6">2009</xref>; Sattler et al., <xref ref-type="bibr" rid="B42">2011</xref>).</p>
<p>In addition to LiDAR and ToF sensors, photogrammetry has emerged as a powerful technique for 3D reconstruction. By combining SfM with multi-view stereo (MVS) (Nebel et al., <xref ref-type="bibr" rid="B34">2020</xref>), photogrammetry can generate detailed 3D models from 2D images. More modern techniques such as neural radiance fields (NeRF) (Mildenhall et al., <xref ref-type="bibr" rid="B30">2020</xref>) and Gaussian Splats (Kerbl et al., <xref ref-type="bibr" rid="B19">2023</xref>) have further advanced the field by enabling high-quality 3D reconstruction from sparse views while addressing complex lighting conditions.</p>
<p>3D models are generally textured using perspective or 360&#x000B0; images after the capture session (throughout the article, we refer to the 360&#x000B0; image in equirectangular planar projection as just 360&#x000B0; image). On the other hand, hardware-embedded sensors like Azure Kinect (Microsoft, <xref ref-type="bibr" rid="B29">2023</xref>) and Intel RealSense (Yang et al., <xref ref-type="bibr" rid="B51">2017</xref>) enable the real-time generation of 3D textures. However, challenges arise during the integration of this heterogeneous data resulting in texture misprojection. This misprojection can occur due to misaligned projection of the posed images onto the 3D model, errors in optimization, and sensor inaccuracies. The inherent sensor inaccuracies and the restricted field of view (FOV) often result in holes and missing textures in the 3D model. Moreover, projecting multiple images onto a 3D model may also introduce blending challenges.</p>
<p>Our proposed method addresses these issues by aligning a 360&#x000B0; image with respect to a 3D mesh using its posed images and their associated features. Our approach uses least squares optimization (LSO) of an objective function defined as the projection of features from posed images on the feature plane of the 360&#x000B0; image. This method not only mitigates texture misprojection but also enhances the overall texture quality by leveraging the comprehensive scene information captured in 360&#x000B0; images.</p>
</sec>
<sec id="s2">
<title>2 Related works</title>
<p>Various attempts have been made to align 3D models with posed images. Local alignment methods such as iterative closest point (ICP) registration (Delamarre and Faugeras, <xref ref-type="bibr" rid="B10">1999</xref>) rely on a reliable initialization and have limited applicability when the relative pose is unknown. Russell et al. (<xref ref-type="bibr" rid="B40">2011</xref>) present a combination of global image structure tensor (GIST) descriptors with view-synthesis/retrieval for coarse alignment, followed by fine alignment with view-dependent contour matching. However, it has low alignment precision for images with shading and unreliable features.</p>
<p>Textures generated from RGB-D reconstruction methods are generally prone to artifacts such as blurring, ghosting, and texture bleeding. Whelan et al. (<xref ref-type="bibr" rid="B50">2015</xref>) and Nie&#x000DF;ner et al. (<xref ref-type="bibr" rid="B35">2013</xref>) use a truncated signed distance function (TSDF) volumetric grid with a running weighted average of multiple RGB images. For each triangular face of the 3D mesh, the vertex colors are determined by the TSDF volumetric grid. Lempitsky and Ivanov (<xref ref-type="bibr" rid="B22">2007</xref>) and Allene et al. (<xref ref-type="bibr" rid="B4">2008</xref>) formulate the selection of the optimal image for texturing as an energy minimization problem over a pairwise Markov random field. On the other hand, Buehler et al. (<xref ref-type="bibr" rid="B7">2001</xref>) and Alj et al. (<xref ref-type="bibr" rid="B2">2012a</xref>) specify view-dependent texture mapping, where the best texture for each mesh triangle is selected based on the minimum angle between the triangle normal and the camera directions, leveraging rendering methods based on unstructured lumigraph (Gortler et al., <xref ref-type="bibr" rid="B16">1996</xref>) and photoconsistency (Alj et al., <xref ref-type="bibr" rid="B3">2012b</xref>), respectively. However, visual artifacts and distortions still persist in such methods. Waechter et al. (<xref ref-type="bibr" rid="B48">2014</xref>) use global color adjustment (Lempitsky and Ivanov, <xref ref-type="bibr" rid="B22">2007</xref>) to mitigate visual artifacts due to view projection. Nevertheless, texture bleeding and multi-band blending remain challenges for such methods, which can be handled by local texture warping and high-quality non-rigid texture mapping with a global optimization (Fu et al., <xref ref-type="bibr" rid="B14">2018</xref>). However, such approaches suffer from boundary texture deformations and local texture distortions when geometric errors are large.</p>
<p>Work on image alignment and texturing has also used either direct matching of features (2D-3D matching) (Sattler et al., <xref ref-type="bibr" rid="B42">2011</xref>) or matching through an intermediate image (2D-2D-3D matching) (Sattler et al., <xref ref-type="bibr" rid="B43">2012</xref>). Modern 3D cameras like Azure Kinect (Microsoft, <xref ref-type="bibr" rid="B29">2023</xref>) and Intel RealSense (Yang et al., <xref ref-type="bibr" rid="B51">2017</xref>) use calibration patterns to align 3D data with 2D images (Geiger et al., <xref ref-type="bibr" rid="B15">2012</xref>), but they are not designed to establish relationships between several images. Instead of using multiple images, Sufiyan et al. (<xref ref-type="bibr" rid="B47">2023</xref>) use 360&#x000B0; panoramic images for end-to-end image-based localization on both indoor and aerial scenes using deep learning-based approaches. Park et al. (<xref ref-type="bibr" rid="B37">2021</xref>) present image-model registration on large-scale urban scenes by extracting semantic information from street-view images. However, such stitched posed images, or even multiple projections of those images, require several treatments such as multi-band blending and exposure compensation (Fu et al., <xref ref-type="bibr" rid="B14">2018</xref>).</p>
<p>Simultaneous localization and mapping (SLAM) algorithms such as RTAB-Map SLAM (Labb&#x000E9; et al., <xref ref-type="bibr" rid="B21">2018</xref>), ORB-SLAM (Mur-Artal et al., <xref ref-type="bibr" rid="B32">2015</xref>), and LSD-SLAM (Engel et al., <xref ref-type="bibr" rid="B12">2014</xref>) give precise poses, but they often fail to achieve high-quality textures. Structure-from-motion (SfM) approaches such as COLMAP (Sch&#x000F6;nberger and Frahm, <xref ref-type="bibr" rid="B44">2016</xref>; Sch&#x000F6;nberger et al., <xref ref-type="bibr" rid="B45">2016</xref>) and OpenMVS (Cernea, <xref ref-type="bibr" rid="B8">2020</xref>) are widely used for 3D reconstruction from multiple images. These methods involve feature extraction, matching, and triangulation to create 3D models. While effective, they often require a large number of overlapping images and computational resources.</p>
<p>Matterport (Chang et al., <xref ref-type="bibr" rid="B9">2017</xref>) uses three stationary structured cameras in a hardware enclosure to create a 3D mesh and 360&#x000B0; images. Matterport3D is particularly relevant as it uses posed images and 360&#x000B0; images for texturing 3D meshes, similar to our approach, making it a suitable baseline for comparison. Similarly, Metareal (Metareal Inc., <xref ref-type="bibr" rid="B28">2023</xref>) utilizes a 360&#x000B0; camera to capture data and creates a 3D model through manual detection of line segments within the 360&#x000B0; images. Such advancements not only address challenges in data registration but also open avenues for innovative applications across various domains.</p>
<p>We propose a pipeline that uses a single 360&#x000B0; image for texturing that better encompasses and represents the scene information. Modern 360&#x000B0; cameras like RICOH THETA V (Aghayari et al., <xref ref-type="bibr" rid="B1">2017</xref>) are capable of generating HDR images, which further improve the texture quality. Addressing the limitations of SLAM algorithms, we use the relationship between SIFT features (Lowe, <xref ref-type="bibr" rid="B24">2004</xref>) in posed images and the cubemap face of a 360&#x000B0; image to minimize a custom objective function that better represents the 3D nature of the data than usual reprojection errors. Rather than computing 3D feature descriptors as suggested in Panek et al. (<xref ref-type="bibr" rid="B36">2023</xref>), we use camera extrinsics for projecting 2D features on a 3D plane. For 360&#x000B0; images, camera intrinsics are not required. The textures from the 360&#x000B0; image are finally projected using a custom raytracing pipeline.</p>
<p>In the following sections, we provide an overview of our method (Section 3) followed by a rigorous mathematical and geometrical interpretation of our approach focused on alignment estimation (Section 4), feature projection (Section 5), least squares optimization (LSO) (Section 6), and texturing (Section 7). Section 8 presents the evaluation metrics, the performance of our approach, and a comparison with RTAB-Map, COLMAP, and Matterport3D as baselines. We chose RTAB-Map because we use its posed images for alignment, making it a suitable comparison for texture quality. COLMAP, on the other hand, is chosen as a baseline because our goal is similar to that of SfM algorithms. Throughout the article, we refer to 2D perspective RGB images with known camera poses as &#x0201C;posed images&#x0201D;.</p>
</sec>
<sec id="s3">
<title>3 Method overview</title>
<p>The overall pipeline is illustrated in <xref ref-type="fig" rid="F1">Figure 1</xref>. Given a 360&#x000B0; image <inline-formula><mml:math id="M7"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula> and a set of posed images <inline-formula><mml:math id="M8"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula>, our objective is to achieve an accurate alignment between <inline-formula><mml:math id="M9"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula> and the 3D mesh generated from <inline-formula><mml:math id="M10"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula>. We obtain the pose <italic>Q</italic><sub><italic>i</italic></sub> of each posed image and a dense 3D mesh from RTAB-Map, which uses iterative sparse bundle adjustment and loop closure algorithms (Labbe and Michaud, <xref ref-type="bibr" rid="B20">2014</xref>) for optimal image registration.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Overall system block diagram of our approach.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0001.tif"/>
</fig>
<p>Using the front face from the cubemap projection of <inline-formula><mml:math id="M11"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula>, we estimate the best-matching posed image and then preprocess the features, which involves the extraction of SIFT features (Lowe, <xref ref-type="bibr" rid="B24">2004</xref>), image filtering, and 2D to 3D feature projection. Applying LSO to the processed features provides an optimal transformation (<italic>T</italic><sub><italic>opt</italic></sub>) for <inline-formula><mml:math id="M12"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula>, which is subsequently used to align <inline-formula><mml:math id="M13"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula> with the 3D mesh for texturing. In summary, the process involves: (i) obtaining an initial transformation <italic>T</italic><sub><italic>init</italic></sub> for a given <inline-formula><mml:math id="M14"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula>, (ii) projecting 2D SIFT features from <inline-formula><mml:math id="M15"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M16"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula> into 3D space, (iii) using these features and <italic>T</italic><sub><italic>init</italic></sub> to obtain <italic>T</italic><sub><italic>opt</italic></sub> by optimizing the point-line distance with LSO, and (iv) projecting <inline-formula><mml:math id="M17"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula> onto the 3D mesh using a raytracing algorithm.</p>
</sec>
<sec id="s4">
<title>4 Initial estimation</title>
<p>Since we use least squares optimization to obtain the optimal transformation <italic>T</italic><sub><italic>opt</italic></sub> of <inline-formula><mml:math id="M18"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula>, we need an initial estimate <italic>T</italic><sub><italic>init</italic></sub> of that transformation. Rather than initializing it randomly, we estimate <italic>T</italic><sub><italic>init</italic></sub> from the feature matches between a posed image <inline-formula><mml:math id="M19"><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula> and the front face <italic>S</italic><sub><italic>f</italic></sub> of <inline-formula><mml:math id="M20"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula>. Out of the six faces of the cubemap, we take the front face for the initial estimation, although the choice of face is arbitrary: it only affects the estimation of <italic>T</italic><sub><italic>init</italic></sub>, as we later use features from all the faces for optimization. We search for the image <inline-formula><mml:math id="M21"><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula> that best matches <italic>S</italic><sub><italic>f</italic></sub>, as depicted in <xref ref-type="fig" rid="F2">Figure 2</xref>. We use the SIFT feature descriptor for its scale and rotation invariance. We compare the SIFT features <italic>F</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) from <italic>S</italic><sub><italic>f</italic></sub> with the features <italic>F</italic><sub><italic>i</italic></sub>(<italic>C</italic><sub><italic>i</italic></sub>) from every <italic>C</italic><sub><italic>i</italic></sub>.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Cubemap projection of <inline-formula><mml:math id="M1"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula> <bold>(A)</bold>, its <italic>S</italic><sub><italic>f</italic></sub> <bold>(B)</bold>, and the corresponding <italic>C</italic><sub><italic>bm</italic></sub> <bold>(C)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0002.tif"/>
</fig>
<p>Instead of the conventional approach of using only the number of feature matches to determine the best match of <italic>S</italic><sub><italic>f</italic></sub> among the <italic>C</italic><sub><italic>i</italic></sub>, we combine the match count with a homography-based term. Let <italic>a</italic> be the number of &#x0201C;good&#x0201D; feature matches between <italic>F</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) and <italic>F</italic><sub><italic>i</italic></sub>(<italic>C</italic><sub><italic>i</italic></sub>), where a &#x0201C;good&#x0201D; match is determined using <xref ref-type="disp-formula" rid="E1">Equation 1</xref>. Let <italic>b</italic> be the Frobenius norm (Horn and Johnson, <xref ref-type="bibr" rid="B17">1990</xref>) of the homography matrix between the keypoints from <italic>S</italic><sub><italic>f</italic></sub> and the corresponding matched keypoints from <italic>C</italic><sub><italic>i</italic></sub>. The image (<italic>C</italic><sub><italic>bm</italic></sub> &#x0003D; <italic>C</italic><sub><italic>i</italic></sub>) that maximizes <italic>a</italic><sup><italic>m</italic></sup><italic>b</italic><sup>&#x02212;<italic>n</italic></sup> is chosen as the best-matching image in <inline-formula><mml:math id="M22"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula>. Here, <italic>m</italic> and <italic>n</italic> are constants determined empirically.</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:mo>&#x02264;</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>75</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>F</italic><sub><italic>i</italic></sub>(<italic>C</italic><sub><italic>i</italic></sub>) and <italic>F</italic><sub><italic>j</italic></sub>(<italic>C</italic><sub><italic>i</italic></sub>) are the features in <italic>C</italic><sub><italic>i</italic></sub> representing the first and the second best SIFT feature matches between <italic>S</italic><sub><italic>f</italic></sub> and <italic>C</italic><sub><italic>i</italic></sub>, respectively.</p>
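<p>The ratio test of <xref ref-type="disp-formula" rid="E1">Equation 1</xref> can be sketched as follows. This is a minimal illustration rather than the implementation used in our experiments: it assumes that descriptors are plain numeric tuples compared with the Euclidean distance, and the function name <monospace>good_matches</monospace> and the toy descriptors are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def good_matches(desc_s, desc_c, ratio=0.75):
    """Return (index in desc_s, index in desc_c) pairs that pass the ratio test."""
    matches = []
    for i, d in enumerate(desc_s):
        # Distances from this S_f descriptor to every C_i descriptor, nearest first.
        ranked = sorted((math.dist(d, c), j) for j, c in enumerate(desc_c))
        if len(ranked) < 2:
            continue  # the test needs a first and a second best candidate
        (best, j), (second, _) = ranked[0], ranked[1]
        # best / second <= ratio, written without a division so second == 0 is safe.
        if best <= ratio * second:
            matches.append((i, j))
    return matches


# Toy 2D "descriptors" (assumption: real SIFT descriptors have 128 dimensions).
desc_s = [(0.0, 0.0), (5.0, 5.0)]
desc_c = [(0.1, 0.0), (4.0, 4.0), (6.0, 6.0)]
matches = good_matches(desc_s, desc_c)
```
]]></preformat>
<p>In this toy example, the first descriptor has one clearly closest candidate and is kept, whereas the second is equally close to two candidates and is rejected as ambiguous. The number of retained pairs corresponds to <italic>a</italic>.</p>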
<p>By incorporating the Frobenius norm, our method balances the number of good feature matches and the geometric transformation between <italic>S</italic><sub><italic>f</italic></sub> and <italic>C</italic><sub><italic>i</italic></sub>. In general, given a homography matrix between <italic>S</italic><sub><italic>f</italic></sub> and <italic>C</italic><sub><italic>i</italic></sub>, the Frobenius norm quantifies the amount of geometric transformation including rotation, translation, and scaling, required to map points from one image to the corresponding points in the other image.</p>
<p>We specifically select <italic>C</italic><sub><italic>bm</italic></sub> by maximizing the metric <italic>a</italic><sup><italic>m</italic></sup><italic>b</italic><sup>&#x02212;<italic>n</italic></sup>, ensuring that the selected image not only has a high number of feature matches but also requires minimal transformation, resulting in a more accurate and reliable alignment. The comparison between the alignment results from the metrics <monospace>max</monospace>(<italic>a</italic>) and <monospace>max</monospace>(<italic>a</italic><sup><italic>m</italic></sup><italic>b</italic><sup>&#x02212;<italic>n</italic></sup>) is described in Section 8.1.1.</p>
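<p>The selection rule can be illustrated with a short sketch. It assumes that the count of good matches and a 3 &#x000D7; 3 homography are already available for each candidate image, and that each homography is normalized so that its last entry is 1 (the Frobenius norm depends on this scaling). The function names and the two candidate images are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def frobenius_norm(H):
    """Square root of the sum of squared entries of a matrix given as nested lists."""
    return math.sqrt(sum(x * x for row in H for x in row))


def match_score(a, H, m=2, n=2):
    """Selection metric a^m * b^(-n), with b the Frobenius norm of H."""
    b = frobenius_norm(H)
    return (a ** m) * (b ** -n)


def best_match(candidates, m=2, n=2):
    """candidates: list of (name, good-match count a, 3x3 homography H)."""
    return max(candidates, key=lambda c: match_score(c[1], c[2], m, n))[0]


# Hypothetical candidates: C_1 has more matches but needs a large image shift.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 40.0], [0.0, 1.0, 30.0], [0.0, 0.0, 1.0]]
candidates = [("C_1", 120, shifted), ("C_2", 90, identity)]
chosen = best_match(candidates)
```
]]></preformat>
<p>Here the match count alone would favor the first candidate, but the combined metric selects the second one because its homography is closer to the identity.</p>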
<p>For our experiments, we find <italic>m</italic> &#x0003D; <italic>n</italic> &#x0003D; 2 to be an appropriate choice. This choice yields a quadratic relationship that balances the influence of the number of good matches and the Frobenius norm while keeping the calculation simple. The initial estimate <italic>T</italic><sub><italic>init</italic></sub> is the inverse of the pose <italic>Q</italic><sub><italic>bm</italic></sub> of <italic>C</italic><sub><italic>bm</italic></sub>, which accounts for the difference between the world and camera coordinate systems.</p>
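<p>As an illustration of the pose inversion, the sketch below assumes that a pose is a 4 &#x000D7; 4 rigid transformation with a rotation block <italic>R</italic> and a translation <italic>t</italic>, so that its inverse has rotation <italic>R</italic><sup><italic>T</italic></sup> and translation &#x02212;<italic>R</italic><sup><italic>T</italic></sup><italic>t</italic>. The helper <monospace>invert_pose</monospace> and the example pose are hypothetical.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def invert_pose(Q):
    """Invert a 4x4 rigid transform [[R, t], [0, 1]] given as nested lists."""
    R = [row[:3] for row in Q[:3]]
    t = [row[3] for row in Q[:3]]
    # The inverse rotation is the transpose of R.
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    # The inverse translation is -R^T t.
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


# Example pose (assumed values): 90 degree rotation about z, translation (1, 2, 3).
Q_bm = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
T_init = invert_pose(Q_bm)
```
]]></preformat>
<p>Using the closed form avoids a general matrix inversion, but it is only valid when the upper-left block is a pure rotation.</p>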
</sec>
<sec id="s5">
<title>5 Feature projection</title>
<p>The features <italic>F</italic><sub><italic>i</italic></sub>(<italic>C</italic><sub><italic>i</italic></sub>) are 2D, whereas <italic>T</italic><sub><italic>init</italic></sub> is a transformation in 3D space, so we project the 2D features into 3D space. Moreover, the features <italic>F</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>), which are originally expressed in the spherical coordinate system, must be transformed to the Cartesian coordinate system.</p>
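<p>A spherical-to-Cartesian conversion of this kind can be sketched as follows. The axis convention (<italic>y</italic> up, <italic>z</italic> forward), the mapping from equirectangular pixels to longitude and latitude, and the function names are assumptions made for illustration and may differ from the convention of a given 360&#x000B0; camera.</p>
<preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import math


def equirect_pixel_to_angles(u, v, width, height):
    """Map an equirectangular pixel to (longitude, latitude) in radians."""
    lon = (u / width - 0.5) * 2.0 * math.pi  # -pi at the left edge, +pi at the right
    lat = (0.5 - v / height) * math.pi       # +pi/2 at the top, -pi/2 at the bottom
    return lon, lat


def spherical_to_cartesian(lon, lat, radius=1.0):
    """Convert (longitude, latitude) to Cartesian x, y, z (assumed: y up, z forward)."""
    x = radius * math.cos(lat) * math.sin(lon)
    y = radius * math.sin(lat)
    z = radius * math.cos(lat) * math.cos(lon)
    return x, y, z


# The centre pixel of a 2048 x 1024 panorama looks straight ahead along +z.
lon, lat = equirect_pixel_to_angles(1024, 512, 2048, 1024)
direction = spherical_to_cartesian(lon, lat)
```
]]></preformat>
<p>With a unit radius, every pixel maps to a point on the unit sphere, that is, to a viewing direction from the center of the 360&#x000B0; image.</p>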
<sec>
<title>5.1 2D to 3D projection</title>
<p>We use the camera intrinsic matrix <italic>K</italic> and the extrinsic matrix [<italic>R</italic>|<italic>t</italic>], obtained as the inverse of the corresponding pose matrix <italic>Q</italic><sub><italic>i</italic></sub>, to map the 2D pixel coordinates of each SIFT feature <italic>F</italic><sub><italic>i</italic></sub>(<italic>C</italic><sub><italic>i</italic></sub>) to a 3D point <italic>P</italic><sub><italic>i</italic></sub>, as given by <xref ref-type="disp-formula" rid="E2">Equation 2</xref> (Imatest, <xref ref-type="bibr" rid="B18">2023</xref>):</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mtext>&#x02003;</mml:mtext><mml:mi>Y</mml:mi><mml:mtext>&#x02003;</mml:mtext><mml:mi>Z</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msup><mml:mrow><mml:mi>K</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>u</mml:mi><mml:mtext>&#x02003;</mml:mtext><mml:mi>v</mml:mi><mml:mtext>&#x02003;</mml:mtext><mml:mn>1</mml:mn></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mi>t</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>R</italic> is the rotation matrix, <italic>t</italic> is the translation vector, and (<italic>u, v</italic>) are pixel coordinates in (<italic>x, y</italic>) image axes along the width and height of the image.</p>
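<p>A minimal Python sketch of <xref ref-type="disp-formula" rid="E2">Equation 2</xref> is given below for illustration. It assumes a zero-skew pinhole intrinsic matrix with focal lengths <monospace>fx</monospace>, <monospace>fy</monospace> and principal point (<monospace>cx</monospace>, <monospace>cy</monospace>); the function name and the sample camera are hypothetical and are not taken from the original implementation.</p>
<preformat>
```python
def backproject_pixel(u, v, fx, fy, cx, cy, R, t):
    """Map pixel (u, v) to a 3D point following Equation 2.

    Assumes a zero-skew pinhole intrinsic matrix
    K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], a 3x3 rotation R given as
    nested lists, and a translation t of length 3.
    """
    # K^-1 [u, v, 1]^T in closed form for the zero-skew K above.
    ray = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # R^T K^-1 [u v 1]^T - R^T t equals R^T (ray - t).
    diff = [ray[k] - t[k] for k in range(3)]
    # Multiply by R^T: output component j sums R[k][j] * diff[k] over k.
    return [sum(R[k][j] * diff[k] for k in range(3)) for j in range(3)]


# Hypothetical camera: 90 degree rotation about Z, shifted along Y.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 1.0, 0.0]
point = backproject_pixel(820.0, 240.0, 500.0, 500.0, 320.0, 240.0, R, t)
```
</preformat>
<p>Because the homogeneous pixel coordinate has a last component of 1, no depth value enters the computation in this sketch.</p>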
<p>This method of 2D to 3D projection provides an equivalent representation of the features for our algorithm and eliminates the dependency on depth maps, so we choose it over the use of depth information. It is also efficient in terms of the runtime of the algorithm. Even among the matched feature points <italic>P</italic><sub><italic>i</italic></sub>(<italic>C</italic><sub><italic>i</italic></sub>), not all are spatially close to each other, as they may be distributed in various directions in 3D space. Such outliers can degrade the performance of LSO. Hence, to identify the clustered features <inline-formula><mml:math id="M25"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02282;</mml:mo><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, we use a naive outlier rejection algorithm that excludes all <italic>P</italic><sub><italic>i</italic></sub>(<italic>C</italic><sub><italic>i</italic></sub>) lying outside a sphere of radius 0.5 around <italic>P</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>), as shown in <xref ref-type="disp-formula" rid="E3">Equation 3</xref>.</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M26"><mml:mrow><mml:msub><mml:msup><mml:mi>P</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo>&#x0007B;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>:</mml:mo><mml:mtext>&#x0205F;</mml:mtext><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>S</mml:mi><mml:mi>f</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mo>&#x02264;</mml:mo><mml:mn>0.5</mml:mn><mml:mo>&#x0007D;</mml:mo></mml:mrow></mml:math></disp-formula>
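<p>The rejection rule of <xref ref-type="disp-formula" rid="E3">Equation 3</xref> can be sketched as follows. This is an illustrative Python fragment under the assumption that matched points are stored as paired lists of 3D tuples; the function name and the sample points are hypothetical.</p>
<preformat>
```python
import math


def reject_outliers(sphere_points, camera_points, radius=0.5):
    """Keep matched pairs whose 3D points lie within `radius` of each other."""
    kept = []
    for p_s, p_c in zip(sphere_points, camera_points):
        if math.dist(p_s, p_c) > radius:
            continue  # outside the sphere around the 360-image point
        kept.append((p_s, p_c))
    return kept


# Hypothetical matched pairs: the second pair is 0.9 apart and is dropped.
sphere_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
camera_points = [(0.0, 0.3, 1.0), (1.0, 0.9, 0.0)]
inliers = reject_outliers(sphere_points, camera_points)
```
</preformat>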
</sec>
<sec>
<title>5.2 Spherical to Cartesian projection</title>
<p>Let (&#x003B8;, &#x003D5;) be the unique longitude and latitude of a feature <italic>F</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>). We project the 2D features <italic>F</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) onto a unit sphere using the spherical to Cartesian coordinate conversion given in <xref ref-type="disp-formula" rid="E4">Equation 4</xref>.</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M29"><mml:mtable class="eqnarray" columnalign="right"><mml:mtr><mml:mtd><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>9</mml:mn><mml:msup><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x000B0;</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>9</mml:mn><mml:msup><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x000B0;</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>z</mml:mi><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>9</mml:mn><mml:msup><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x000B0;</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:mi>&#x003D5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In the world coordinate system (WCS), coordinates are typically projected along the &#x0002B;<italic>X</italic>-axis, whereas in the camera&#x00027;s coordinate system, they are projected along the &#x0002B;<italic>Z</italic>-axis. To reconcile these conventions, we apply a correction using a rotation matrix <italic>R</italic><sub><italic>corr</italic></sub>. This matrix rotates the point <inline-formula><mml:math id="M31"><mml:msub><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mtext>&#x02003;</mml:mtext><mml:mi>y</mml:mi><mml:mtext>&#x02003;</mml:mtext><mml:mi>z</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> by 90&#x000B0; about both the <italic>X</italic> and <italic>Z</italic> axes. This correction ensures that the spherical points from <italic>S</italic><sub><italic>f</italic></sub> are properly aligned with <inline-formula><mml:math id="M32"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> when transformed using <italic>T</italic><sub><italic>init</italic></sub>.</p>
<disp-formula id="E5"><mml:math id="M33"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
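<p>For illustration, <xref ref-type="disp-formula" rid="E4">Equation 4</xref> and the correction matrix <italic>R</italic><sub><italic>corr</italic></sub> can be sketched in Python as below. The sketch assumes that angles are given in radians and that <italic>R</italic><sub><italic>corr</italic></sub> is applied by left-multiplying the Cartesian point; the function names are hypothetical.</p>
<preformat>
```python
import math

# R_corr as printed in the article.
R_CORR = [[0, -1, 0], [0, 0, -1], [1, 0, 0]]


def sphere_to_cartesian(theta, phi):
    """Equation 4 with longitude theta and latitude phi, both in radians."""
    polar = math.pi / 2 - phi  # the 90 degrees minus phi term
    return [
        math.sin(polar) * math.cos(-theta),
        math.sin(polar) * math.sin(-theta),
        math.cos(polar),
    ]


def apply_correction(point):
    """Left-multiply a 3D point by R_corr (assumed order of application)."""
    return [sum(R_CORR[r][c] * point[c] for c in range(3)) for r in range(3)]


forward = sphere_to_cartesian(0.0, 0.0)  # approximately [1, 0, 0]
corrected = apply_correction(forward)    # approximately [0, 0, 1]
```
</preformat>
<p>Under these assumptions, the direction at zero longitude and latitude lies along &#x0002B;<italic>X</italic> before the correction and along &#x0002B;<italic>Z</italic> after it, which agrees with the two conventions described above.</p>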
<p>These feature projection steps are repeated for the features on all cubemap faces except the top and bottom, which generally contain ceilings and floors with fewer features.</p>
</sec>
</sec>
<sec id="s6">
<title>6 Least squares optimization</title>
<p>Since a single feature <italic>F</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) can have matches across different <italic>C</italic><sub><italic>i</italic></sub>, we augment <italic>P</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) to match <inline-formula><mml:math id="M34"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and then decompose <italic>T</italic><sub><italic>init</italic></sub> into an initial state matrix &#x003C1;<sub>0</sub> &#x0003D; (<italic>t</italic><sub><italic>x</italic></sub>, <italic>t</italic><sub><italic>y</italic></sub>, <italic>t</italic><sub><italic>z</italic></sub>, &#x003B1;, &#x003B2;, &#x003B3;). The state matrix &#x003C1; represents the six degrees of freedom (dof) of the transformation matrix <italic>T</italic>. <xref ref-type="disp-formula" rid="E6">Equations 5</xref> and <xref ref-type="disp-formula" rid="E7">6</xref> give the relationship between <italic>T</italic> and &#x003C1;.</p>
<disp-formula id="E6"><label>(5)</label><mml:math id="M35"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>T</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>z</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>s</italic> and <italic>c</italic> represent the <italic>sine</italic> and <italic>cosine</italic> functions, respectively.</p>
<disp-formula id="E7"><label>(6)</label><mml:math id="M36"><mml:mtable class="eqnarray" columnalign="right"><mml:mtr><mml:mtd><mml:mi>&#x003B1;</mml:mi><mml:mo>=</mml:mo><mml:mo class="qopname">arctan</mml:mo><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>33</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:mi>&#x003B2;</mml:mi><mml:mo>=</mml:mo><mml:mo class="qopname">arctan</mml:mo><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>31</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>32</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>33</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>&#x003B3;</mml:mi><mml:mo>=</mml:mo><mml:mo class="qopname">arctan</mml:mo><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>21</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mn>11</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo 
stretchy="true">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
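<p>The relationship in <xref ref-type="disp-formula" rid="E6">Equations 5</xref> and <xref ref-type="disp-formula" rid="E7">6</xref> can be checked with a short round-trip sketch in Python. The function names and the sample state are hypothetical. As a deliberate substitution, the sketch uses the two-argument <monospace>atan2</monospace> in place of the arctangent of a ratio, which gives the same angles when &#x003B2; lies strictly between &#x02212;&#x003C0;/2 and &#x003C0;/2 and avoids division by zero.</p>
<preformat>
```python
import math


def state_to_matrix(rho):
    """Build the 4x4 matrix T of Equation 5 from rho = (tx, ty, tz, a, b, g)."""
    tx, ty, tz, alpha, beta, gamma = rho
    sa, ca = math.sin(alpha), math.cos(alpha)
    sb, cb = math.sin(beta), math.cos(beta)
    sg, cg = math.sin(gamma), math.cos(gamma)
    return [
        [cb * cg, sa * sb * cg - ca * sg, ca * sb * cg + sa * sg, tx],
        [cb * sg, sa * sb * sg + ca * cg, ca * sb * sg - sa * cg, ty],
        [-sb, sa * cb, ca * cb, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matrix_to_state(T):
    """Recover rho from T as in Equation 6, using atan2 for quadrant safety."""
    alpha = math.atan2(T[2][1], T[2][2])                       # T32 / T33
    beta = math.atan2(-T[2][0], math.hypot(T[2][1], T[2][2]))  # -T31 / sqrt(...)
    gamma = math.atan2(T[1][0], T[0][0])                       # T21 / T11
    return (T[0][3], T[1][3], T[2][3], alpha, beta, gamma)


rho_0 = (1.0, -2.0, 0.5, 0.3, -0.4, 1.2)  # hypothetical initial state
recovered = matrix_to_state(state_to_matrix(rho_0))
```
</preformat>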
<p>Since the domain points <italic>P</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) and the observation points <inline-formula><mml:math id="M38"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> are taken from two different projections, namely a spherical surface (360&#x000B0; image) and a plane (posed images), respectively, we implement an objective function <italic>e</italic><sub><italic>i</italic></sub> representing the perpendicular distance <italic>D</italic> from the observation point (<italic>N</italic>) to the line joining [<italic>t</italic><sub><italic>x</italic></sub>, <italic>t</italic><sub><italic>y</italic></sub>, <italic>t</italic><sub><italic>z</italic></sub>] and the predicted point <inline-formula><mml:math id="M39"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (Nagwa, <xref ref-type="bibr" rid="B33">2023</xref>), as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. We also refer to this objective function as an error function, as it quantifies the error as the distance between <italic>P</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) and the corresponding <inline-formula><mml:math id="M40"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
<disp-formula id="E8"><label>(7)</label><mml:math id="M41"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:mover class="overrightarrow"><mml:mrow><mml:mi>O</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x020D7;</mml:mo></mml:mover><mml:mo>&#x000D7;</mml:mo><mml:mover class="overrightarrow"><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>&#x020D7;</mml:mo></mml:mover><mml:mo>|</mml:mo><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:mover class="overrightarrow"><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>&#x020D7;</mml:mo></mml:mover><mml:mo>|</mml:mo><mml:mo>|</mml:mo></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The red and blue points represent features from the cubemap faces and their corresponding matched features across different posed images, respectively <bold>(A)</bold>. The goal is to minimize the perpendicular distance (<italic>D</italic>) between the line passing through the red points (<inline-formula><mml:math id="M2"><mml:mover accent="true"><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula>) and the blue points (<inline-formula><mml:math id="M3"><mml:mover accent="true"><mml:mrow><mml:mi>O</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula>). <italic>D</italic> is calculated as the component of <inline-formula><mml:math id="M4"><mml:mover accent="true"><mml:mrow><mml:mi>O</mml:mi><mml:mi>N</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula> perpendicular to <inline-formula><mml:math id="M5"><mml:mover accent="true"><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula>, a unit vector along <inline-formula><mml:math id="M6"><mml:mover accent="true"><mml:mrow><mml:mi>O</mml:mi><mml:mover accent="true"><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mo>&#x02192;</mml:mo></mml:mover></mml:math></inline-formula> <bold>(B)</bold>, using <xref ref-type="disp-formula" rid="E8">Equation 7</xref>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0003.tif"/>
</fig>
<p>While the conventional approach in reconstruction tasks is to minimize the reprojection error, we opt to minimize the point-line distance due to distinct advantages. The point-line distance method maintains dimensional consistency between our spherical and Cartesian coordinates, accurately captures the geometric distribution of features, and reduces computational complexity. It consequently leads to a faster convergence.</p>
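<p>The point-line distance of <xref ref-type="disp-formula" rid="E8">Equation 7</xref> can be sketched in Python as follows. This is an illustrative fragment: the function name and the sample points are hypothetical, <monospace>origin</monospace> stands for [<italic>t</italic><sub><italic>x</italic></sub>, <italic>t</italic><sub><italic>y</italic></sub>, <italic>t</italic><sub><italic>z</italic></sub>], and the direction vector is left unnormalized because the division by its norm makes the result independent of its length.</p>
<preformat>
```python
import math


def point_line_distance(origin, predicted, observed):
    """Distance from `observed` (N) to the line through `origin` (O) and `predicted`."""
    b = [predicted[k] - origin[k] for k in range(3)]   # line direction
    on = [observed[k] - origin[k] for k in range(3)]   # vector from O to N
    cross = [
        on[1] * b[2] - on[2] * b[1],
        on[2] * b[0] - on[0] * b[2],
        on[0] * b[1] - on[1] * b[0],
    ]
    return math.hypot(*cross) / math.hypot(*b)


# Line along the x-axis; the observed point sits 4 units away from it.
d = point_line_distance((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```
</preformat>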
<p>We obtain the final (optimal) state matrix &#x003C1;<sub><italic>opt</italic></sub> from <italic>e</italic><sub><italic>i</italic></sub> and its Jacobian <italic>J</italic><sub><italic>i</italic></sub> using the Levenberg&#x02013;Marquardt algorithm (Marquardt, <xref ref-type="bibr" rid="B26">1963</xref>) followed by iterative training of a stochastic gradient descent (SGD) (Rumelhart et al., <xref ref-type="bibr" rid="B39">1986</xref>) algorithm with a small learning rate &#x003BB; as shown in <xref ref-type="disp-formula" rid="E10">Equation 9</xref>:</p>
<disp-formula id="E9"><label>(8)</label><mml:math id="M42"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>&#x02202;</mml:mi><mml:mi>&#x003C1;</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x02211;</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mtext>&#x02003;</mml:mtext><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mo>&#x02211;</mml:mo><mml:msup><mml:mrow><mml:msub><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003C1; &#x0003D; (<italic>t</italic><sub><italic>x</italic></sub>, <italic>t</italic><sub><italic>y</italic></sub>, <italic>t</italic><sub><italic>z</italic></sub>, &#x003B1;, &#x003B2;, &#x003B3;).</p>
<disp-formula id="E10"><label>(9)</label><mml:math id="M43"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003C1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mo>&#x003BB;</mml:mo><mml:mi>I</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mi>b</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>I</italic> is an identity matrix of the same order as that of <italic>H</italic>. Finally, <italic>T</italic><sub><italic>opt</italic></sub> is computed from &#x003C1;<sub><italic>opt</italic></sub> using <xref ref-type="disp-formula" rid="E6">Equation 5</xref>.</p>
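<p>A single damped update of the form in <xref ref-type="disp-formula" rid="E9">Equations 8</xref> and <xref ref-type="disp-formula" rid="E10">9</xref> can be sketched in pure Python as below. This is a generic illustration and not the original implementation: residuals are assumed to be scalars, each Jacobian is assumed to be supplied as a row of partial derivatives, and the helper names <monospace>solve_linear</monospace> and <monospace>lm_step</monospace> as well as the toy two-parameter problem are hypothetical.</p>
<preformat>
```python
def solve_linear(A, y):
    """Solve A x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(y)
    M = [list(A[r]) + [y[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [M[r][c] - factor * M[col][c] for c in range(n + 1)]
    return [M[r][n] / M[r][r] for r in range(n)]


def lm_step(rho, residuals, jacobians, lam):
    """One update rho - (H + lam I)^-1 b, with H and b as in Equation 8."""
    n = len(rho)
    H = [[sum(J[r] * J[c] for J in jacobians) for c in range(n)] for r in range(n)]
    b = [sum(J[r] * e for J, e in zip(jacobians, residuals)) for r in range(n)]
    damped = [[H[r][c] + (lam if r == c else 0.0) for c in range(n)] for r in range(n)]
    delta = solve_linear(damped, b)
    return [rho[k] - delta[k] for k in range(n)]


# Toy linear problem with residuals e_i = J_i . rho - y_i and y = (2, 3, 5).
jacobians = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rho = [0.0, 0.0]
residuals = [-2.0, -3.0, -5.0]  # evaluated at rho = (0, 0)
undamped = lm_step(rho, residuals, jacobians, lam=0.0)
damped = lm_step(rho, residuals, jacobians, lam=1.0)
```
</preformat>
<p>For this linear toy problem the undamped step reaches the least squares solution in one update, while a positive &#x003BB; shortens the step, which is the usual trade-off controlled by the damping term.</p>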
</sec>
<sec id="s7">
<title>7 Texture projection</title>
<p>Given a 3D mesh <inline-formula><mml:math id="M47"><mml:mrow><mml:mi mathvariant="script">M</mml:mi></mml:mrow></mml:math></inline-formula> with vertices <italic>V</italic> &#x0003D; (<italic>x, y, z</italic>), we transform <italic>V</italic> by <italic>T</italic><sub><italic>opt</italic></sub> and then by a texture correction matrix <italic>T</italic><sub><italic>texture</italic></sub>, which ensures that the texture projection follows the Manhattan World assumption (<xref ref-type="disp-formula" rid="E11">Equation 10</xref>). In our case, the texture correction matrix corresponds to Euler angles of &#x02212;<inline-formula><mml:math id="M48"><mml:mfrac><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> about both the <italic>X</italic> and <italic>Y</italic> axes.</p>
<disp-formula id="E11"><label>(10)</label><mml:math id="M49"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>t</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<fig id="F15" position="float">
<label>Algorithm 1</label>
<caption><p>360&#x000B0; image alignment algorithm.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0015.tif"/>
</fig>
<p>We convert <italic>V</italic>&#x02032; to polar coordinates (&#x003B8;, &#x003D5;). Since the 360&#x000B0; images we use are in equirectangular projection, we adjust (&#x003B8;, &#x003D5;) according to the 2:1 aspect ratio of the equirectangular map so that (&#x003B8;, &#x003D5;) represent (<italic>latitude, longitude</italic>). We then convert (<italic>latitude, longitude</italic>) to pixel coordinates (<italic>u, v</italic>) using the standard coordinate transformation (<xref ref-type="disp-formula" rid="E12">Equations 11</xref> and <xref ref-type="disp-formula" rid="E13">12</xref>).</p>
<disp-formula id="E12"><label>(11)</label><mml:math id="M50"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>u</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x003C0;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>u</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>&#x003D5;</mml:mi><mml:mo>+</mml:mo><mml:mn>2</mml:mn><mml:mi>&#x003C0;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>&#x003D5;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>&#x003D5;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E13"><label>(12)</label><mml:math id="M52"><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>u</mml:mi><mml:mo>,</mml:mo><mml:mi>v</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mo stretchy='true'>(</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>g</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>u</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mi>l</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi><mml:mi>u</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi></mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mfrac><mml:mo stretchy='true'>)</mml:mo></mml:mrow></mml:math></disp-formula>
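<p>For illustration, the angle wrapping of <xref ref-type="disp-formula" rid="E12">Equation 11</xref> followed by a conversion to texture coordinates can be sketched in Python as below. The wrapping follows Equation 11 directly. The normalization in <monospace>to_uv</monospace> is an assumption: it reads <xref ref-type="disp-formula" rid="E13">Equation 12</xref> as producing coordinates in the unit square, that is, <italic>u</italic> &#x0003D; 1 &#x02212; <italic>longitude</italic>/(2&#x003C0;) and <italic>v</italic> &#x0003D; 1 &#x02212; <italic>latitude</italic>/&#x003C0;, which is one common convention and may differ from the exact form used in the original implementation. Both function names are hypothetical.</p>
<preformat>
```python
import math


def wrap_angles(theta, phi):
    """Equation 11: shift negative angles into the non-negative range."""
    latitude = theta + math.pi if 0 > theta else theta
    longitude = phi + 2 * math.pi if 0 > phi else phi
    return latitude, longitude


def to_uv(latitude, longitude):
    """Normalized texture coordinates (assumed normalization, see lead-in)."""
    return 1.0 - longitude / (2 * math.pi), 1.0 - latitude / math.pi


latitude, longitude = wrap_angles(-math.pi / 2, -math.pi)
u, v = to_uv(latitude, longitude)
```
</preformat>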
<p>360&#x000B0; images contain the same textures at the extreme north pole and the extreme south pole of a <italic>uv</italic> map. As a result, when textures are assigned to each mesh triangle, vertices located at or beyond the poles are interpreted as lying on the opposite side of the <italic>uv</italic> map, as shown in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Geometric interpretation of the issue beyond or at poles during 360&#x000B0; image texturing. The triangles represent the triangle meshes on which textures are to be projected (blue represents the actual position and red represents the apparent position). Due to the position of vertex B (in blue triangles) near the poles of the <italic>uv</italic> map, the blue triangles are interpreted as red triangles.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0004.tif"/>
</fig>
<p>Due to the periodic nature of 360&#x000B0; image textures, the vertices of blue and red triangles are actually the same. As a result, instead of projecting textures on the blue triangle, the same textures are projected on the red triangle. This introduces rainbow-like artifacts for all such triangles, as shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. This is a common problem with most texturing algorithms. To solve this issue, we exploit the same periodic nature of 360&#x000B0; images: for such cases, we translate the triangles along the <italic>u</italic>-axis<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> by a translation factor of 0.5. This translation repeats the (<italic>u, v</italic>) coordinates in the region beyond the poles, thus correctly identifying the actual position of the vertices of the triangle to be textured.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Rainbow-like artifacts <bold>(A)</bold> due to the texture projection on red triangles instead of blue triangles (from <xref ref-type="fig" rid="F4">Figure 4</xref>) observed on a spherical mesh. The artifacts are removed <bold>(B)</bold> by leveraging the periodic nature of 360&#x000B0; image textures.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0005.tif"/>
</fig>
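<p>The mapping in Equations 11 and 12 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work. Equation 12 is read here as <italic>u</italic> = 1 &#x02212; <italic>longitude</italic>/2&#x003C0; and <italic>v</italic> = 1 &#x02212; <italic>latitude</italic>/&#x003C0;, so that both coordinates are normalized; this reading, the function names, and the seam test (a <italic>u</italic>-extent above 0.5) are assumptions made for illustration. The sketch only flags triangles that wrap around the <italic>uv</italic> map and does not reproduce the translation step.</p>
<preformat><![CDATA[
```python
import math


def wrap_angles(theta, phi):
    """Equation 11: shift negative angles into their non-negative range."""
    latitude = theta + math.pi if theta < 0 else theta
    longitude = phi + 2.0 * math.pi if phi < 0 else phi
    return latitude, longitude


def angles_to_uv(theta, phi):
    """Equation 12 under the assumed normalized reading:
    u = 1 - longitude / (2*pi), v = 1 - latitude / pi."""
    latitude, longitude = wrap_angles(theta, phi)
    u = 1.0 - longitude / (2.0 * math.pi)
    v = 1.0 - latitude / math.pi
    return u, v


def crosses_seam(triangle_uv, max_u_span=0.5):
    """Assumed heuristic: a triangle whose u-extent exceeds max_u_span
    is taken to wrap around the periodic u-axis."""
    u_values = [u for u, _ in triangle_uv]
    return max(u_values) - min(u_values) > max_u_span


# A triangle straddling phi = 0 and a triangle well away from it.
SEAM_TRIANGLE = [angles_to_uv(1.0, 0.1), angles_to_uv(1.0, -0.1), angles_to_uv(1.2, 0.05)]
PLAIN_TRIANGLE = [angles_to_uv(1.0, 1.0), angles_to_uv(1.0, 1.1), angles_to_uv(1.2, 1.2)]

print(crosses_seam(SEAM_TRIANGLE), crosses_seam(PLAIN_TRIANGLE))
```
]]></preformat>
<p>Triangles flagged in this way are the candidates for the periodic correction described above.</p>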
</sec>
<sec sec-type="results" id="s8">
<title>8 Results</title>
<sec>
<title>8.1 Evaluation</title>
<sec>
<title>8.1.1 Best match parameter evaluation</title>
<p>As explained in Section 4, we evaluate the performance of our algorithm using two distinct metrics, <monospace>max</monospace>(<italic>a</italic>) and <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>), to identify <italic>C</italic><sub><italic>bm</italic></sub>s. <xref ref-type="fig" rid="F6">Figure 6</xref> and <xref ref-type="table" rid="T1">Table 1</xref> show that <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>) consistently yields lower loss than <monospace>max</monospace>(<italic>a</italic>) during LSO. Moreover, as shown in <xref ref-type="fig" rid="F7">Figure 7</xref>, the initial transformation estimated using <italic>C</italic><sub><italic>bm</italic></sub> based on <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>) gives a significantly more robust alignment than the estimation based on <monospace>max</monospace>(<italic>a</italic>). On all the scenes, the transformations obtained with the metric <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>) show a better alignment of textures with the 3D mesh than those with the metric <monospace>max</monospace>(<italic>a</italic>). This emphasizes the critical role of the parameter <italic>b</italic> and the <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>) metric, even after obtaining the maximum number of good feature matches between <italic>S</italic><sub><italic>f</italic></sub> and <italic>C</italic><sub><italic>i</italic></sub>.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Performance comparison of metrics <monospace>max</monospace>(<italic>a</italic>) and <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>) on the real-world dataset. The metric <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>) shows better performance than <monospace>max</monospace>(<italic>a</italic>) both in terms of loss and convergence rate.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0006.tif"/>
</fig>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Parameters and metrics for <italic>C</italic><sub><italic>bm</italic></sub> selection as discussed in Section 4 with their LSO losses.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="center"><bold>Image ID</bold></th>
<th valign="top" align="center" colspan="2"><bold>Metric</bold></th>
<th valign="top" align="center"><bold>Loss</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:#919498;color:#ffffff">
<td/>
<td/>
<td valign="top" align="center"><italic>a</italic></td>
<td valign="top" align="center"><italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup></td>
<td/>
</tr>
<tr>
<td valign="top" align="left" rowspan="2"><monospace>office_1</monospace></td>
<td valign="top" align="center">87</td>
<td valign="top" align="center"><bold>342</bold></td>
<td valign="top" align="center">0.679</td>
<td valign="top" align="center">0.017</td>
</tr>
<tr>
<td valign="top" align="center">89</td>
<td valign="top" align="center">333</td>
<td valign="top" align="center"><bold>8.788</bold></td>
<td valign="top" align="center"><bold>0.013</bold></td>
</tr>
<tr>
<td valign="top" align="left" rowspan="2"><monospace>office_2</monospace></td>
<td valign="top" align="center">1</td>
<td valign="top" align="center"><bold>151</bold></td>
<td valign="top" align="center"><bold>11.521</bold></td>
<td valign="top" align="center"><bold>0.013</bold></td>
</tr>
<tr>
<td valign="top" align="center">130</td>
<td valign="top" align="center">121</td>
<td valign="top" align="center">1.369</td>
<td valign="top" align="center">0.015</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="2"><monospace>kitchen</monospace></td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">415</td>
<td valign="top" align="center"><bold>93.789</bold></td>
<td valign="top" align="center"><bold>0.017</bold></td>
</tr>
<tr>
<td valign="top" align="center">7</td>
<td valign="top" align="center"><bold>420</bold></td>
<td valign="top" align="center">3.738</td>
<td valign="top" align="center">0.022</td>
</tr>
<tr>
<td valign="top" align="left" rowspan="2"><monospace>room</monospace></td>
<td valign="top" align="center">70</td>
<td valign="top" align="center">76</td>
<td valign="top" align="center"><bold>0.025</bold></td>
<td valign="top" align="center"><bold>0.013</bold></td>
</tr>
<tr>
<td valign="top" align="center">67</td>
<td valign="top" align="center"><bold>86</bold></td>
<td valign="top" align="center">0.016</td>
<td valign="top" align="center">0.017</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>The numbers in bold mark the extremum of each column within a dataset: the maximum for the metrics and the minimum for the loss.</p>
</table-wrap-foot>
</table-wrap>
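<p>The selection summarized in Table 1 can be replayed with a short Python sketch. The values below are copied from Table 1; the function and variable names are hypothetical, and the candidate set is limited to the two images listed per scene rather than the full set of posed images.</p>
<preformat><![CDATA[
```python
# Rows copied from Table 1: scene -> (image_id, a, a^2 b^-2, LSO loss).
TABLE_1 = {
    "office_1": [(87, 342, 0.679, 0.017), (89, 333, 8.788, 0.013)],
    "office_2": [(1, 151, 11.521, 0.013), (130, 121, 1.369, 0.015)],
    "kitchen": [(4, 415, 93.789, 0.017), (7, 420, 3.738, 0.022)],
    "room": [(70, 76, 0.025, 0.013), (67, 86, 0.016, 0.017)],
}

A_COLUMN = 1
RATIO_COLUMN = 2


def select_best_match(candidates, metric_column):
    """Return the candidate row that maximizes the chosen metric column."""
    return max(candidates, key=lambda row: row[metric_column])


def compare_metrics(table):
    """For each scene, report (image_id, loss) chosen by each metric."""
    summary = {}
    for scene, candidates in table.items():
        by_a = select_best_match(candidates, A_COLUMN)
        by_ratio = select_best_match(candidates, RATIO_COLUMN)
        summary[scene] = {
            "max_a": (by_a[0], by_a[3]),
            "max_a2_b_minus2": (by_ratio[0], by_ratio[3]),
        }
    return summary


SUMMARY = compare_metrics(TABLE_1)
for scene, picks in SUMMARY.items():
    print(scene, picks)
```
]]></preformat>
<p>With these rows, <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>) selects the lower-loss image in <monospace>office_1</monospace>, <monospace>kitchen</monospace>, and <monospace>room</monospace>, while both metrics select image 1 in <monospace>office_2</monospace>.</p>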
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Comparison of texture alignment with 3D mesh using the transformations obtained with metrics <monospace>max</monospace>(<italic>a</italic>) and <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>). The metric <monospace>max</monospace>(<italic>a</italic><sup>2</sup><italic>b</italic><sup>&#x02212;2</sup>) yields more precise alignment than <monospace>max</monospace>(<italic>a</italic>). The missing triangle mesh regions (white) are due to holes in the input 3D mesh, which we later compensate for using the hole-filling algorithm described in <xref ref-type="sec" rid="A1">Appendix 1</xref>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0007.tif"/>
</fig>
</sec>
<sec>
<title>8.1.2 Feature alignment and pose optimization</title>
<p>With the transformation <italic>T</italic><sub><italic>init</italic></sub> converted to &#x003C1;<sub>0</sub> using <xref ref-type="disp-formula" rid="E6">Equation 5</xref>, the algorithm leverages standard SGD for the LSO, with a learning rate of 1<italic>e</italic>&#x02212;3. The initial alignment disparities between <italic>P</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) and <inline-formula><mml:math id="M53"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> for a given face of the cubemap are evident as indicated by the green ellipses in <xref ref-type="fig" rid="F8">Figure 8</xref>. However, these disparities diminish progressively with iterations. A closer analysis reveals that due to the selection of the front face features for initial estimation, <italic>P</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) are initially more tilted toward one side of the 3D features. However, the inclusion of features from all the faces of the cubemap for optimization eventually results in a more centralized feature localization.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p><inline-formula><mml:math id="M27"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (blue) and <italic>P</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>) (red). <bold>(A)</bold> shows the position of 3D features before optimization and <bold>(B)</bold> shows their position during initialization. The black boundary lines represent one of the cubemap faces (front face), the green ellipses represent the corresponding features in <inline-formula><mml:math id="M28"><mml:msubsup><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <italic>P</italic><sub><italic>i</italic></sub>(<italic>S</italic><sub><italic>f</italic></sub>), and the orange-dotted arrows represent the rays traced from the features for texturing.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0008.tif"/>
</fig>
<p>As illustrated in <xref ref-type="fig" rid="F9">Figure 9</xref>, our algorithm demonstrates consistent loss reduction across all the scenes over successive iterations, which further supports its effectiveness. Moreover, the green bounding boxes highlight the region of interest (ROI) used to compare the precision of alignment at the initial, intermediate, and final iterations. This comparison shows a significant improvement in the alignment, ultimately giving the least error at the point of convergence. The final iteration typically yields the most accurate alignment, although instances of convergence to local minima can be observed. In such cases, early stopping proved beneficial: selecting transformations from the intermediate iterations offered better alignment.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Performance of our approach at different iterations during LSO. The loss function exponentially decreases for all the scenes. Comparing the initial, intermediate, and final iterations shows significant improvement in the alignment process as highlighted by the green ROI.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0009.tif"/>
</fig>
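<p>The idea of running SGD with a learning rate of 1<italic>e</italic>&#x02212;3 and retaining an intermediate iterate can be illustrated on a toy problem. The sketch below fits a 2D translation between two small point sets; it is not the objective function or the stopping rule used in this work. The point sets, the function names, and the choice to keep the iterate with the lowest loss are all assumptions made for illustration.</p>
<preformat><![CDATA[
```python
import random


def mean_squared_residual(points_a, points_b, shift):
    """Mean squared distance between shifted points_a and points_b."""
    total = 0.0
    for (ax, ay), (bx, by) in zip(points_a, points_b):
        total += (ax + shift[0] - bx) ** 2 + (ay + shift[1] - by) ** 2
    return total / len(points_a)


def sgd_keep_best(points_a, points_b, learning_rate=1e-3, iterations=3000, seed=0):
    """Plain SGD on one correspondence per step, keeping the best iterate."""
    rng = random.Random(seed)
    pairs = list(zip(points_a, points_b))
    shift = [0.0, 0.0]
    best_shift = list(shift)
    best_loss = mean_squared_residual(points_a, points_b, shift)
    history = [best_loss]
    for _ in range(iterations):
        (ax, ay), (bx, by) = rng.choice(pairs)
        # Gradient of the squared residual of the sampled correspondence.
        grad_x = 2.0 * (ax + shift[0] - bx)
        grad_y = 2.0 * (ay + shift[1] - by)
        shift[0] -= learning_rate * grad_x
        shift[1] -= learning_rate * grad_y
        loss = mean_squared_residual(points_a, points_b, shift)
        history.append(loss)
        if loss < best_loss:
            best_loss, best_shift = loss, list(shift)
    return best_shift, best_loss, history


# Hypothetical correspondences: roughly a (0.5, -0.25) translation plus noise.
POINTS_A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
POINTS_B = [(0.52, -0.25), (1.48, -0.27), (0.5, 0.77), (1.51, 0.74)]

BEST_SHIFT, BEST_LOSS, HISTORY = sgd_keep_best(POINTS_A, POINTS_B)
print(BEST_SHIFT, BEST_LOSS)
```
]]></preformat>
<p>Because the retained iterate is the one with the lowest recorded loss, a later step that worsens the fit does not overwrite it.</p>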
</sec>
</sec>
<sec>
<title>8.2 Experiments</title>
<sec>
<title>8.2.1 Comparison on real-world scenes</title>
<p>We use the Azure Kinect (Microsoft, <xref ref-type="bibr" rid="B29">2023</xref>) to capture the posed images (<inline-formula><mml:math id="M60"><mml:mrow><mml:mi mathvariant="script">C</mml:mi></mml:mrow></mml:math></inline-formula>) because of its ability to capture detailed 3D data with minimal noise in depth maps (Rijal et al., <xref ref-type="bibr" rid="B38">2023</xref>). As shown in <xref ref-type="fig" rid="F10">Figure 10</xref>, we mount the Azure Kinect camera on a tripod, which facilitates controlled rotation across different pans and tilts to achieve nearly 360&#x000B0; panoramic coverage of a scene. However, due to hardware constraints, we exclude the nadir and the zenith corresponding to the tilts beyond a range of [&#x02212;75&#x000B0;, 75&#x000B0;]. We utilize the RTAB-Map (Labb&#x000E9; et al., <xref ref-type="bibr" rid="B21">2018</xref>) SLAM pipeline to obtain posed images and 3D meshes. Following this, we position the RICOH Theta camera on the same tripod to capture 360&#x000B0; images at a level similar to that of the Azure Kinect.</p>
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>3D mesh <bold>(B)</bold> obtained from Azure Kinect mounted on a tripod with custom-designed hardware <bold>(A)</bold> and the 360&#x000B0; image in equirectangular projection <bold>(D)</bold> from RICOH THETA V <bold>(C)</bold>. Azure Kinect and RICOH Theta V are mounted on the same tripod at similar levels.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0010.tif"/>
</fig>
<p>The problem of estimating the transformation for <italic>S</italic><sub><italic>f</italic></sub> using pre-registered <italic>C</italic><sub><italic>i</italic></sub>s resembles standard SfM approaches where new images can be integrated into an existing system of posed images by extracting features, matching them with pre-registered images, and then triangulating the feature points to obtain the pose of the newly integrated image. We compare our texture alignment results with those of COLMAP (Sch&#x000F6;nberger and Frahm, <xref ref-type="bibr" rid="B44">2016</xref>; Sch&#x000F6;nberger et al., <xref ref-type="bibr" rid="B45">2016</xref>), a popular SfM tool. We utilize COLMAP&#x00027;s feature extraction and matching, update the camera intrinsics, and triangulate the matched features to integrate <italic>S</italic><sub><italic>f</italic></sub> into the existing SLAM system of <italic>C</italic><sub><italic>i</italic></sub>s. Using the transformation calculated for <italic>S</italic><sub><italic>f</italic></sub> in this system, we align the 360&#x000B0; image for texturing. <xref ref-type="fig" rid="F11">Figure 11</xref> shows the comparison of final alignment quality between COLMAP and our approach.</p>
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>Texture alignment results from different approaches: original 3D mesh from RTAB-Map <bold>(left)</bold>, re-textured mesh with transformation obtained from COLMAP <bold>(middle)</bold>, and our approach <bold>(right)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0011.tif"/>
</fig>
<fig id="F16" position="float">
<label>Algorithm 2</label>
<caption><p>Raytracing algorithm.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0016.tif"/>
</fig>
<p>COLMAP primarily relies on traditional SfM techniques designed for perspective images, which limits its ability to register an entire 360&#x000B0; image within a SLAM system of perspective-posed images. Therefore, the front face is chosen as the reference for the 360&#x000B0; image. While other faces can also be registered within the same system, they do not contribute to the overall alignment of the 360&#x000B0; image and act independently. This reliance on only the front face for image registration causes COLMAP to localize features toward one side, neglecting features from other faces. In contrast, our approach incorporates features from all faces, resulting in a more robust and accurate feature alignment as shown in <xref ref-type="fig" rid="F8">Figure 8</xref>. This feature alignment is further corroborated by <xref ref-type="fig" rid="F11">Figure 11</xref>, which shows better texture alignment with our approach.</p>
<p>Specifically, our method consistently achieves higher values in image similarity metrics such as the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM), indicating better alignment quality. PSNR measures the ratio between the maximum possible power of a signal (the squared maximum pixel value) and the power of the noise affecting the image, with higher values representing better texture alignment. SSIM assesses the structural similarity between two images, with higher values indicating greater similarity. Moreover, our approach yields lower values in the perceptual similarity metric LPIPS (Learned Perceptual Image Patch Similarity), which measures perceptual differences between images, with lower values indicating better perceptual alignment. The results presented in <xref ref-type="table" rid="T2">Table 2</xref> show that our approach outperforms COLMAP on all three metrics for every scene.</p>
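<p>For reference, the standard definition of PSNR, 10 log<sub>10</sub>(MAX<sup>2</sup>/MSE), can be written as a short Python sketch. This is not the evaluation code used in this work: the function names and the tiny example images are hypothetical, and a maximum pixel value of 255 (8-bit images) is assumed.</p>
<preformat><![CDATA[
```python
import math


def mean_squared_error(reference, candidate):
    """Mean squared pixel difference between two equally sized images."""
    total = 0.0
    count = 0
    for ref_row, cand_row in zip(reference, candidate):
        for ref_px, cand_px in zip(ref_row, cand_row):
            total += (ref_px - cand_px) ** 2
            count += 1
    return total / count


def psnr(reference, candidate, max_value=255.0):
    """PSNR in dB; max_value=255 assumes 8-bit pixel intensities."""
    mse = mean_squared_error(reference, candidate)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 grayscale images: a reference and two degraded copies.
REFERENCE = [[100, 100], [100, 100]]
OFF_BY_5 = [[105, 105], [105, 105]]
OFF_BY_10 = [[110, 110], [110, 110]]

print(psnr(REFERENCE, OFF_BY_5), psnr(REFERENCE, OFF_BY_10))
```
]]></preformat>
<p>The copy that deviates less from the reference receives the higher PSNR, which is the sense in which higher values indicate better alignment.</p>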
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparison of PSNR, SSIM, and LPIPS metrics between COLMAP and our approach on the real-world dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="center" colspan="2"><bold>PSNR</bold></th>
<th valign="top" align="center" colspan="2"><bold>SSIM</bold></th>
<th valign="top" align="center" colspan="2"><bold>LPIPS</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:#919498;color:#ffffff">
<td/>
<td valign="top" align="center"><bold>COLMAP</bold></td>
<td valign="top" align="center"><bold>OURS</bold></td>
<td valign="top" align="center"><bold>COLMAP</bold></td>
<td valign="top" align="center"><bold>OURS</bold></td>
<td valign="top" align="center"><bold>COLMAP</bold></td>
<td valign="top" align="center"><bold>OURS</bold></td>
</tr>
<tr>
<td valign="top" align="left"><monospace>office_1</monospace> </td>
<td valign="top" align="center">28.198</td>
<td valign="top" align="center"><bold>28.225</bold></td>
<td valign="top" align="center">0.630</td>
<td valign="top" align="center"><bold>0.665</bold></td>
<td valign="top" align="center">0.471</td>
<td valign="top" align="center"><bold>0.357</bold></td>
</tr>
<tr>
<td valign="top" align="left"><monospace>office_2</monospace> </td>
<td valign="top" align="center">28.354</td>
<td valign="top" align="center"><bold>28.363</bold></td>
<td valign="top" align="center">0.694</td>
<td valign="top" align="center"><bold>0.698</bold></td>
<td valign="top" align="center">0.625</td>
<td valign="top" align="center"><bold>0.392</bold></td>
</tr>
<tr>
<td valign="top" align="left"><monospace>kitchen</monospace> </td>
<td valign="top" align="center">28.338</td>
<td valign="top" align="center"><bold>28.942</bold></td>
<td valign="top" align="center">0.600</td>
<td valign="top" align="center"><bold>0.735</bold></td>
<td valign="top" align="center">0.629</td>
<td valign="top" align="center"><bold>0.434</bold></td>
</tr>
<tr>
<td valign="top" align="left"><monospace>room</monospace> </td>
<td valign="top" align="center">28.308</td>
<td valign="top" align="center"><bold>30.060</bold></td>
<td valign="top" align="center">0.480</td>
<td valign="top" align="center"><bold>0.738</bold></td>
<td valign="top" align="center">0.712</td>
<td valign="top" align="center"><bold>0.322</bold></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Bold values represent the best performance for a given metric.</p>
</table-wrap-foot>
</table-wrap>
<p>To calculate these metrics, we use the 2D projection of the 3D mesh region that corresponds to the front face of its 360&#x000B0; image. We take this region from the 3D mesh generated by RTAB-Map as the reference and compare it with the same region from the meshes textured using COLMAP and our approach. An example of the images used for reference and comparison on the <monospace>office_2</monospace> scene is shown in <xref ref-type="fig" rid="F12">Figure 12</xref>.</p>
<fig id="F12" position="float">
<label>Figure 12</label>
<caption><p>Reference image obtained from the 3D mesh generated with RTAB-Map used for comparing similarity metrics on COLMAP and our approach.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0012.tif"/>
</fig>
</sec>
<sec>
<title>8.2.2 Comparison with Matterport3D dataset</title>
<p>We also evaluate our algorithm on the Matterport3D dataset (Chang et al., <xref ref-type="bibr" rid="B9">2017</xref>), which contains 90 RGBD real-world scenes featuring 10,800 panoramic views. Due to the differences in the texture alignment and projection methods used by Matterport3D and our approach, in this section, we present a performance evaluation of our approach on the Matterport3D dataset instead of a direct comparison. Leveraging posed images, depth maps, poses, and intrinsics from the dataset, we construct a TSDF volume for the 3D mesh of a scene to establish a baseline for evaluation.</p>
<p>The number of posed images per scene in the Matterport3D dataset (18) is notably smaller than what our algorithm typically requires (&#x0007E;60). Due to the inherent nature of SGD, the limited number of posed images results in a reduced set of features for optimization, posing a risk of overfitting the objective function. This can cause the algorithm to favor regions with dense feature points, allowing the pose of the 360&#x000B0; image to drift away from the ideal uniform alignment toward the center of projection of the ground truth mesh. To address this problem, we implement early stopping during the optimization process to ensure minimal drift.</p>
<p><xref ref-type="fig" rid="F13">Figure 13</xref> illustrates the performance of our algorithm on four different scenes from the Matterport3D dataset selected based on the presence of walls, which facilitates better comparison of texture projection including potential misalignments. While our algorithm shows overall effectiveness on a large scale, a closer inspection reveals subtle drifts from the ground truth textures, especially around the center of projection. These misalignments are quantified using PSNR, SSIM, and LPIPS metrics in <xref ref-type="table" rid="T3">Table 3</xref>. As the ground truth poses are available for the Matterport3D dataset, we also compute the pose drift of our approach using MSE.</p>
<fig id="F13" position="float">
<label>Figure 13</label>
<caption><p>Performance of our algorithm with a few scenes (identified by their truncated <monospace>panorama_uuid</monospace>) from Matterport3D dataset. The highlighted regions show areas of minor misalignments from our approach as compared to the textures from Matterport3D.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0013.tif"/>
</fig>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Performance metrics for our algorithm on the Matterport3D dataset.</p></caption>
<table frame="box" rules="all">
<thead>
<tr style="background-color:#919498;color:#ffffff">
<th valign="top" align="left"><bold>panorama_uuid</bold></th>
<th valign="top" align="center"><bold>PSNR</bold></th>
<th valign="top" align="center"><bold>SSIM</bold></th>
<th valign="top" align="center"><bold>LPIPS</bold></th>
<th valign="top" align="center"><bold>MSE</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><monospace>2a47f0</monospace> </td>
<td valign="top" align="center">32.399</td>
<td valign="top" align="center">0.749</td>
<td valign="top" align="center">0.171</td>
<td valign="top" align="center">0.1887</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>1a4133</monospace> </td>
<td valign="top" align="center">29.530</td>
<td valign="top" align="center">0.695</td>
<td valign="top" align="center">0.191</td>
<td valign="top" align="center">0.2713</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>3a6d23</monospace> </td>
<td valign="top" align="center">30.941</td>
<td valign="top" align="center">0.680</td>
<td valign="top" align="center">0.139</td>
<td valign="top" align="center">0.0357</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>9ea013</monospace> </td>
<td valign="top" align="center">30.192</td>
<td valign="top" align="center">0.620</td>
<td valign="top" align="center">0.184</td>
<td valign="top" align="center">0.0283</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>MSE is computed using observed and ground truth poses, while the rest of the metrics are computed as described in Section 8.2.1. The <monospace>panorama_uuid</monospace>s represent the scene ids of the Matterport3D scenes shown in Figure 13.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s9">
<title>9 Discussion</title>
<p>In this article, we introduced a novel methodology for aligning a 360&#x000B0; image with its corresponding 3D mesh using a custom objective function for least squares optimization. The results show that our approach significantly reduces the spatial drift between the 3D mesh and the 360&#x000B0; image. The precision of this alignment is further visualized from the alignment of features extracted from the posed and 360&#x000B0; images. For the critical task of texture projection, we introduced a customized raytracing algorithm. Our approach allows for accurate texture projection using the optimal alignment transformation while addressing the inherent texturing issue (rainbow-like artifacts) at the poles of 360&#x000B0; images, thus enhancing the quality of the textured mesh.</p>
<p>We also evaluated our approach against baselines such as RTAB-Map, COLMAP, and Matterport3D. Compared with RTAB-Map, our method demonstrated better texture quality as measured by image similarity metrics. Unlike SfM approaches such as COLMAP, our approach takes into account the features from all the cubemap faces, which helps prevent the SGD loss from converging to a local minimum. As a result, the transformations obtained from our approach provided a better texture alignment than COLMAP. On the Matterport3D dataset, our method showed commendable performance, albeit with minor misalignment issues. This misalignment is primarily due to the limited number of images available for each scene in the Matterport3D dataset. We also quantified the performance of our approach with the standard image similarity metrics PSNR, SSIM, and LPIPS, and with the MSE of the estimated pose. The implications of our approach hold promise for advancing domains such as 3D reconstruction and virtual tours to enhance realism and accuracy.</p>
<p>The current pipeline is effective but not without limitations. One significant challenge is the tendency for alignment to skew toward areas with a higher concentration of SIFT features, a limitation not unique to our method but inherent in classical feature alignment techniques. Moreover, while deep learning approaches like SuperPoint (DeTone et al., <xref ref-type="bibr" rid="B11">2018</xref>) offer improved performance, they are constrained in textureless regions. The computational intensity of our least squares optimization, particularly due to its numerous non-linear components, is a potential area for further refinement. Additionally, due to the nature of a 360&#x000B0; image, the texturing is best viewed from the center of projection and gradually warps with increasing distance from the center.</p>
</sec>
<sec id="s10">
<title>10 Conclusion and future works</title>
<p>Our proposed methodology for 360&#x000B0; image alignment with corresponding 3D mesh and a customized texture projection algorithm holds promise for advancing domains such as 3D reconstruction and virtual tours. Despite its limitations regarding alignment tendencies for a limited number of posed images and computational intensity, our approach is comparable to the baseline methods and outperforms the widely used SfM and reconstruction approaches in the industry such as COLMAP and RTAB-Map.</p>
<p>Our future research will focus on optimizing our pipeline further and extending its applications. Because the pipeline was initially conceived for indoor environment modeling in virtual tours, we aim to conduct rigorous testing of our methodology in such applications in the near future. The current hardware limitations, especially the Azure Kinect&#x00027;s restricted distance coverage, often cause odometry loss during outdoor scene captures. To address this limitation, future research will explore alternative hardware with wider distance coverage, thus enhancing our pipeline&#x00027;s suitability for outdoor environments as well.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s11">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec sec-type="author-contributions" id="s12">
<title>Author contributions</title>
<p>BK: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. MO: Conceptualization, Formal analysis, Investigation, Methodology, Software, Supervision, Writing &#x02013; review &#x00026; editing. SR: Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing &#x02013; original draft, Writing &#x02013; review &#x00026; editing. VO: Project administration, Resources, Supervision, Writing &#x02013; review &#x00026; editing.</p>
</sec>
<sec sec-type="funding-information" id="s13">
<title>Funding</title>
<p>The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.</p>
</sec>
<ack><p>The authors would like to extend their sincere gratitude to E.K. Solutions Pvt. Ltd., Nepal for not only providing invaluable support and resources but also for granting the opportunity to conduct this research. Additionally, the authors would like to express their appreciation to the entire AI team at E.K. Solutions, whose support and constructive feedback contributed significantly to the refinement of this research.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>BK, MO, SR, and VO were employed by E.K. Solutions Pvt. Ltd.</p>
</sec>
<sec sec-type="disclaimer" id="s14">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>We do not translate along the <italic>v</italic>-axis as we have a single 360&#x000B0; image; however, in the case of vertically stacked 360&#x000B0; images (such as an atlas), translation along the <italic>v</italic>-axis is required.</p></fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aghayari</surname> <given-names>S.</given-names></name> <name><surname>Saadatseresht</surname> <given-names>M.</given-names></name> <name><surname>Omidalizarandi</surname> <given-names>M.</given-names></name> <name><surname>Neumann</surname> <given-names>I.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Geometric calibration of full spherical panoramic Ricoh-Theta camera,&#x0201D;</article-title> in <source>ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences; IV-1/W1. 4</source> (<publisher-loc>G&#x000F6;ttingen</publisher-loc>: <publisher-name>Copernicus GmbH</publisher-name>), <fpage>237</fpage>&#x02013;<lpage>245</lpage>. <pub-id pub-id-type="doi">10.5194/isprs-annals-IV-1-W1-237-2017</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Alj</surname> <given-names>Y.</given-names></name> <name><surname>Boisson</surname> <given-names>G.</given-names></name> <name><surname>Bordes</surname> <given-names>P.</given-names></name> <name><surname>Pressigout</surname> <given-names>M.</given-names></name> <name><surname>Morin</surname> <given-names>L.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Multi-texturing 3D models: how to choose the best texture?,&#x0201D;</article-title> in <source>2012 International Conference on 3D Imaging (IC3D)</source> (<publisher-loc>Liege</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/IC3D.2012.6615115</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Alj</surname> <given-names>Y.</given-names></name> <name><surname>Boisson</surname> <given-names>G.</given-names></name> <name><surname>Bordes</surname> <given-names>P.</given-names></name> <name><surname>Pressigout</surname> <given-names>M.</given-names></name> <name><surname>Morin</surname> <given-names>L.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Space carving MVD sequences for modeling natural 3d scenes,&#x0201D;</article-title> in <source>Proceedings of the 23rd annual conference on computer graphics and interactive techniques</source> (<publisher-loc>SPIE, ACM</publisher-loc>), <fpage>42</fpage>&#x02013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1117/12.908608</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Allene</surname> <given-names>C.</given-names></name> <name><surname>Pons</surname> <given-names>J.-P.</given-names></name> <name><surname>Keriven</surname> <given-names>R.</given-names></name></person-group> (<year>2008</year>). <article-title>&#x0201C;Seamless image-based texture atlases using multi-band blending,&#x0201D;</article-title> in <source>2008 19th international conference on pattern recognition</source> (<publisher-loc>Tampa, FL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>4</lpage>. <pub-id pub-id-type="doi">10.1109/ICPR.2008.4761913</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Alliez</surname> <given-names>P.</given-names></name> <name><surname>Fabri</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;CGAL: the computational geometry algorithms library,&#x0201D;</article-title> in <source>ACM SIGGRAPH 2016 Courses</source> (<publisher-loc>ACM</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1145/2897826.2927362</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Arth</surname> <given-names>C.</given-names></name> <name><surname>Wagner</surname> <given-names>D.</given-names></name> <name><surname>Klopschitz</surname> <given-names>M.</given-names></name> <name><surname>Irschara</surname> <given-names>A.</given-names></name> <name><surname>Schmalstieg</surname> <given-names>D.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Wide area localization on mobile phones,&#x0201D;</article-title> in <source>2009 8th IEEE international symposium on mixed and augmented reality</source> (<publisher-loc>Orlando, FL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>73</fpage>&#x02013;<lpage>82</lpage>. <pub-id pub-id-type="doi">10.1109/ISMAR.2009.5336494</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Buehler</surname> <given-names>C.</given-names></name> <name><surname>Bosse</surname> <given-names>M.</given-names></name> <name><surname>McMillan</surname> <given-names>L.</given-names></name> <name><surname>Gortler</surname> <given-names>S.</given-names></name> <name><surname>Cohen</surname> <given-names>M.</given-names></name></person-group> (<year>2001</year>). <article-title>&#x0201C;Unstructured lumigraph rendering,&#x0201D;</article-title> in <source>Proceedings of the 28th annual conference on Computer graphics and interactive techniques</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>425</fpage>&#x02013;<lpage>432</lpage>. <pub-id pub-id-type="doi">10.1145/383259.383309</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Cernea</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <source>OpenMVS: Multi-View Stereo Reconstruction Library</source>. <ext-link ext-link-type="uri" xlink:href="https://cdcseacave.github.io/openMVS">https://cdcseacave.github.io/openMVS</ext-link> (accessed November 21, 2023).</citation>
</ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>A.</given-names></name> <name><surname>Dai</surname> <given-names>A.</given-names></name> <name><surname>Funkhouser</surname> <given-names>T.</given-names></name> <name><surname>Halber</surname> <given-names>M.</given-names></name> <name><surname>Nie&#x000DF;ner</surname> <given-names>M.</given-names></name> <name><surname>Savva</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Matterport3D: learning from RGB-D data in indoor environments,&#x0201D;</article-title> in <source>2017 International Conference on 3D Vision (3DV)</source> (<publisher-loc>Qingdao</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>667</fpage>&#x02013;<lpage>676</lpage>. <pub-id pub-id-type="doi">10.1109/3DV.2017.00081</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Delamarre</surname> <given-names>Q.</given-names></name> <name><surname>Faugeras</surname> <given-names>O.</given-names></name></person-group> (<year>1999</year>). <article-title>3D articulated models and multi-view tracking with silhouettes</article-title>. <source>Proc. Seventh IEEE Int. Conf. Comput. Vis</source>. <volume>2</volume>, <fpage>716</fpage>&#x02013;<lpage>721</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.1999.790292</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>DeTone</surname> <given-names>D.</given-names></name> <name><surname>Malisiewicz</surname> <given-names>T.</given-names></name> <name><surname>Rabinovich</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Superpoint: self-supervised interest point detection and description,&#x0201D;</article-title> in <source>Proceedings of the IEEE conference on computer vision and pattern recognition workshops</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>224</fpage>&#x02013;<lpage>236</lpage>. <pub-id pub-id-type="doi">10.1109/CVPRW.2018.00060</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Engel</surname> <given-names>J.</given-names></name> <name><surname>Schoeps</surname> <given-names>T.</given-names></name> <name><surname>Cremers</surname> <given-names>D.</given-names></name></person-group> (<year>2014</year>). <article-title>LSD-SLAM: large-scale direct monocular SLAM</article-title>. <source>Eur. Conf. Comput. Vis</source>. <volume>8690</volume>, <fpage>1</fpage>&#x02013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-10605-2_54</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fix</surname> <given-names>E.</given-names></name> <name><surname>Hodges</surname> <given-names>J. L.</given-names></name></person-group> (<year>1951</year>). <source>Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties</source>. <publisher-loc>San Antonio, TX</publisher-loc>: <publisher-name>USAF School of Aviation Medicine, Randolph Field, Texas</publisher-name>.</citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fu</surname> <given-names>Y.</given-names></name> <name><surname>Yan</surname> <given-names>Q.</given-names></name> <name><surname>Yang</surname> <given-names>L.</given-names></name> <name><surname>Liao</surname> <given-names>J.</given-names></name> <name><surname>Xiao</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Texture mapping for 3D reconstruction with rgb-d sensor,&#x0201D;</article-title> in <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4645</fpage>&#x02013;<lpage>4653</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00488</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Geiger</surname> <given-names>A.</given-names></name> <name><surname>Moosmann</surname> <given-names>F.</given-names></name> <name><surname>Car</surname> <given-names>O.</given-names></name> <name><surname>Schuster</surname> <given-names>B.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Automatic camera and range sensor calibration using a single shot,&#x0201D;</article-title> in <source>2012 IEEE International Conference on Robotics and Automation</source> (<publisher-loc>Saint Paul, MN</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3936</fpage>&#x02013;<lpage>3943</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2012.6224570</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gortler</surname> <given-names>S. J.</given-names></name> <name><surname>Grzeszczuk</surname> <given-names>R.</given-names></name> <name><surname>Szeliski</surname> <given-names>R.</given-names></name> <name><surname>Cohen</surname> <given-names>M. F.</given-names></name></person-group> (<year>1996</year>). <source>The Lumigraph</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>.</citation>
</ref>
<ref id="B17">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Horn</surname> <given-names>R. A.</given-names></name> <name><surname>Johnson</surname> <given-names>C. R.</given-names></name></person-group> (<year>1990</year>). <source>Matrix Analysis</source>. Cambridge: Cambridge University Press. Available at: <ext-link ext-link-type="uri" xlink:href="https://books.google.com.np/books?id=PlYQN0ypTwEC">https://books.google.com.np/books?id=PlYQN0ypTwEC</ext-link> (accessed June 14, 2023).</citation>
</ref>
<ref id="B18">
<citation citation-type="web"><person-group person-group-type="author"><collab>Imatest</collab></person-group> (<year>2023</year>). <source>Geometric Calibration: Projective Camera Model</source>. Available at: <ext-link ext-link-type="uri" xlink:href="https://www.imatest.com/support/docs/pre-5-2/geometric-calibration-deprecated/projective-camera/">https://www.imatest.com/support/docs/pre-5-2/geometric-calibration-deprecated/projective-camera/</ext-link> (accessed July 22, 2023).</citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kerbl</surname> <given-names>B.</given-names></name> <name><surname>Kopanas</surname> <given-names>G.</given-names></name> <name><surname>Leimk&#x000FC;hler</surname> <given-names>T.</given-names></name> <name><surname>Drettakis</surname> <given-names>G.</given-names></name></person-group> (<year>2023</year>). <article-title>3D Gaussian splatting for real-time radiance field rendering</article-title>. <source>arXiv</source> [Preprint]. arXiv:2308.04079. <pub-id pub-id-type="doi">10.48550/arXiv.2308.04079</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Labbe</surname> <given-names>M.</given-names></name> <name><surname>Michaud</surname> <given-names>F.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Online global loop closure detection for large-scale multi-session graph-based SLAM,&#x0201D;</article-title> in <source>2014 IEEE/RSJ International Conference on Intelligent Robots and Systems</source> (<publisher-loc>Chicago, IL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2661</fpage>&#x02013;<lpage>2666</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2014.6942926</pub-id></citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Labb&#x000E9;</surname> <given-names>M.</given-names></name> <name><surname>Michaud</surname> <given-names>F.</given-names></name></person-group> (<year>2018</year>). <article-title>RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation</article-title>. <source>J. Field Robot</source>. <volume>36</volume>, <fpage>416</fpage>&#x02013;<lpage>446</lpage>. <pub-id pub-id-type="doi">10.1002/rob.21831</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lempitsky</surname> <given-names>V.</given-names></name> <name><surname>Ivanov</surname> <given-names>D.</given-names></name></person-group> (<year>2007</year>). <article-title>&#x0201C;Seamless mosaicing of image-based texture maps,&#x0201D;</article-title> in <source>2007 IEEE conference on computer vision and pattern recognition</source> (<publisher-loc>Minneapolis, MN</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2007.383078</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Dyke</surname> <given-names>S. J.</given-names></name> <name><surname>Yeum</surname> <given-names>C. M.</given-names></name> <name><surname>Bilionis</surname> <given-names>I.</given-names></name> <name><surname>Lenjani</surname> <given-names>A.</given-names></name> <name><surname>Choi</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Automated indoor image localization to support a post-event building assessment</article-title>. <source>Sensors</source> <volume>20</volume>:<fpage>1610</fpage>. <pub-id pub-id-type="doi">10.3390/s20061610</pub-id><pub-id pub-id-type="pmid">32183201</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lowe</surname> <given-names>D.</given-names></name></person-group> (<year>2004</year>). <article-title>Distinctive image features from scale-invariant keypoints</article-title>. <source>Int. J. Comput. Vis</source>. <volume>60</volume>, <fpage>91</fpage>&#x02013;<lpage>110</lpage>. <pub-id pub-id-type="doi">10.1023/B:VISI.0000029664.99615.94</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Markelj</surname> <given-names>P.</given-names></name> <name><surname>Toma&#x0017E;evi&#x0010D;</surname> <given-names>D.</given-names></name> <name><surname>Likar</surname> <given-names>B.</given-names></name> <name><surname>Pernu&#x00161;</surname> <given-names>F.</given-names></name></person-group> (<year>2010</year>). <article-title>A review of 3D/2D registration methods for image guided interventions</article-title>. <source>Med. Image Anal</source>. <volume>16</volume>, <fpage>642</fpage>&#x02013;<lpage>661</lpage>. <pub-id pub-id-type="doi">10.1016/j.media.2010.03.005</pub-id><pub-id pub-id-type="pmid">20452269</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marquardt</surname> <given-names>D. W.</given-names></name></person-group> (<year>1963</year>). <article-title>An algorithm for least-squares estimation of nonlinear parameters</article-title>. <source>J. Soc. Ind. Appl. Math</source>. <volume>11</volume>, <fpage>431</fpage>&#x02013;<lpage>441</lpage>. <pub-id pub-id-type="doi">10.1137/0111030</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mastin</surname> <given-names>A.</given-names></name> <name><surname>Kepner</surname> <given-names>J.</given-names></name> <name><surname>Fisher</surname> <given-names>J.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Automatic registration of LIDAR and optical images of urban scenes,&#x0201D;</article-title> in <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Miami, FL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2639</fpage>&#x02013;<lpage>2646</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2009.5206539</pub-id><pub-id pub-id-type="pmid">30832435</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="web"><person-group person-group-type="author"><collab>Metareal Inc</collab></person-group> (<year>2023</year>). <source>Metareal</source>. Available at: <ext-link ext-link-type="uri" xlink:href="https://www.metareal.com/">https://www.metareal.com/</ext-link> (accessed November 15, 2023).</citation>
</ref>
<ref id="B29">
<citation citation-type="web"><person-group person-group-type="author"><collab>Microsoft</collab></person-group> (<year>2023</year>). <source>Azure Kinect DK depth camera</source>. Available at: <ext-link ext-link-type="uri" xlink:href="https://learn.microsoft.com/en-us/azure/kinect-dk/depth-camera">https://learn.microsoft.com/en-us/azure/kinect-dk/depth-camera</ext-link> (accessed November 15, 2023).</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mildenhall</surname> <given-names>B.</given-names></name> <name><surname>Srinivasan</surname> <given-names>P. P.</given-names></name> <name><surname>Tancik</surname> <given-names>M.</given-names></name> <name><surname>Barron</surname> <given-names>J. T.</given-names></name> <name><surname>Ramamoorthi</surname> <given-names>R.</given-names></name> <name><surname>Ng</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>NeRF: representing scenes as neural radiance fields for view synthesis</article-title>. <source>arXiv</source> [Preprint]. arXiv:2003.08934. <pub-id pub-id-type="doi">10.48550/arXiv.2003.08934</pub-id></citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mishra</surname> <given-names>R.</given-names></name></person-group> (<year>2012</year>). <article-title>A review of optical imagery and airborne LiDAR data registration methods</article-title>. <source>Open Remote Sens. J</source>. <volume>5</volume>, <fpage>54</fpage>&#x02013;<lpage>63</lpage>. <pub-id pub-id-type="doi">10.2174/1875413901205010054</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mur-Artal</surname> <given-names>R.</given-names></name> <name><surname>Montiel</surname> <given-names>J. M. M.</given-names></name> <name><surname>Tardos</surname> <given-names>J. D.</given-names></name></person-group> (<year>2015</year>). <article-title>ORB-SLAM: a versatile and accurate monocular SLAM system</article-title>. <source>IEEE Trans. Robot</source>. <volume>31</volume>, <fpage>1147</fpage>&#x02013;<lpage>1163</lpage>. <pub-id pub-id-type="doi">10.1109/TRO.2015.2463671</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="web"><person-group person-group-type="author"><collab>Nagwa</collab></person-group> (<year>2023</year>). <source>The Perpendicular Distance between Points and Straight Lines in Space</source>. Available at: <ext-link ext-link-type="uri" xlink:href="https://www.nagwa.com/en/explainers/939127418581/">https://www.nagwa.com/en/explainers/939127418581/</ext-link> (accessed August 08, 2023).</citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nebel</surname> <given-names>S.</given-names></name> <name><surname>Beege</surname> <given-names>M.</given-names></name> <name><surname>Schneider</surname> <given-names>S.</given-names></name> <name><surname>Rey</surname> <given-names>G. D.</given-names></name></person-group> (<year>2020</year>). <article-title>A review of photogrammetry and photorealistic 3D models in education from a psychological perspective</article-title>. <source>Front. Educ</source>. <volume>5</volume>:<fpage>144</fpage>. <pub-id pub-id-type="doi">10.3389/feduc.2020.00144</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nie&#x000DF;ner</surname> <given-names>M.</given-names></name> <name><surname>Zollh&#x000F6;fer</surname> <given-names>M.</given-names></name> <name><surname>Izadi</surname> <given-names>S.</given-names></name> <name><surname>Stamminger</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>Real-time 3D reconstruction at scale using voxel hashing</article-title>. <source>ACM Trans. Graph</source>. <volume>32</volume>, <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1145/2508363.2508374</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Panek</surname> <given-names>V.</given-names></name> <name><surname>Kukelova</surname> <given-names>Z.</given-names></name> <name><surname>Sattler</surname> <given-names>T.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0201C;Visual localization using imperfect 3D Models from the internet,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>13175</fpage>&#x02013;<lpage>13186</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR52729.2023.01266</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Park</surname> <given-names>J.</given-names></name> <name><surname>Jeon</surname> <given-names>I.-B.</given-names></name> <name><surname>Yoon</surname> <given-names>S.-E.</given-names></name> <name><surname>Woo</surname> <given-names>W.</given-names></name></person-group> (<year>2021</year>). <article-title>Instant panoramic texture mapping with semantic object matching for large-scale urban scene reproduction</article-title>. <source>IEEE Trans. Vis. Comput. Graph</source>. <volume>27</volume>, <fpage>2746</fpage>&#x02013;<lpage>2756</lpage>. <pub-id pub-id-type="doi">10.1109/TVCG.2021.3067768</pub-id><pub-id pub-id-type="pmid">33760735</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Rijal</surname> <given-names>S.</given-names></name> <name><surname>Pokhrel</surname> <given-names>S.</given-names></name> <name><surname>Om</surname> <given-names>M.</given-names></name> <name><surname>Ojha</surname> <given-names>V. P.</given-names></name></person-group> (<year>2023</year>). <source>Comparing Depth Estimation of Azure Kinect and Realsense D435i Cameras</source>. Available at: <ext-link ext-link-type="uri" xlink:href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4597442">https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4597442</ext-link> (accessed June 15, 2023).</citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rumelhart</surname> <given-names>D. E.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name> <name><surname>Williams</surname> <given-names>R. J.</given-names></name></person-group> (<year>1986</year>). <source>Learning internal representations by error propagation</source>. <publisher-loc>San Diego, CA</publisher-loc>: <publisher-name>Institute for Cognitive Science, University of California, San Diego</publisher-name>.</citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Russell</surname> <given-names>B. C.</given-names></name> <name><surname>Sivic</surname> <given-names>J.</given-names></name> <name><surname>Ponce</surname> <given-names>J.</given-names></name> <name><surname>Dessales</surname> <given-names>H.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Automatic alignment of paintings and photographs depicting a 3D scene,&#x0201D;</article-title> in <source>2011 IEEE international conference on computer vision workshops (ICCV workshops)</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>545</fpage>&#x02013;<lpage>552</lpage>. <pub-id pub-id-type="doi">10.1109/ICCVW.2011.6130291</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sansoni</surname> <given-names>G.</given-names></name> <name><surname>Trebeschi</surname> <given-names>M.</given-names></name> <name><surname>Docchio</surname> <given-names>F.</given-names></name></person-group> (<year>2009</year>). <article-title>State-of-the-art and applications of 3D imaging sensors in industry, cultural heritage, medicine, and criminal investigation</article-title>. <source>Sensors</source> <volume>9</volume>, <fpage>568</fpage>&#x02013;<lpage>601</lpage>. <pub-id pub-id-type="doi">10.3390/s90100568</pub-id><pub-id pub-id-type="pmid">22389618</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sattler</surname> <given-names>T.</given-names></name> <name><surname>Leibe</surname> <given-names>B.</given-names></name> <name><surname>Kobbelt</surname> <given-names>L.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;Fast image-based localization using direct 2d-to-3d matching,&#x0201D;</article-title> in <source>2011 International Conference on Computer Vision</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>667</fpage>&#x02013;<lpage>674</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2011.6126302</pub-id></citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sattler</surname> <given-names>T.</given-names></name> <name><surname>Weyand</surname> <given-names>T.</given-names></name> <name><surname>Leibe</surname> <given-names>B.</given-names></name> <name><surname>Kobbelt</surname> <given-names>L.</given-names></name></person-group> (<year>2012</year>). <article-title>Image retrieval for image-based localization revisited</article-title>. <source>BMVC</source> <volume>1</volume>:<fpage>4</fpage>. <pub-id pub-id-type="doi">10.5244/C.26.76</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sch&#x000F6;nberger</surname> <given-names>J. L.</given-names></name> <name><surname>Frahm</surname> <given-names>J.-M.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Structure-from-motion revisited,&#x0201D;</article-title> in <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <fpage>4104</fpage>&#x02013;<lpage>4113</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sch&#x000F6;nberger</surname> <given-names>J. L.</given-names></name> <name><surname>Zheng</surname> <given-names>E.</given-names></name> <name><surname>Frahm</surname> <given-names>J.-M.</given-names></name> <name><surname>Pollefeys</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Pixelwise view selection for unstructured multi-view stereo,&#x0201D;</article-title> in <source>Computer Vision &#x02013; ECCV 2016</source>, eds. B. Leibe, J. Matas, N. Sebe, and M. Welling (Cham: Springer International Publishing), <fpage>501</fpage>&#x02013;<lpage>518</lpage>.</citation>
</ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Stamos</surname> <given-names>I.</given-names></name></person-group> (<year>2010</year>). <article-title>&#x0201C;Automated registration of 3D-range with 2D-color images: an overview,&#x0201D;</article-title> in <source>2010 44th Annual Conference on Information Sciences and Systems, CISS 2010</source> (<publisher-loc>Princeton, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/CISS.2010.5464815</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sufiyan</surname> <given-names>D.</given-names></name> <name><surname>Pheh</surname> <given-names>Y. H.</given-names></name> <name><surname>Win</surname> <given-names>L. S. T.</given-names></name> <name><surname>Win</surname> <given-names>S. K. H.</given-names></name> <name><surname>Tan</surname> <given-names>U.-X.</given-names></name></person-group> (<year>2023</year>). <article-title>&#x0201C;Panoramic image-based aerial localization using synthetic data via photogrammetric reconstruction,&#x0201D;</article-title> in <source>2023 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM)</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/AIM46323.2023.10196148</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Waechter</surname> <given-names>M.</given-names></name> <name><surname>Moehrle</surname> <given-names>N.</given-names></name> <name><surname>Goesele</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Let there be color! Large-scale texturing of 3D reconstructions,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13</source> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>836</fpage>&#x02013;<lpage>850</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_54</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>X.</given-names></name> <name><surname>Kealy</surname> <given-names>A.</given-names></name> <name><surname>Li</surname> <given-names>W.</given-names></name> <name><surname>Jelfs</surname> <given-names>B.</given-names></name> <name><surname>Gilliam</surname> <given-names>C.</given-names></name> <name><surname>Le May</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Toward autonomous UAV localization via aerial image registration</article-title>. <source>Electronics</source> <volume>10</volume>:<fpage>435</fpage>. <pub-id pub-id-type="doi">10.3390/electronics10040435</pub-id></citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Whelan</surname> <given-names>T.</given-names></name> <name><surname>Kaess</surname> <given-names>M.</given-names></name> <name><surname>Johannsson</surname> <given-names>H.</given-names></name> <name><surname>Fallon</surname> <given-names>M.</given-names></name> <name><surname>Leonard</surname> <given-names>J. J.</given-names></name> <name><surname>McDonald</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Real-time large-scale dense RGB-D SLAM with volumetric fusion</article-title>. <source>Int. J. Robot. Res</source>. <volume>34</volume>, <fpage>598</fpage>&#x02013;<lpage>626</lpage>. <pub-id pub-id-type="doi">10.1177/0278364914551008</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>K.</given-names></name> <name><surname>Wang</surname> <given-names>K.</given-names></name> <name><surname>Zhao</surname> <given-names>X.</given-names></name> <name><surname>Cheng</surname> <given-names>R.</given-names></name> <name><surname>Bai</surname> <given-names>J.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>IR stereo RealSense: Decreasing minimum range of navigational assistance for visually impaired individuals</article-title>. <source>J. Ambient Intell. Smart Environ</source>. <volume>9</volume>, <fpage>743</fpage>&#x02013;<lpage>755</lpage>. <pub-id pub-id-type="doi">10.3233/AIS-170459</pub-id><pub-id pub-id-type="pmid">29714283</pub-id></citation></ref>
</ref-list>
<app-group>
<app id="A1">
<title>Appendix 1</title>
<p>The 3D mesh obtained from RTAB-Map initially contains many holes in regions where no 3D points were captured. To address this, we apply CGAL&#x00027;s hole-filling pipeline (Alliez and Fabri, <xref ref-type="bibr" rid="B5">2016</xref>) and then run our texturing method on the filled mesh.</p>
<p>For 3D meshes of completely captured scenes, such as <monospace>room</monospace> and <monospace>kitchen</monospace>, the texture projections are accurately aligned. Partially captured scenes, such as <monospace>office_1</monospace> and <monospace>office_2</monospace>, suffer from mesh deformations: because the hole-filling algorithm tends to produce watertight meshes, it introduces noticeable deformations, particularly at the curved extremities. Capturing the complete scene is therefore necessary for the hole-filling step to yield complete, accurate, and clean results.</p>
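<p>For intuition about what a hole-filling step operates on, the following minimal Python sketch finds the boundary loops of a triangle mesh, i.e., the closed chains of edges that belong to exactly one face and therefore outline a hole or an open rim. It is an illustrative assumption, not the CGAL implementation or the code used in this work: the function name <monospace>boundary_loops</monospace> and the face-list input are hypothetical, and the mesh is assumed to be manifold with consistently oriented faces, so that each boundary vertex has exactly one outgoing boundary edge.</p>
<preformat>
```python
from collections import defaultdict


def boundary_loops(faces):
    """Return the boundary loops of a triangle mesh as lists of vertex ids.

    faces: iterable of (a, b, c) vertex-index triples with consistent winding.
    Assumes a manifold mesh (hypothetical helper, not part of CGAL).
    """
    # Count how many faces use each undirected edge and remember the
    # direction in which the edge was last traversed.
    edge_count = defaultdict(int)
    directed = {}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            key = (min(u, v), max(u, v))
            edge_count[key] += 1
            directed[key] = (u, v)

    # An edge used by exactly one face lies on a boundary (hole or open rim).
    next_vertex = {}
    for key, count in edge_count.items():
        if count == 1:
            u, v = directed[key]
            next_vertex[u] = v

    # Follow the directed boundary edges until each loop closes on itself.
    loops = []
    while next_vertex:
        start, current = next_vertex.popitem()
        loop = [start]
        while current != start:
            loop.append(current)
            current = next_vertex.pop(current)
        loops.append(loop)
    return loops


if __name__ == "__main__":
    # Toy input: a tetrahedron with one face removed, leaving one opening.
    open_tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2)]
    print(boundary_loops(open_tetrahedron))
```
</preformat>
<p>On a closed tetrahedron the sketch returns no loops, whereas removing one face leaves a single loop through the three vertices of the missing face. A hole-filling method triangulates each such loop, so the large open rims of a partially captured scene may be closed with surface that has no counterpart in the real scene.</p>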
<fig id="F14" position="float">
<label>Figure A1</label>
<caption><p>Textured meshes obtained with different approaches: original 3D mesh <bold>(left)</bold>, mesh re-textured with the proposed method <bold>(middle)</bold>, and hole-filled re-textured mesh <bold>(right)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-06-1388174-g0014.tif"/>
</fig>
</app>
</app-group>
</back>
</article>