


Generating large-scale, high-quality 3D scene reconstructions from monocular images is an essential technical foundation for augmented reality and robotics. However, apparent shortcomings (e.g., scale ambiguity and dense depth estimation in texture-less areas) make applying monocular 3D reconstruction to real-world practice challenging. In this work, we combine the advantages of deep learning and multi-view geometry to propose RGB-Fusion, which effectively addresses the inherent limitations of traditional monocular reconstruction. To overcome the limits on tracking accuracy imposed by the prediction deficiencies of neural networks, we integrate the PnP (Perspective-n-Point) algorithm into the tracking module.
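As a rough illustration of how such a PnP step can fit into a tracking module, the sketch below uses OpenCV's RANSAC-based solver; the function name, inputs, and thresholds are illustrative assumptions, not RGB-Fusion's actual interface.

```python
import cv2
import numpy as np

def estimate_pose_pnp(points_3d, points_2d, K, dist_coeffs=None):
    """Estimate the current camera pose from 2D-3D matches.

    points_3d: (N, 3) map points back-projected from the predicted depth map.
    points_2d: (N, 2) matched keypoints in the current frame.
    K: (3, 3) pinhole intrinsics.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K, dist_coeffs,
        reprojectionError=3.0,          # RANSAC inlier threshold in pixels
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        return None                     # let tracking fall back to another cue
    R, _ = cv2.Rodrigues(rvec)          # axis-angle -> rotation matrix
    return R, tvec, inliers
```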
We employ 3D ICP (Iterative Closest Point) matching and 2D feature matching to construct separate error terms and jointly optimize them, which reduces the dependence on depth prediction accuracy and improves pose estimation.
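One plausible form of this joint objective (our notation; the paper's exact formulation and robust weighting may differ) adds a point-to-point ICP term to a 2D reprojection term with a balance weight $\lambda$:

$$
E(\mathbf{T}) \;=\; \sum_{i} \bigl\| \mathbf{p}_i - \mathbf{T}\,\mathbf{q}_i \bigr\|^2 \;+\; \lambda \sum_{j} \bigl\| \mathbf{u}_j - \pi\!\left(\mathbf{T}\,\mathbf{x}_j\right) \bigr\|^2,
$$

where $\mathbf{T}$ is the camera pose being optimized, $(\mathbf{p}_i, \mathbf{q}_i)$ are matched 3D point pairs for the ICP term, $\mathbf{u}_j$ are matched 2D keypoints with associated 3D points $\mathbf{x}_j$, and $\pi(\cdot)$ is the pinhole projection. Because the reprojection term does not depend on predicted depth, the joint form keeps pose estimation usable even when the depth prediction is noisy.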

The approximate pose predicted by the neural network is used as the initial value of the optimization to avoid becoming trapped in local minima. We further formulate a depth map refinement strategy based on the uncertainty of each depth value, which naturally yields a refined depth map: low-uncertainty elements contribute strongly to the updated depth value, while high-uncertainty elements are prevented from degrading depth estimation accuracy. Qualitative and quantitative evaluations of tracking, depth prediction, and 3D reconstruction show that RGB-Fusion outperforms most monocular 3D reconstruction systems.
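A minimal sketch of the uncertainty-gated refinement described above, assuming an inverse-variance (Kalman-style) per-pixel update; RGB-Fusion's actual weighting and gating rules may differ, and all names and thresholds here are illustrative.

```python
import numpy as np

def refine_depth(d_cur, var_cur, d_new, var_new, var_gate=0.25):
    """Fuse a new per-pixel depth measurement into the current estimate.

    d_cur, var_cur: current depth map and its per-pixel variance.
    d_new, var_new: newly predicted/triangulated depth and its variance.
    var_gate: measurements above this variance are considered too
              uncertain to touch the estimate (illustrative threshold).
    """
    # Inverse-variance weighting: low-uncertainty measurements dominate
    # the fused value, so they can significantly update the depth map.
    fused_d = (var_new * d_cur + var_cur * d_new) / (var_cur + var_new)
    fused_var = (var_cur * var_new) / (var_cur + var_new)

    # Gate high-uncertainty measurements so they cannot degrade the map.
    reliable = var_new < var_gate
    d_out = np.where(reliable, fused_d, d_cur)
    var_out = np.where(reliable, fused_var, var_cur)
    return d_out, var_out
```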
