G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

Abstract

We introduce G-CUT3R, a novel feed-forward approach to guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods, which rely solely on input images, our method leverages auxiliary data that is commonly available in real-world scenarios, such as depth, camera calibrations, or camera positions. We propose a lightweight modification to CUT3R: a dedicated encoder for each modality extracts features, which are then fused with the RGB image tokens via zero convolution. This flexible design allows any combination of priors to be integrated seamlessly at inference time. Evaluated on multiple benchmarks covering 3D reconstruction and other multi-view tasks, our approach shows significant performance improvements, demonstrating that it uses the available priors effectively while remaining compatible with varying input modalities.
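The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that fusion is additive and that a zero convolution over a token sequence can be modelled as a per-token linear map whose weights and bias start at zero. The names ZeroLinear and fuse, the token dimensions, and the prior names "depth" and "pose" are all hypothetical, and plain Python lists stand in for tensors.

```python
from typing import Dict, List, Optional

Tokens = List[List[float]]  # [num_tokens][dim]


class ZeroLinear:
    """Per-token linear map with weights and bias initialised to zero.

    Assumption: on a token sequence this plays the role of a 1x1
    "zero convolution".
    """

    def __init__(self, in_dim: int, out_dim: int) -> None:
        self.weight = [[0.0] * in_dim for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, tokens: Tokens) -> Tokens:
        return [
            [sum(w * x for w, x in zip(row, tok)) + b
             for row, b in zip(self.weight, self.bias)]
            for tok in tokens
        ]


def fuse(rgb_tokens: Tokens,
         prior_features: Dict[str, Optional[Tokens]],
         zero_layers: Dict[str, ZeroLinear]) -> Tokens:
    """Add each available prior's projected features to the RGB tokens.

    Additive fusion is an assumption made for this sketch.
    """
    fused = [list(tok) for tok in rgb_tokens]  # copy, leave input untouched
    for name, feats in prior_features.items():
        if feats is None:  # this prior was not provided at inference time
            continue
        projected = zero_layers[name](feats)
        fused = [[f + p for f, p in zip(ft, pt)]
                 for ft, pt in zip(fused, projected)]
    return fused


# Two RGB tokens of dimension 2; prior features of dimension 3 (arbitrary).
rgb = [[1.0, 2.0], [3.0, 4.0]]
layers = {"depth": ZeroLinear(3, 2), "pose": ZeroLinear(3, 2)}
priors = {"depth": [[0.5, 0.1, 0.2], [0.3, 0.9, 0.4]], "pose": None}

# With zero-initialised weights the fused tokens equal the RGB tokens.
out_init = fuse(rgb, priors, layers)

# Stand-in for a training update: one weight becomes non-zero,
# so the depth prior now shifts the first channel of each token.
layers["depth"].weight[0][0] = 2.0
out_trained = fuse(rgb, priors, layers)
```

Under these assumptions, the zero initialisation means the modified model initially behaves exactly like the image-only one, and each prior contributes only once its projection has been trained. Skipping priors that are None mirrors the claim that any combination of priors can be supplied at inference time.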