G-CUT3R: Guided 3D Reconstruction with Camera and Depth Prior Integration

Abstract

We introduce G-CUT3R, a novel feed-forward approach for guided 3D scene reconstruction that enhances the CUT3R model by integrating prior information. Unlike existing feed-forward methods that rely solely on input images, our method leverages auxiliary data, such as depth, camera calibrations, or camera positions, that are commonly available in real-world scenarios. We propose a lightweight modification to CUT3R: a dedicated encoder for each modality extracts features, which are then fused with the RGB image tokens via zero convolution. This flexible design enables seamless integration of any combination of prior information during inference. Evaluated across multiple benchmarks, including 3D reconstruction and other multi-view tasks, our approach demonstrates significant performance improvements, showing that it can effectively utilize available priors while remaining compatible with varying input modalities.
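The fusion step described above can be illustrated with a minimal NumPy sketch. It assumes that "zero convolution" means what the term commonly denotes: a 1x1 convolution (a per-token linear map) whose weights and bias are initialized to zero, so that the fused tokens equal the RGB tokens before any training. The names ZeroConv and fuse, the token shapes, and the modality keys are hypothetical and chosen for illustration only; the abstract does not specify the encoders or where in CUT3R the fusion takes place.

```python
import numpy as np


class ZeroConv:
    """Per-token linear map (equivalent to a 1x1 convolution), zero-initialized.

    Hypothetical helper for illustration; not the authors' implementation.
    """

    def __init__(self, dim):
        self.weight = np.zeros((dim, dim))
        self.bias = np.zeros(dim)

    def __call__(self, tokens):
        # tokens: (num_tokens, dim) -> (num_tokens, dim)
        return tokens @ self.weight.T + self.bias


def fuse(rgb_tokens, prior_feats, zero_convs):
    """Add the projected features of whichever priors are present to the RGB tokens.

    prior_feats maps a modality name to features of shape (num_tokens, dim).
    Missing modalities are simply skipped, so any subset can be supplied.
    """
    fused = rgb_tokens.copy()
    for name, feats in prior_feats.items():
        fused = fused + zero_convs[name](feats)
    return fused


rng = np.random.default_rng(0)
num_tokens, dim = 4, 8

# Stand-ins for the outputs of the RGB encoder and a depth encoder (assumed shapes).
rgb_tokens = rng.normal(size=(num_tokens, dim))
depth_feats = rng.normal(size=(num_tokens, dim))

# One zero convolution per modality (assumed modality names).
zero_convs = {name: ZeroConv(dim) for name in ("depth", "intrinsics", "pose")}

# At initialization the prior contributes nothing: output equals the RGB tokens.
fused_at_init = fuse(rgb_tokens, {"depth": depth_feats}, zero_convs)
print(np.allclose(fused_at_init, rgb_tokens))

# With no priors supplied, the model falls back to plain RGB tokens.
fused_no_prior = fuse(rgb_tokens, {}, zero_convs)
print(np.allclose(fused_no_prior, rgb_tokens))
```

Because each projection starts at zero, the modified network initially behaves exactly like the unmodified RGB-only model, and training can gradually learn how much each prior should contribute. Summing over only the modalities that are present is one simple way to support arbitrary combinations of priors at inference time.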