WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Abstract

We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without a large computational cost. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
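The abstract names two mechanisms, a sliding window over the frame stream and a global camera token pool, without giving details. The sketch below only illustrates the bookkeeping such a design might involve and is not the authors' implementation. The window size, the stride, the overlap between windows, and the placeholder `("cam", index)` token are all assumptions made for illustration. The names `sliding_windows` and `stream_with_token_pool` are hypothetical.

```python
from typing import Iterator, List, Tuple


def sliding_windows(num_frames: int, size: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield [start, end) index ranges that cover a stream of num_frames frames."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    start = 0
    while start < num_frames:
        end = min(start + size, num_frames)
        yield start, end
        if end == num_frames:
            break  # the last window already reaches the end of the stream
        start += stride


def stream_with_token_pool(num_frames: int, size: int, stride: int) -> List[Tuple[int, int, int]]:
    """Process frames window by window while growing a global camera token pool.

    Returns (start, end, pool_size) per window. A real model would exchange
    information among the frames of a window and let pose prediction consult
    the pool; here a token is just a placeholder tuple (an assumption).
    """
    pool: List[Tuple[str, int]] = []  # one compact token per frame seen so far
    log: List[Tuple[int, int, int]] = []
    for start, end in sliding_windows(num_frames, size, stride):
        # Only frames that have not contributed a token yet are appended,
        # so overlapping windows do not duplicate pool entries.
        for index in range(max(start, len(pool)), end):
            pool.append(("cam", index))
        log.append((start, end, len(pool)))
    return log


if __name__ == "__main__":
    for start, end, pool_size in stream_with_token_pool(7, size=4, stride=2):
        print(f"window [{start}, {end}) -> pool holds {pool_size} camera tokens")
```

In this sketch the per-window work is bounded by the window size, while the pool grows by one small token per frame, which is one way to read the abstract's claim that global camera context can be kept without sacrificing efficiency.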
