Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

Abstract

Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depth. In this work, we take the opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or fine-tuned on extensive dynamic datasets. Our code is publicly available for research purposes at https://easi3r.github.io/.
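To make the idea of reading motion cues from attention maps concrete, the following is a minimal toy sketch and not the Easi3R procedure itself. It assumes a cross-attention tensor of shape (layers, heads, queries, keys) has already been extracted from a network, and it assumes, purely for illustration, that tokens receiving little attention are candidates for dynamic content. The function name `dynamic_mask_from_attention`, the averaging scheme, and the threshold value are all hypothetical choices.

```python
import numpy as np


def dynamic_mask_from_attention(attn, height, width, threshold=0.5):
    """Toy illustration: turn stacked attention maps into a binary mask.

    attn: array of shape (layers, heads, queries, keys), where the number
          of keys equals height * width (one key token per image patch).
    Returns (score_map, mask), both of shape (height, width).
    """
    attn = np.asarray(attn, dtype=np.float64)
    if attn.ndim != 4 or attn.shape[3] != height * width:
        raise ValueError("expected shape (layers, heads, queries, height*width)")

    # Average the attention each key token receives over layers, heads,
    # and query tokens.
    received = attn.mean(axis=(0, 1, 2))

    # Min-max normalize to [0, 1]; a constant map yields all zeros.
    lo, hi = received.min(), received.max()
    if hi > lo:
        scores = (received - lo) / (hi - lo)
    else:
        scores = np.zeros_like(received)

    score_map = scores.reshape(height, width)

    # Assumption for this sketch: low received attention marks a
    # candidate dynamic region.
    mask = score_map < threshold
    return score_map, mask
```

A call such as `dynamic_mask_from_attention(attn, 2, 2)` returns the normalized score map together with a boolean mask of the same spatial size. In practice the threshold and the aggregation across layers would need tuning, and the paper's actual disentanglement of camera and object motion is more involved than this single averaging step.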
