Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
March 31, 2025
Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
cs.AI
Abstract
Recent advances in DUSt3R have enabled robust estimation of dense point
clouds and camera parameters of static scenes, leveraging Transformer network
architectures and direct supervision on large-scale 3D datasets. In contrast,
the limited scale and diversity of available 4D datasets present a major
bottleneck for training a highly generalizable 4D model. This constraint has
driven conventional 4D methods to fine-tune 3D models on scalable dynamic video
data with additional geometric priors such as optical flow and depths. In this
work, we take an opposite path and introduce Easi3R, a simple yet efficient
training-free method for 4D reconstruction. Our approach applies attention
adaptation during inference, eliminating the need for from-scratch pre-training
or network fine-tuning. We find that the attention layers in DUSt3R inherently
encode rich information about camera and object motion. By carefully
disentangling these attention maps, we achieve accurate dynamic region
segmentation, camera pose estimation, and 4D dense point map reconstruction.
Extensive experiments on real-world dynamic videos demonstrate that our
lightweight attention adaptation significantly outperforms previous
state-of-the-art methods that are trained or fine-tuned on extensive dynamic
datasets. Our code is publicly available for research purposes at
https://easi3r.github.io/.
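The abstract describes reading object motion out of DUSt3R's cross-attention maps to obtain a dynamic-region segmentation. The sketch below is not the authors' implementation; it only illustrates the general idea under stated assumptions: the attention maps are assumed to already be reduced to one per-pixel response per decoder layer, and the normalize-and-threshold rule is a placeholder heuristic, not the paper's exact aggregation.

```python
# Minimal illustrative sketch, assuming per-layer cross-attention responses of
# shape (L, H, W) are available from a DUSt3R-style decoder. All names, shapes,
# and the thresholding rule are assumptions for illustration only.
import torch


def dynamic_mask_from_attention(attn_maps: torch.Tensor,
                                threshold: float = 0.5) -> torch.Tensor:
    """Flag likely dynamic (moving) regions from cross-view attention.

    attn_maps: (L, H, W) attention responses, one map per decoder layer,
               assumed already reduced over heads and tokens.
    Returns a boolean (H, W) mask.
    """
    # Average over layers to get a single per-pixel attention response.
    mean_attn = attn_maps.mean(dim=0)

    # Normalize to [0, 1] for a comparable score across frames.
    norm = (mean_attn - mean_attn.min()) / (mean_attn.max() - mean_attn.min() + 1e-8)

    # Heuristic: pixels that attend poorly across views are candidates for
    # object motion, since static geometry tends to match consistently.
    return norm < threshold


if __name__ == "__main__":
    # Synthetic example: random attention maps for a 10-layer decoder.
    attn = torch.rand(10, 64, 64)
    mask = dynamic_mask_from_attention(attn)
    print(f"dynamic pixels: {mask.sum().item()} / {mask.numel()}")
```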