Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
March 31, 2025
Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
cs.AI
Abstract
Recent advances in DUSt3R have enabled robust estimation of dense point
clouds and camera parameters of static scenes, leveraging Transformer network
architectures and direct supervision on large-scale 3D datasets. In contrast,
the limited scale and diversity of available 4D datasets present a major
bottleneck for training a highly generalizable 4D model. This constraint has
driven conventional 4D methods to fine-tune 3D models on scalable dynamic video
data with additional geometric priors such as optical flow and depths. In this
work, we take an opposite path and introduce Easi3R, a simple yet efficient
training-free method for 4D reconstruction. Our approach applies attention
adaptation during inference, eliminating the need for from-scratch pre-training
or network fine-tuning. We find that the attention layers in DUSt3R inherently
encode rich information about camera and object motion. By carefully
disentangling these attention maps, we achieve accurate dynamic region
segmentation, camera pose estimation, and 4D dense point map reconstruction.
Extensive experiments on real-world dynamic videos demonstrate that our
lightweight attention adaptation significantly outperforms previous
state-of-the-art methods that are trained or fine-tuned on extensive dynamic
datasets. Our code is publicly available for research purposes at
https://easi3r.github.io/.
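The abstract describes reading object motion out of DUSt3R's cross-attention maps to obtain a dynamic-region segmentation. The sketch below is not the authors' implementation; it only illustrates the general idea under stated assumptions: the attention maps are assumed to already be reduced to one per-pixel response per decoder layer, and the normalize-and-threshold rule is a placeholder heuristic, not the paper's exact aggregation.

```python
# Minimal illustrative sketch, assuming per-layer cross-attention responses of
# shape (L, H, W) are available from a DUSt3R-style decoder. All names, shapes,
# and the thresholding rule are assumptions for illustration only.
import torch


def dynamic_mask_from_attention(attn_maps: torch.Tensor,
                                threshold: float = 0.5) -> torch.Tensor:
    """Flag likely dynamic (moving) regions from cross-view attention.

    attn_maps: (L, H, W) attention responses, one map per decoder layer,
               assumed already reduced over heads and tokens.
    Returns a boolean (H, W) mask.
    """
    # Average over layers to get a single per-pixel attention response.
    mean_attn = attn_maps.mean(dim=0)

    # Normalize to [0, 1] for a comparable score across frames.
    norm = (mean_attn - mean_attn.min()) / (mean_attn.max() - mean_attn.min() + 1e-8)

    # Heuristic: pixels that attend poorly across views are candidates for
    # object motion, since static geometry tends to match consistently.
    return norm < threshold


if __name__ == "__main__":
    # Synthetic example: random attention maps for a 10-layer decoder.
    attn = torch.rand(10, 64, 64)
    mask = dynamic_mask_from_attention(attn)
    print(f"dynamic pixels: {mask.sum().item()} / {mask.numel()}")
```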