Fast Encoder-Based 3D from Casual Videos via Point Track Processing
April 10, 2024
Authors: Yoni Kasten, Wuyue Lu, Haggai Maron
cs.AI
Abstract
This paper addresses the long-standing challenge of reconstructing 3D
structures from videos with dynamic content. Current approaches to this problem
either were not designed to operate on casual videos recorded by standard
cameras or require long optimization times.
Aiming to significantly improve the efficiency of previous approaches, we
present TracksTo4D, a learning-based approach that enables inferring 3D
structure and camera positions from dynamic content originating from casual
videos using a single efficient feed-forward pass. To achieve this, we propose
operating directly on 2D point tracks as input and design an architecture
tailored for processing them. Our proposed architecture is designed
with two key principles in mind: (1) it takes into account the inherent
symmetries present in the input point tracks data, and (2) it assumes that the
movement patterns can be effectively represented using a low-rank
approximation. TracksTo4D is trained in an unsupervised way on a dataset of
casual videos utilizing only the 2D point tracks extracted from the videos,
without any 3D supervision. Our experiments show that TracksTo4D can
reconstruct a temporal point cloud and camera positions of the underlying video
with accuracy comparable to state-of-the-art methods, while drastically
reducing runtime by up to 95%. We further show that TracksTo4D generalizes
well to unseen videos of unseen semantic categories at inference time.
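To make the low-rank assumption concrete, below is a minimal, self-contained NumPy sketch, not the TracksTo4D architecture itself; all shapes, variable names, and the synthetic data are illustrative assumptions. It stacks 2D point tracks from F frames into a 2F x N measurement matrix and shows that trajectories generated by a small shared motion basis are captured almost exactly by a rank-K truncated SVD.

```python
import numpy as np

# Illustrative low-rank motion model (assumed shapes, not from the paper):
# each point's 2D trajectory is a linear combination of K shared basis
# trajectories, so the stacked (2F x N) track matrix has rank ~K.
F, N, K = 60, 500, 12          # frames, tracked points, motion-basis size

rng = np.random.default_rng(0)
basis = rng.standard_normal((2 * F, K))    # K basis trajectories over F frames
coeff = rng.standard_normal((K, N))        # per-point mixing coefficients
tracks = basis @ coeff + 0.01 * rng.standard_normal((2 * F, N))  # + noise

# Truncated SVD recovers the best rank-K approximation of the track matrix.
U, S, Vt = np.linalg.svd(tracks, full_matrices=False)
tracks_lowrank = (U[:, :K] * S[:K]) @ Vt[:K]

residual = np.linalg.norm(tracks - tracks_lowrank) / np.linalg.norm(tracks)
print(f"relative error of rank-{K} approximation: {residual:.4f}")
```

The SVD here only illustrates why a small number of motion basis elements can suffice for point-track data; per the abstract, TracksTo4D instead learns to predict structure and cameras from the raw tracks in a single feed-forward pass rather than fitting a factorization per video.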