Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach

Abstract

We present a novel video generation framework that integrates three-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling it to track 2D objects with 3D Cartesian coordinates. Building on this, we regularize the shape and motion of objects in the video to eliminate undesired artifacts, e.g., nonphysical deformation. Consequently, we enhance the quality of generated RGB videos and alleviate common issues such as object morphing, which are prevalent in current video models due to a lack of shape awareness. With our 3D augmentation and regularization, our model is capable of handling contact-rich scenarios such as task-oriented videos. These videos involve complex interactions of solids, where 3D information is essential for perceiving deformation and contact. Furthermore, our model improves the overall quality of video generation by promoting the 3D consistency of moving objects and reducing abrupt changes in shape and motion.