Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach

Abstract

We present a novel video generation framework that integrates 3D geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling it to track 2D objects with 3D Cartesian coordinates. Building on this, we regularize the shape and motion of objects in the video to eliminate undesired artifacts, e.g., nonphysical deformation. Consequently, we enhance the quality of generated RGB videos and alleviate common issues such as object morphing, which are prevalent in current video models due to a lack of shape awareness. With our 3D augmentation and regularization, our model is capable of handling contact-rich scenarios such as task-oriented videos. These videos involve complex interactions of solids, where 3D information is essential for perceiving deformation and contact. Furthermore, our model improves the overall quality of video generation by promoting the 3D consistency of moving objects and reducing abrupt changes in shape and motion.
