TC4D: Trajectory-Conditioned Text-to-4D Generation

Abstract

Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate: they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box using rigid transformation along a trajectory parameterized by a spline. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: https://sherwinbahmani.github.io/tc4d.
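
The global/local factorization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the spline type, the rotation convention, or the deformation model. The example assumes a uniform Catmull-Rom spline, a yaw-only rotation about the z axis that turns the object's local +x axis toward the direction of travel, and a hypothetical `local_deform` callable standing in for the learned local deformation. All function and parameter names are illustrative.

```python
import math

def catmull_rom(p0, p1, p2, p3, u):
    """Point on a uniform Catmull-Rom segment from p1 (u=0) to p2 (u=1)."""
    return tuple(
        0.5 * (2 * b + (c - a) * u
               + (2 * a - 5 * b + 4 * c - d) * u ** 2
               + (-a + 3 * b - 3 * c + d) * u ** 3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

def trajectory(control_points, t):
    """Map t in [0, 1] to a position on the spline through control_points.

    Assumption: at least two control points; end segments reuse the
    endpoint as the missing neighbor.
    """
    n_seg = len(control_points) - 1
    s = min(max(t, 0.0), 1.0) * n_seg
    i = min(int(s), n_seg - 1)
    idx = (max(i - 1, 0), i, i + 1, min(i + 2, n_seg))
    p0, p1, p2, p3 = (control_points[k] for k in idx)
    return catmull_rom(p0, p1, p2, p3, s - i)

def global_transform(points, control_points, t, local_deform=None, eps=1e-3):
    """Place object-space points on the trajectory at time t.

    Local motion (hypothetical `local_deform(point, t)`) is applied first in
    object space; the result is then moved rigidly: rotated about z to face
    the trajectory tangent and translated to the trajectory position.
    """
    if local_deform is not None:
        points = [local_deform(p, t) for p in points]
    # Finite-difference tangent in the x-y plane gives the heading angle.
    ax, ay, _ = trajectory(control_points, max(t - eps, 0.0))
    bx, by, _ = trajectory(control_points, min(t + eps, 1.0))
    yaw = math.atan2(by - ay, bx - ax)
    c, s = math.cos(yaw), math.sin(yaw)
    ox, oy, oz = trajectory(control_points, t)
    return [(c * x - s * y + ox, s * x + c * y + oy, z + oz)
            for x, y, z in points]
```

Because the global step is a rotation plus a translation, distances between points are preserved by it; any non-rigid change in shape comes only from the local deformation. In this sketch the bounding box (and whatever is rendered inside it) can travel arbitrarily far along the spline while the local motion stays confined to object space.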
