Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

Abstract

Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can make objects in the image leave the scene naturally, or supply new identity references that enter the scene, with both guided by a user-specified motion trajectory. To support this task, we introduce a new semi-automatically curated dataset, a comprehensive evaluation protocol targeting this setting, and an efficient, identity-preserving, motion-controllable video Diffusion Transformer architecture. Our evaluation shows that the proposed approach significantly outperforms existing baselines.
