

Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

May 27, 2025
作者: Boyang Wang, Xuweiyi Chen, Matheus Gadelha, Zezhou Cheng
cs.AI

Abstract
Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control objects in the image to naturally leave the scene, or introduce entirely new identity references that enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a new semi-automatically curated dataset, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture. Our evaluation shows that the proposed approach significantly outperforms existing baselines.
