SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

Abstract

The narrative quality of a video fundamentally determines its perceptual value. Existing video generation methods can produce visually appealing content, but they rely predominantly on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that strengthens the narrative capability of video generation models by conditioning on multiple keyframes. SmartDirector supports flexible generation scenarios, including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the input keyframes, and Director-SR refines that output, using the high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.
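
To make the idea of multi-keyframe conditioning concrete, the following is a minimal sketch of how keyframes might be pinned to positions on the output timeline. It is not the SmartDirector implementation, which the abstract does not describe at this level; the names `Keyframe`, `build_conditioning`, and `gaps_between_keyframes` are hypothetical, and the slot-plus-mask layout is simply one common way to represent sparse frame conditions. First/last-frame conditioning is the special case with keyframes only at the first and last index.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Keyframe:
    """A user-provided image pinned to a position on the output timeline (hypothetical type)."""
    frame_index: int  # position of the keyframe in the generated clip
    image_id: str     # placeholder standing in for the actual pixel data


def build_conditioning(
    num_frames: int, keyframes: Sequence[Keyframe]
) -> Tuple[List[Optional[str]], List[int]]:
    """Return per-frame condition slots and a binary mask (1 = keyframe provided)."""
    slots: List[Optional[str]] = [None] * num_frames
    mask: List[int] = [0] * num_frames
    for kf in keyframes:
        if not 0 <= kf.frame_index < num_frames:
            raise ValueError(f"frame_index {kf.frame_index} is outside the clip")
        if mask[kf.frame_index]:
            raise ValueError(f"two keyframes share frame_index {kf.frame_index}")
        slots[kf.frame_index] = kf.image_id
        mask[kf.frame_index] = 1
    return slots, mask


def gaps_between_keyframes(mask: Sequence[int]) -> List[int]:
    """Count the frames a generator would have to synthesize between consecutive keyframes."""
    positions = [i for i, flag in enumerate(mask) if flag]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    # Three keyframes in a 16-frame clip: a quick first transition, then a slower one.
    slots, mask = build_conditioning(
        16, [Keyframe(0, "open"), Keyframe(4, "turn"), Keyframe(15, "close")]
    )
    print(mask)
    print(gaps_between_keyframes(mask))
```

Under this assumed representation, pacing is expressed by keyframe placement: moving the middle keyframe earlier shortens the first transition and lengthens the second, which is a degree of control that text prompts or first/last frames alone do not offer.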