Bridging Your Imagination with Audio-Video Generation via a Unified Director

Abstract

Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that turns user prompts into well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ a Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which uses interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.
