Bridging Your Imagination with Audio-Video Generation via a Unified Director
December 29, 2025
Authors: Jiaxu Zhang, Tianshu Hu, Yuan Zhang, Zenan Li, Linjie Luo, Guosheng Lin, Xin Chen
cs.AI
Abstract
Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ a Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.
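The "first interleaving, then disentangling" paradigm can be pictured as two stages over the same data: stage one trains on interleaved text-image sequences, and stage two routes each modality to a dedicated expert branch. The sketch below is purely illustrative and hedged: the names (`Chunk`, `interleave`, `route_to_expert`, the expert labels) are hypothetical and not taken from the UniMAGE codebase; it only shows the data-flow idea, not the model itself.

```python
# Hypothetical sketch of the two-stage schedule described in the abstract.
# All identifiers here are illustrative assumptions, not the authors' API.

from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    modality: str   # "text" (a script line) or "image" (a keyframe)
    payload: str


def interleave(script_lines: List[str], keyframes: List[str]) -> List[Chunk]:
    """Stage 1 (Interleaved Concept Learning): build interleaved text-image
    sequences so a single model sees narrative and visuals jointly."""
    out: List[Chunk] = []
    for line, frame in zip(script_lines, keyframes):
        out.append(Chunk("text", line))
        out.append(Chunk("image", frame))
    return out


def route_to_expert(chunk: Chunk) -> str:
    """Stage 2 (Disentangled Expert Learning): each modality is handled by a
    dedicated expert, decoupling script writing from keyframe generation."""
    return "script_expert" if chunk.modality == "text" else "keyframe_expert"


seq = interleave(["Shot 1: the hero enters the hall"], ["keyframe_001.png"])
routes = [route_to_expert(c) for c in seq]
# routes pairs each interleaved chunk with its expert:
# ["script_expert", "keyframe_expert"]
```

In this toy view, stage one optimizes one shared backbone over the interleaved sequence, while stage two would freeze or specialize parameters per route; the routing function is where the disentanglement decision lives.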