Bridging Your Imagination with Audio-Video Generation via a Unified Director
December 29, 2025
Authors: Jiaxu Zhang, Tianshu Hu, Yuan Zhang, Zenan Li, Linjie Luo, Guosheng Lin, Xin Chen
cs.AI
Abstract
Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a "first interleaving, then disentangling" training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.
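To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a Mixture-of-Transformers-style block in which self-attention is shared across an interleaved text-image token sequence while each modality routes through its own feed-forward expert, plus a toy `set_stage` helper mimicking the "first interleaving, then disentangling" schedule. All class names, dimensions, and the freezing policy are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MoTBlock(nn.Module):
    """Sketch of a Mixture-of-Transformers block (illustrative, not UniMAGE's code):
    attention is shared over the interleaved sequence; FFN experts are per-modality."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN expert per modality: 0 = text (script), 1 = image (keyframe)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(2)
        )

    def forward(self, x, modality_ids):
        # Shared self-attention over the whole interleaved text/image sequence
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        # Route each token to its modality's own expert
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out


def set_stage(block, stage):
    """Toy two-stage schedule: stage 1 trains everything jointly on interleaved
    data (Interleaved Concept Learning); stage 2 decouples the experts, here by
    freezing the text expert while the image expert specializes (an assumption,
    standing in for Disentangled Expert Learning)."""
    for p in block.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in block.experts[0].parameters():
            p.requires_grad = False
```

In this sketch the shared attention lets script tokens condition keyframe tokens (and vice versa), which is what the interleaved stage exploits, while the per-modality experts are the components the disentangled stage can train separately.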