The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
January 25, 2026
Authors: Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus
cs.AI
Abstract
Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.