The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Abstract

Recent advances in video generation have produced models that can synthesize stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts such as dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows that our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current state-of-the-art models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
