A Survey of Long-Form Narrative Video Generation: Architecture, Coherence, and Cinematic Quality

Abstract

Despite significant progress in video generative models, existing state-of-the-art methods can only produce videos lasting 5 to 16 seconds, which are nonetheless often labeled "long-form videos". Videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative, and multi-subject long videos in particular still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, their outputs often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We survey 32 papers on video generation to identify the key architectural components and training strategies that consistently yield these qualities. We also construct a novel taxonomy of existing methods and present comparative tables that categorize the papers by architectural design and performance characteristics.