A Survey of Long-Form Narrative Video Generation: Architectures, Consistency, and Cinematic Quality

Abstract

Despite significant progress in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, which are nonetheless often labeled "long-form videos". Beyond 16 seconds, generated videos struggle to keep character appearances and scene layouts consistent throughout the narrative, and multi-subject long videos in particular still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We survey 32 papers on video generation to identify the key architectural components and training strategies that consistently yield these qualities. We also construct a novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.