
InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

March 4, 2026
Authors: Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen, Subhojyoti Mukherjee, Yang Zhou, Gang Wu, Viet Dac Lai, Seunghyun Yoon, Ryan Rossi, Abdullah Rashwan, Puneet Mathur, Varun Manjunatha, Daksh Dangi, Chien Nguyen, Nedim Lipka, Trung Bui, Krishna Kumar Singh, Ruiyi Zhang, Xiaolei Huang, Jaemin Cho, Yu Wang, Namyong Park, Zhengzhong Tu, Hongjie Chen, Hoda Eldardiry, Nesreen Ahmed, Thien Nguyen, Dinesh Manocha, Mohamed Elhoseiny, Franck Dernoncourt
cs.AI

Abstract

Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, dataset, and model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios in which multiple subjects enter or exit the frame, going beyond the single-subject limitation of prior work. To support this, we contribute a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic scene compositions. On VBench, InfinityStory achieves the highest Background Consistency (88.94), the highest Subject Consistency (82.11), and the best overall average rank (2.80), showing improved stability, smoother transitions, and better temporal coherence.