
InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

March 4, 2026
Authors: Mohamed Elmoghany, Liangbing Zhao, Xiaoqian Shen, Subhojyoti Mukherjee, Yang Zhou, Gang Wu, Viet Dac Lai, Seunghyun Yoon, Ryan Rossi, Abdullah Rashwan, Puneet Mathur, Varun Manjunatha, Daksh Dangi, Chien Nguyen, Nedim Lipka, Trung Bui, Krishna Kumar Singh, Ruiyi Zhang, Xiaolei Huang, Jaemin Cho, Yu Wang, Namyong Park, Zhengzhong Tu, Hongjie Chen, Hoda Eldardiry, Nesreen Ahmed, Thien Nguyen, Dinesh Manocha, Mohamed Elhoseiny, Franck Dernoncourt
cs.AI

Abstract

Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, dataset, and model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios in which multiple subjects enter or exit the frame, going beyond the single-subject limitation of prior work. To support this, we contribute a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic scene compositions. On VBench, InfinityStory achieves the highest Background Consistency (88.94), the highest Subject Consistency (82.11), and the best overall average rank (2.80), showing improved stability, smoother transitions, and better temporal coherence.