OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Abstract

Storytelling in real-world videos often unfolds through multiple shots: discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to model long-range cross-shot context because they rely on limited temporal windows or single-keyframe conditioning, which degrades performance under complex narratives. In this work, we propose OneStory, which models cross-shot context globally yet compactly for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically relevant global memory from informative frames of prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
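
The abstract gives no implementation details, so the following is only a minimal sketch of how the next-shot loop could fit together. It rests on assumptions that are not from the paper: frames are represented by plain embedding vectors, frame selection is approximated by top-k cosine similarity against an embedding of the next caption, and importance-guided patchification is approximated by giving higher-ranked frames a smaller patch size (hence more tokens). All function names and parameters (`select_frames`, `assign_patch_sizes`, `generate_story`, `embed_caption`, `generate_shot`, `k`, `patch_sizes`) are hypothetical stand-ins for the learned modules.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_frames(memory, query, k):
    """Assumed stand-in for frame selection.

    memory: list of (frame_id, embedding) pairs from all prior shots.
    Returns the k pairs (frame_id, score) most similar to the query,
    sorted from most to least relevant.
    """
    scored = [(fid, cosine(emb, query)) for fid, emb in memory]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def assign_patch_sizes(selected, patch_sizes=(2, 4, 8)):
    """Assumed stand-in for importance-guided patchification.

    Frames are bucketed by rank: the most relevant frames get the smallest
    patch size (more tokens), the least relevant get the largest (fewer tokens).
    """
    n = len(selected)
    sizes = {}
    for rank, (fid, _score) in enumerate(selected):
        bucket = min(rank * len(patch_sizes) // n, len(patch_sizes) - 1)
        sizes[fid] = patch_sizes[bucket]
    return sizes


def token_count(height, width, patch):
    """Number of non-overlapping patch tokens for a height x width frame."""
    return (height // patch) * (width // patch)


def generate_story(captions, embed_caption, generate_shot, k=3):
    """Autoregressive next-shot loop over a list of per-shot captions.

    embed_caption(caption) -> embedding and
    generate_shot(caption, selected, patch_sizes) -> list of (frame_id, embedding)
    are caller-supplied placeholders for the text encoder and the video model.
    """
    memory, shots = [], []
    for caption in captions:
        query = embed_caption(caption)
        selected = select_frames(memory, query, k)
        sizes = assign_patch_sizes(selected)
        frames = generate_shot(caption, selected, sizes)
        memory.extend(frames)  # every generated frame becomes future context
        shots.append(frames)
    return shots
```

In this sketch the context budget stays bounded as the story grows: at most `k` frames are passed to each shot, and lower-ranked frames contribute fewer tokens (for example, a 32x32 frame yields 256 tokens at patch size 2 but only 64 at patch size 4).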