OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

December 8, 2025
Authors: Zhaochong An, Menglin Jia, Haonan Qiu, Zijian Zhou, Xiaoke Huang, Zhiheng Liu, Weiming Ren, Kumara Kahatapitiya, Ding Liu, Sen He, Chenyang Zhang, Tao Xiang, Fanny Yang, Serge Belongie, Tian Xie
cs.AI

Abstract

Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically-relevant global memory based on informative frames from prior shots, and an Adaptive Conditioner that performs importance-guided patchification to generate compact context for direct conditioning. We further curate a high-quality multi-shot dataset with referential captions to mirror real-world storytelling patterns, and design effective training strategies under the next-shot paradigm. Finetuned from a pretrained I2V model on our curated 60K dataset, OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings, enabling controllable and immersive long-form video storytelling.
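The next-shot loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how frames are scored or how patch sizes are chosen, so the cosine-similarity ranking, the top-k cutoff, the two patch sizes, and every name below (`select_frames`, `patch_sizes`, `generate_story`, `generate_shot`) are assumptions made purely for illustration. Frames and captions are stood in for by small embedding vectors, and the video model is a caller-supplied function.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(memory: List[Vector], query: Vector, k: int) -> List[Tuple[int, float]]:
    """Hypothetical frame selection: the k memory frames most similar
    to the current caption embedding, best first, as (index, score)."""
    scored = [(i, cosine(frame, query)) for i, frame in enumerate(memory)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def patch_sizes(selected: List[Tuple[int, float]], fine: int = 2, coarse: int = 8) -> Dict[int, int]:
    """Hypothetical importance-guided patchification: the higher-ranked half
    of the selected frames gets a small patch size (more context tokens),
    the rest a large patch size (fewer tokens)."""
    cut = (len(selected) + 1) // 2
    return {idx: (fine if rank < cut else coarse)
            for rank, (idx, _score) in enumerate(selected)}


def generate_story(
    caption_embeddings: List[Vector],
    generate_shot: Callable[[Vector, List[Vector], Dict[int, int]], List[Vector]],
    k: int = 2,
) -> List[Dict[int, int]]:
    """Autoregressive next-shot loop: each shot is conditioned on frames
    selected from all earlier shots, then its own frames join the memory.
    Returns the patch-size plan used for each shot, for inspection."""
    memory: List[Vector] = []
    plans: List[Dict[int, int]] = []
    for query in caption_embeddings:
        selected = select_frames(memory, query, k)
        sizes = patch_sizes(selected)
        context = [memory[i] for i, _score in selected]
        new_frames = generate_shot(query, context, sizes)  # stand-in for the video model
        memory.extend(new_frames)
        plans.append(sizes)
    return plans
```

In this sketch the memory spans every earlier shot rather than a fixed temporal window, while the number of context frames stays bounded by `k`, which is the "global yet compact" trade-off the abstract describes.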
December 11, 2025