StoryMem: Multi-shot Long Video Storytelling with Memory
December 22, 2025
Authors: Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan
cs.AI
Abstract
Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact, dynamically updated memory bank of keyframes from previously generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts, requiring only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
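The memory-injection mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, tensor shapes, and the `rope_shift` value are hypothetical, and only the two stated ideas are modeled: memory keyframe latents are concatenated to the current shot's latents, and the memory tokens receive negative temporal position indices (the "negative RoPE shift") so they sit before the shot on the model's time axis.

```python
import torch

def build_conditioned_input(memory_latents, shot_latents, rope_shift=-16):
    """Illustrative sketch (not the paper's code) of Memory-to-Video
    conditioning.

    memory_latents: (M, C, H, W) latents of keyframes from past shots
    shot_latents:   (T, C, H, W) noisy latents of the shot being generated
    rope_shift:     hypothetical negative offset placing memory frames
                    "before" the current shot in RoPE temporal positions
    """
    # Latent concatenation along the temporal axis: memory keyframes
    # are prepended to the current shot's latent frames.
    x = torch.cat([memory_latents, shot_latents], dim=0)

    # Temporal position ids: memory frames get negative positions via
    # the RoPE shift; the current shot keeps positions 0..T-1.
    M = memory_latents.shape[0]
    T = shot_latents.shape[0]
    mem_pos = torch.arange(M) + rope_shift   # e.g. -16, -15, -14, ...
    shot_pos = torch.arange(T)               # 0, 1, ..., T-1
    positions = torch.cat([mem_pos, shot_pos])

    return x, positions

# Toy shapes, only to exercise the function.
mem = torch.randn(3, 4, 8, 8)    # 3 memory keyframes
shot = torch.randn(12, 4, 8, 8)  # 12 latent frames for the new shot
x, pos = build_conditioned_input(mem, shot)
print(x.shape)            # torch.Size([15, 4, 8, 8])
print(pos[:5].tolist())   # [-16, -15, -14, 0, 1]
```

In the actual system these position ids would feed the diffusion transformer's rotary embeddings, so the model distinguishes conditioning memory from the frames it is denoising without any extra attention modules, which is what makes LoRA-only fine-tuning plausible.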