StoryMem: Multi-shot Long Video Storytelling with Memory
December 22, 2025
Authors: Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan
cs.AI
Abstract
Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact, dynamically updated memory bank of keyframes from previously generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts, requiring only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
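The abstract's core mechanism — a compact memory bank of selected keyframes, injected into the diffusion model by concatenating memory latents with the current shot's latents under shifted (negative) RoPE positions — can be sketched as follows. This is a minimal illustration under assumptions: the class and function names, the capacity, the scoring-based selection (standing in for the paper's semantic selection and aesthetic filtering), and the `rope_shift` value are all hypothetical, not the authors' actual implementation.

```python
import torch


class MemoryBank:
    """Illustrative compact, dynamically updated bank of keyframe latents
    from previously generated shots. Names and sizes are assumptions."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.frames: list[torch.Tensor] = []  # each entry: (C, H, W) latent

    def update(self, shot_latents: torch.Tensor, scores: torch.Tensor) -> None:
        """Keep the highest-scoring keyframe of the new shot (a stand-in for
        semantic keyframe selection + aesthetic preference filtering),
        evicting the oldest entries once capacity is exceeded."""
        best = scores.argmax().item()
        self.frames.append(shot_latents[best])
        self.frames = self.frames[-self.capacity :]


def condition_on_memory(video_latents: torch.Tensor,
                        memory: MemoryBank,
                        rope_shift: int = -1000):
    """Prepend memory latents along the temporal axis (latent concatenation).
    Memory tokens get negative RoPE positions, marking them as context that
    precedes the current shot rather than frames to be denoised."""
    if not memory.frames:
        return video_latents, torch.arange(video_latents.shape[0])
    mem = torch.stack(memory.frames)                # (M, C, H, W)
    latents = torch.cat([mem, video_latents], dim=0)  # (M+T, C, H, W)
    m, t = mem.shape[0], video_latents.shape[0]
    positions = torch.cat([torch.arange(m) + rope_shift, torch.arange(t)])
    return latents, positions
```

In this sketch, each generated shot feeds one filtered keyframe back into the bank, so the conditioning context stays a fixed size regardless of how many shots have been produced — which is what allows iterative, minute-long generation without growing cost.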