Memento: Reconstruct to Remember for Consistent Long Video Generation

Abstract

Long-form video generation requires recurring subjects to remain consistent across shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly optimize for plausible next-shot continuations and do not verify whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem. It rests on the premise that a memory bank that faithfully preserves a subject should be able to reconstruct that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, which recovers the target appearance from historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism: one query retrieves identity-relevant memory, and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision through consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.
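
The dual-query idea can be illustrated with a small, purely hypothetical sketch. The abstract does not specify how memory entries are represented or scored, so everything below is an assumption made for illustration: the `MemoryEntry` type, the use of cosine similarity, the `window` of recent shots, and the function name `dual_query_retrieve` are not taken from Memento itself. The sketch only shows the separation of concerns: an identity query searches the entire history, while a context query is restricted to recent shots.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MemoryEntry:
    shot_index: int         # shot the keyframe was taken from
    embedding: List[float]  # hypothetical feature vector for the keyframe


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(entries: List[MemoryEntry], query: List[float], k: int) -> List[MemoryEntry]:
    """Return the k entries most similar to the query."""
    ranked = sorted(entries, key=lambda e: cosine(e.embedding, query), reverse=True)
    return ranked[:k]


def dual_query_retrieve(
    memory: List[MemoryEntry],
    identity_query: List[float],
    context_query: List[float],
    k_identity: int = 2,
    k_context: int = 2,
    window: int = 2,
) -> Tuple[List[MemoryEntry], List[MemoryEntry]]:
    """Assumed retrieval rule, not the paper's: two independent lookups."""
    if not memory:
        return [], []
    # Long-range lookup: identity evidence may sit anywhere in the history.
    identity_hits = top_k(memory, identity_query, k_identity)
    # Short-range lookup: only keyframes from the last `window` shots.
    latest = max(e.shot_index for e in memory)
    recent = [e for e in memory if e.shot_index > latest - window]
    context_hits = top_k(recent, context_query, k_context)
    return identity_hits, context_hits


# Toy usage with 2-D embeddings (values are made up).
memory = [
    MemoryEntry(0, [1.0, 0.0]),
    MemoryEntry(1, [0.0, 1.0]),
    MemoryEntry(2, [0.6, 0.8]),
    MemoryEntry(3, [0.0, 1.0]),
]
identity_hits, context_hits = dual_query_retrieve(
    memory, identity_query=[1.0, 0.0], context_query=[0.0, 1.0],
    k_identity=1, k_context=1,
)
```

In this toy run the identity query reaches back to shot 0, while the context query can only pick from shots 2 and 3. This mirrors the separation of long-range subject evidence from short-range cues described above, without claiming to reproduce the actual mechanism.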