Memento: Reconstructing to Remember for Consistent Long Video Generation

Abstract

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.
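
The dual-query idea and the joint objective can be illustrated with a toy sketch. This is not the authors' implementation, and it assumes several things the abstract does not state: memory entries are plain feature vectors, retrieval uses cosine similarity with top-k selection, "short context" means the last few entries, and the two training terms are combined as a weighted sum. All function names and parameters (`dual_query_retrieve`, `joint_loss`, `window`, `weight`) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, keys, indices, k):
    """Return up to k entries of `indices`, ranked by similarity to `query`."""
    ranked = sorted(indices, key=lambda i: cosine(query, keys[i]), reverse=True)
    return ranked[:k]

def dual_query_retrieve(keys, identity_query, context_query,
                        k_identity=2, k_context=1, window=2):
    """Hypothetical dual-query lookup over a memory bank of key vectors.

    The identity query searches the whole history (long-range subject
    evidence); the context query searches only the most recent `window`
    entries (short-range continuation cues). Both choices are assumptions.
    """
    all_idx = range(len(keys))
    recent_idx = range(max(0, len(keys) - window), len(keys))
    identity_hits = top_k(identity_query, keys, all_idx, k_identity)
    context_hits = top_k(context_query, keys, recent_idx, k_context)
    return identity_hits, context_hits

def joint_loss(next_shot_loss, reconstruction_loss, weight=1.0):
    """Assumed weighted sum of the generation and reconstruction terms."""
    return next_shot_loss + weight * reconstruction_loss

# Toy memory bank: one key vector per stored shot (values are made up).
keys = [
    [1.0, 0.0],  # shot 0: subject clearly visible
    [0.0, 1.0],  # shot 1: scenery only
    [0.9, 0.1],  # shot 2: subject again
    [0.1, 0.9],  # shot 3: scenery, most recent shot
]
identity_hits, context_hits = dual_query_retrieve(
    keys, identity_query=[1.0, 0.0], context_query=[0.0, 1.0]
)
```

In this toy setup the identity query reaches back to shots 0 and 2, where the subject appears, while the context query only considers the two most recent shots. That separation is the point of the mechanism as the abstract describes it; the similarity measure and selection rule in an actual system may differ.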