Memento: 一貫した長動画生成に向けた想起のための再構築

要旨

長尺動画生成では、登場する被写体が様々なショット、視点、動き、シーン遷移にわたって一貫している必要がある。既存の時間的分解手法は、ショットごとに動画を生成することでスケーラビリティを向上させる。しかし、それらは主に次のショットの妥当な継続を最適化することに焦点を当てており、過去の記憶が同一性に重要な被写体の証拠を保持しているかを検証しない。その結果、生成が進むにつれて、繰り返し登場する被写体が薄められ、上書きされ、または忘れられる可能性がある。本論文では、被写体の保存を明示的な同一性基盤問題として扱う、被写体再構成誘導型フレームワークMementoを提案する。これは、被写体を忠実に保存するメモリバンクが、メモリのみからその被写体を再構成できるはずであるという前提に基づく。具体的には、Mementoは自己回帰的な次ショット生成とメモリベースの被写体再構成を同時に学習し、過去のメモリと全体のストーリーキャプションを用いて目標の外観を復元する。長期的な被写体の証拠と短期的な手がかりを分離するために、Mementoはデュアルクエリメモリ機構を導入する。一方のクエリは同一性に関連するメモリを取得し、もう一方は一貫した継続のために短期コンテキストのキーフレームを選択する。さらに、被写体認識型のシネマティックデータパイプラインが、一貫した代名詞のない被写体記述を通じて精密な再構成の教師信号を提供する。実験により、Mementoは長期的な被写体の一貫性、ショット間のコヒーレンス、視覚的品質において最先端の性能を達成することを示す。

English

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.