Memento: 일관된 긴 비디오 생성을 위한 기억 재구성

초록

긴 형식 비디오 생성에서는 다양한 샷, 시점, 동작 및 장면 전환 전반에 걸쳐 재발하는 주제가 일관성을 유지해야 한다. 기존의 시간적 분해 방법은 비디오를 샷 단위로 생성하여 확장성을 개선한다. 그러나 이러한 방법들은 주로 역사적 기억이 정체성-중요 주제 증거를 보존하는지 확인하지 않은 채, 그럴듯한 다음 샷 연속을 최적화하는 데 초점을 맞춘다. 결과적으로 생성이 진행됨에 따라 재발하는 주제가 희석되거나, 덮어쓰여지거나, 망각될 수 있다. 본 논문에서는 기억 은행이 주제를 충실히 보존한다면 그 주제를 기억만으로 재구성할 수 있어야 한다는 전제에 기반하여, 주제 보존을 명시적 정체성 기반 문제로 취급하는 주제 재구성 유도 프레임워크인 Memento를 제안한다. 구체적으로, Memento는 자기회귀적 다음 샷 생성과 기억 기반 주제 재구성을 공동으로 학습하며, 역사적 기억과 글로벌 스토리 캡션을 사용하여 대상 외관을 복원한다. 장거리 주제 증거를 단거리 단서로부터 분리하기 위해, Memento는 이중 질의 기억 메커니즘을 도입한다. 여기서 하나의 질의는 정체성 관련 기억을 검색하고, 다른 질의는 일관된 연속을 위해 짧은 맥락 키프레임을 선택한다. 또한, 주제 인식 영화 데이터 파이프라인은 일관되고 대명사 없는 주제 설명을 통해 정밀한 재구성 감독을 제공한다. 실험 결과, Memento는 장기 주제 일관성, 샷 간 일관성, 시각적 품질에서 최첨단 성능을 달성함을 보여준다.

English

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.