AdaState: 스트리밍 비디오 생성을 위한 자기 진화 앵커

초록

자기회귀 비디오 확산 모델은 프레임을 순차적으로 생성하고 각 청크를 이전에 생성된 콘텐츠에 조건부로 두면서 스트리밍 비디오를 생성한다. 이러한 모델은 구조적으로 첫 번째 프레임에 고정되어 있다. 즉, 첫 번째 프레임의 키-값 표현은 어텐션 캐시 내에서 특권적 위치를 차지하며, 생성 과정 전반에 걸쳐 주요 장면 참조 역할을 한다. 캐시 내에서 가장 깨끗하고 오류가 없는 위치인 이 앵커는 불균형적인 주의를 끌어 비디오 역동성을 억제하고, 장면이 자연스럽게 진화하더라도 장면 구성을 초기 시점에 고정시킨다. 그 결과 움직임, 카메라 이동 및 장면 진행이 정적 일관성을 위해 억제된 시간적으로 얕은 비디오가 생성된다. 이 문제를 해결하기 위해, 우리는 정적 앵커를 적응형 상태, 즉 모델이 매 청크마다 콘텐츠와 함께 잡음 제거를 수행하지만 결코 렌더링하지 않는 숨겨진 잠재 변수로 대체한다. 모델은 고정된 첫 번째 프레임을 참조하는 대신, 이전 상태와 현재 콘텐츠 모두에 주의를 기울여 각 단계에서 자체 장면 앵커를 생성하며, 이는 생성된 콘텐츠와 함께 진화하는 참조를 만들어낸다. 절대적 시간 개념을 부호화하는 표준 비디오 생성과 달리, 우리의 공식은 시간을 상대적으로 취급한다. 즉, 모든 생성 단계는 생성이 얼마나 진행되었는지와 관계없이 동일한 위치 구조를 보며, 상태 전환은 모든 청크에서 동일하다. 이러한 속성들은 공동으로 생성 과정에 재귀성을 도입하며, 여기서 잡음 제거는 전이 함수 역할을 하고 KV 캐시는 전달자 역할을 하여 외부 모듈이 필요하지 않다. 실험 결과는 적응형 상태가 비디오 역동성을 크게 개선하여 생성된 비디오 내에서 더 풍부한 움직임과 자연스러운 장면 진행을 가능하게 함을 보여준다.

English

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics, and locking scene composition to the initial viewpoint even as the scene naturally evolves. The result is a temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an adaptive state, a hidden latent that the model denoises alongside content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated content. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process, where denoising serves as the transition function, and the KV cache serves as the carrier, requiring no external module. Experiments demonstrate that the adaptive state substantially improves video dynamics, enabling richer motion and natural scene progression within generated videos.