TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation

Abstract

Autoregressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, in which errors accumulate and amplify over long horizons. We hypothesize that this drift stems primarily from inference-time error propagation rather than from insufficient model capacity. Specifically, we contend that drift arises when corrupted latent conditioning tokens are reused without any check during autoregressive inference. To counter this accumulation of errors, we propose a simple inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. We define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, which indicates potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the autoregressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture or training procedure, and without leaving latent space.
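The definition of unstable tokens above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not the method's actual implementation: the abstract does not state which deviation measure is used, so the sketch uses a relative L2 distance; the `threshold` value is arbitrary; the function names `unstable_token_mask` and `prune_context` are hypothetical; and it assumes token i in the current batch corresponds to token i in the previous batch.

```python
import numpy as np


def unstable_token_mask(prev_latents, curr_latents, threshold=0.5):
    """Flag tokens whose latent vector moved far from the previous batch.

    Both inputs have shape (num_tokens, dim). Token i in one array is
    assumed to correspond to token i in the other (an assumption).
    Returns a boolean array of shape (num_tokens,), True meaning unstable.
    """
    prev = np.asarray(prev_latents, dtype=np.float64)
    curr = np.asarray(curr_latents, dtype=np.float64)
    if prev.ndim != 2 or prev.shape != curr.shape:
        raise ValueError("expected two arrays of identical shape (num_tokens, dim)")
    # Assumed deviation measure: L2 distance relative to the previous norm.
    eps = 1e-8  # avoids division by zero for all-zero tokens
    distance = np.linalg.norm(curr - prev, axis=1)
    deviation = distance / (np.linalg.norm(prev, axis=1) + eps)
    return deviation > threshold


def prune_context(curr_latents, unstable):
    """Drop unstable tokens; return the kept tokens and their original indices."""
    keep = np.flatnonzero(~np.asarray(unstable, dtype=bool))
    return np.asarray(curr_latents)[keep], keep


# Toy usage: four tokens, one of which is perturbed to simulate corruption.
prev = np.ones((4, 3))
curr = prev.copy()
curr[2] = [5.0, -4.0, 3.0]
mask = unstable_token_mask(prev, curr, threshold=0.5)
context, kept = prune_context(curr, mask)
```

Only the tokens in `context` would then be passed on as conditioning for the next batch. How a real model consumes a pruned context (dropping the tokens outright versus masking them in attention) is not specified in the abstract and is left open here.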
