TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation

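Abstract

Autoregressive video generation enables long video synthesis by generating each new group of frames conditioned on the frames generated so far. Recent work, however, has shown that such pipelines suffer from severe temporal drift, in which errors accumulate and amplify over long horizons. We hypothesize that the main cause of this drift is not insufficient model capacity but error propagation at inference time. Specifically, we argue that drift arises because corrupted latent conditioning tokens are reused without control during autoregressive inference. To correct this error accumulation, we propose a simple inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. We define unstable tokens as latent tokens whose representations deviate significantly from those of the immediately preceding group of frames, indicating possible corruption or semantic drift. By explicitly removing corrupted latent tokens from the autoregressive context, rather than modifying entire spatial regions or model parameters, the method prevents unreliable latent information from influencing future generation steps. As a result, it substantially improves long-horizon temporal consistency without modifying the model architecture or training procedure, and without leaving latent space.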

English

Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture or training procedure, and without leaving latent space.
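
The pruning step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the deviation measure, the threshold, or how tokens are aligned between batches, so everything below is an assumption made for illustration: tokens are compared position by position, deviation is measured as cosine distance, and the names cosine_distance, stable_token_mask, prune_context, and the threshold value 0.5 are hypothetical, not taken from the paper.

```python
import math
from typing import List, Sequence

Token = Sequence[float]


def cosine_distance(a: Token, b: Token) -> float:
    """Return 1 - cosine similarity (0 = same direction, 2 = opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector carries no direction, so treat it as distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def stable_token_mask(
    prev_batch: Sequence[Token],
    new_batch: Sequence[Token],
    threshold: float = 0.5,  # hypothetical value, not from the paper
) -> List[bool]:
    """Mark a token as stable if it stays close to the token at the same
    position in the previously generated batch."""
    if len(prev_batch) != len(new_batch):
        raise ValueError("batches must contain the same number of tokens")
    return [
        cosine_distance(p, n) <= threshold
        for p, n in zip(prev_batch, new_batch)
    ]


def prune_context(new_batch: Sequence[Token], mask: Sequence[bool]) -> List[Token]:
    """Keep only the stable tokens for reuse as conditioning context."""
    return [tok for tok, keep in zip(new_batch, mask) if keep]


if __name__ == "__main__":
    prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    new = [[0.9, 0.1], [1.0, 0.0], [1.0, 1.1]]
    mask = stable_token_mask(prev, new)
    print(mask)                      # the second token flipped direction
    print(prune_context(new, mask))  # only the stable tokens remain
```

In this toy input the second token changes direction completely relative to the previous batch, so it is the only one dropped from the context. A real pipeline would operate on latent tensors and would need to decide how the model attends over a context with missing tokens, which this sketch does not address.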
