HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Abstract

Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts, yet this practice propagates prediction errors with high certainty and thereby exacerbates degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8× wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20-second generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
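The reversed generation order can be made concrete with a small scheduling sketch. This is not the authors' implementation. The function names, the `(block, step)` task representation, and in particular the dependency rule used for pipelining are assumptions made here for illustration, inferred only from the abstract's description.

```python
from collections import defaultdict


def block_first_order(num_blocks, num_steps):
    """Conventional AR order: run every denoising step of a block
    before moving on to the next block."""
    return [(b, s) for b in range(num_blocks) for s in range(num_steps)]


def step_first_order(num_blocks, num_steps):
    """Hierarchical order as described in the abstract: at each
    denoising step, sweep causally over all blocks."""
    return [(b, s) for s in range(num_steps) for b in range(num_blocks)]


def pipelined_waves(num_blocks, num_steps):
    """Group (block, step) tasks into waves that could run concurrently.

    ASSUMED dependency rule (hypothetical, not stated in the source):
    task (b, s) needs (b, s - 1), the same block one step earlier, and
    (b - 1, s), the previous block at the same noise level.
    Under that rule, tasks sharing the same b + s are independent.
    """
    waves = defaultdict(list)
    for b in range(num_blocks):
        for s in range(num_steps):
            waves[b + s].append((b, s))
    return [waves[t] for t in sorted(waves)]


if __name__ == "__main__":
    # Illustrative sizes only; 4 steps mirrors the "4-step setting".
    for wave in pipelined_waves(num_blocks=5, num_steps=4):
        print(wave)
```

Under the assumed dependency rule, a grid of `num_blocks` by `num_steps` tasks collapses into `num_blocks + num_steps - 1` waves. That count is an idealized property of this toy schedule, not a measurement, and it should not be read as an explanation of the reported 1.8× wall-clock figure.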
