
HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

March 9, 2026
Authors: Kai Zou, Dian Zheng, Hongbo Liu, Tiankai Hang, Bin Liu, Nenghai Yu
cs.AI

Abstract

Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8× wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
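The reversed generation order described above can be illustrated by comparing the two (block, step) visitation schedules. A minimal sketch in Python, assuming `num_blocks` video chunks and `num_steps` denoising steps; the function names are hypothetical and stand in for the actual scheduler, not the paper's implementation:

```python
def conventional_schedule(num_blocks, num_steps):
    """Block-major order: fully denoise block b before starting block b+1,
    so every later block conditions on highly denoised (clean) context."""
    return [(b, s) for b in range(num_blocks) for s in range(num_steps)]


def hiar_schedule(num_blocks, num_steps):
    """Step-major (hierarchical) order: at each noise level, sweep causally
    across all blocks, so each block's context sits at the same noise level
    as the block itself."""
    return [(b, s) for s in range(num_steps) for b in range(num_blocks)]

# In the step-major order, block b at step s depends only on blocks
# 0..b-1 at the same step s, so consecutive blocks can occupy adjacent
# pipeline stages simultaneously -- the basis of the pipelined inference.
```

Both schedules visit the same set of (block, step) pairs; only the ordering differs, and it is this reordering that yields same-noise-level conditioning and admits pipelined execution.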