HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
March 9, 2026
Authors: Kai Zou, Dian Zheng, Hongbo Liu, Tiankai Hang, Bin Liu, Nenghai Yu
cs.AI
Abstract
Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically unbounded length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts; yet this practice propagates prediction errors with high confidence, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides a sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially across all denoising steps, it performs a causal pass over all blocks at every denoising step, so that each block is always conditioned on context at the same noise level as itself. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8× wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser applied in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20-second generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
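The "low-motion shortcut" argument rests on a standard property of the two KL directions. As a hedged sketch (the weight λ and the exact combination are assumptions; the abstract only states that a forward-KL regulariser is added without interfering with the distillation loss), the combined objective has the shape

$$
\mathcal{L} \;=\; \underbrace{D_{\mathrm{KL}}\!\left(q_\theta \,\|\, p\right)}_{\text{reverse KL (mode-seeking)}} \;+\; \lambda \, \underbrace{D_{\mathrm{KL}}\!\left(p \,\|\, q_\theta\right)}_{\text{forward KL (mode-covering)}},
$$

where $p$ is the data (teacher) distribution and $q_\theta$ the student. Reverse KL is minimised by concentrating mass on a single high-density mode, which in video favours low-motion continuations; forward KL penalises assigning near-zero mass to any region where $p$ has mass, counteracting that collapse.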
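The loop inversion at the heart of the abstract can be illustrated with a minimal sketch. The snippet below is purely schematic: `denoise_step` is a hypothetical stand-in for the actual denoiser, and the function names and update rule are assumptions, not the paper's implementation. What it shows is only the control flow: the outer loop runs over denoising steps and the inner loop runs causally over blocks, so every block's context sits at the same noise level as the block itself.

```python
import numpy as np

def denoise_step(block, context, t):
    # Hypothetical single denoising update (NOT the paper's model):
    # nudge the block toward its context mean to mimic conditioning
    # on same-noise-level context at step t.
    return block + 0.5 * (context.mean(axis=0) - block) / (t + 1)

def hierarchical_denoise(init_blocks, num_steps=4):
    """Sketch of the reversed generation order described in the abstract:
    the OUTER loop is over denoising steps and the INNER loop is a causal
    pass over blocks, so each block is always conditioned on context at
    the same noise level as itself (conventional AR diffusion would nest
    the loops the other way around, finishing each block before moving on).
    """
    blocks = [b.copy() for b in init_blocks]
    for t in reversed(range(num_steps)):        # shared noise level t
        new_blocks = []
        for block in blocks:                    # causal pass over blocks
            # Context = preceding blocks, already updated at the SAME
            # level t; the first block conditions only on itself.
            context = np.stack(new_blocks) if new_blocks else block[None]
            new_blocks.append(denoise_step(block, context, t))
        blocks = new_blocks
    return blocks
```

Because block i at step t depends only on blocks 0..i-1 at the same step t, successive steps can overlap across blocks, which is what makes the pipelined parallel inference mentioned in the abstract possible.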