HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Abstract

Autoregressive (AR) diffusion offers a promising framework for generating videos of theoretically infinite length. However, a major challenge is maintaining temporal continuity while preventing the progressive quality degradation caused by error accumulation. To ensure continuity, existing methods typically condition on highly denoised contexts, yet this practice propagates prediction errors with high certainty, thereby exacerbating degradation. In this paper, we argue that a highly clean context is unnecessary. Drawing inspiration from bidirectional diffusion models, which denoise frames at a shared noise level while maintaining coherence, we propose that conditioning on context at the same noise level as the current block provides sufficient signal for temporal consistency while effectively mitigating error propagation. Building on this insight, we propose HiAR, a hierarchical denoising framework that reverses the conventional generation order: instead of completing each block sequentially, it performs causal generation across all blocks at every denoising step, so that each block is always conditioned on context at the same noise level. This hierarchy naturally admits pipelined parallel inference, yielding a 1.8× wall-clock speedup in our 4-step setting. We further observe that self-rollout distillation under this paradigm amplifies a low-motion shortcut inherent to the mode-seeking reverse-KL objective. To counteract this, we introduce a forward-KL regulariser in bidirectional-attention mode, which preserves motion diversity for causal inference without interfering with the distillation loss. On VBench (20s generation), HiAR achieves the best overall score and the lowest temporal drift among all compared methods.
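The reversed generation order and the pipelining it admits can be illustrated with a small scheduling sketch. This is not the authors' implementation: the function names are hypothetical, no model is involved, and the dependency rule is an assumption read off the abstract, namely that the job for block b at denoising step s needs only block b at step s-1 (its own less-denoised state) and block b-1 at step s (context at the same noise level).

```python
def block_first_order(num_blocks, num_steps):
    """Conventional AR order: finish every denoising step of a block
    before moving on to the next block."""
    return [(b, s) for b in range(num_blocks) for s in range(num_steps)]


def step_first_order(num_blocks, num_steps):
    """Reversed order from the abstract: at each denoising step,
    sweep causally over all blocks."""
    return [(b, s) for s in range(num_steps) for b in range(num_blocks)]


def pipelined_wavefronts(num_blocks, num_steps):
    """Group (block, step) jobs into wavefronts whose members are mutually
    independent under the ASSUMED dependency rule:
    (b, s) needs (b, s - 1) and (b - 1, s).
    All jobs with the same b + s can then run concurrently."""
    waves = {}
    for b in range(num_blocks):
        for s in range(num_steps):
            waves.setdefault(b + s, []).append((b, s))
    return [waves[t] for t in sorted(waves)]


if __name__ == "__main__":
    blocks, steps = 5, 4
    serial_jobs = len(step_first_order(blocks, steps))
    waves = pipelined_wavefronts(blocks, steps)
    print("sequential stages:", serial_jobs)
    print("pipelined stages: ", len(waves))
    for t, wave in enumerate(waves):
        print(t, wave)
```

Under this assumed rule, the number of sequential stages drops from num_blocks * num_steps to num_blocks + num_steps - 1. That is an idealized count that ignores communication and load imbalance, so it should not be read as a derivation of the measured 1.8× speedup reported in the abstract.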

