Q-ARVD: Quantization for Autoregressive Video Diffusion Models

Abstract

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite this potential, their substantial inference cost remains a major obstacle to practical deployment, which makes model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs yields suboptimal performance and reveals quantization behavior that differs from that observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs. (C1) Highly unbalanced frame-wise quantization sensitivity: error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights: weight distributions exhibit pronounced outlier channels whose patterns vary substantially across layer types and block depths. To address these challenges, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality-aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces outlier-aware adaptive dual-scale quantization, which automatically detects the presence and number of outlier channels in an arbitrary layer and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.
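
To make the two ideas concrete, the following is a minimal sketch and not the Q-ARVD implementation, which the abstract does not specify. It assumes symmetric uniform quantization with an absmax scale. The median-based detection rule, the `ratio` threshold, all function names, and the frame weights are hypothetical choices made only for illustration. The sketch shows (a) how giving outlier channels their own scale can keep normal channels from collapsing to zero, and (b) how per-frame errors might enter an objective with unequal weights.

```python
import statistics


def fake_quantize(values, scale, bits=4):
    """Symmetric uniform quantize-dequantize of a list with one shared scale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [min(qmax, max(qmin, round(v / scale))) * scale for v in values]


def absmax_scale(channels, bits=4):
    """Scale mapping the largest magnitude in `channels` to the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for ch in channels for v in ch), default=0.0)
    return peak / qmax if peak > 0 else 1.0


def detect_outlier_channels(weight, ratio=3.0):
    """Hypothetical detector: flag channels whose absmax exceeds ratio * median absmax."""
    peaks = [max(abs(v) for v in ch) for ch in weight]
    threshold = ratio * statistics.median(peaks)
    return {i for i, p in enumerate(peaks) if p > threshold}


def dual_scale_quantize(weight, bits=4, ratio=3.0):
    """Quantize normal and outlier channels with two separate scales (assumed scheme)."""
    outliers = detect_outlier_channels(weight, ratio)
    normal = [ch for i, ch in enumerate(weight) if i not in outliers]
    extreme = [ch for i, ch in enumerate(weight) if i in outliers]
    s_normal = absmax_scale(normal, bits)
    s_outlier = absmax_scale(extreme, bits)
    quantized = [
        fake_quantize(ch, s_outlier if i in outliers else s_normal, bits)
        for i, ch in enumerate(weight)
    ]
    return quantized, sorted(outliers)


def single_scale_quantize(weight, bits=4):
    """Baseline: one absmax scale shared by every channel."""
    scale = absmax_scale(weight, bits)
    return [fake_quantize(ch, scale, bits) for ch in weight]


def frame_weighted_error(reference, output, frame_weights):
    """Weighted sum of per-frame mean squared errors (weights are user-supplied)."""
    total = 0.0
    for ref, out, w in zip(reference, output, frame_weights):
        total += w * sum((r - o) ** 2 for r, o in zip(ref, out)) / len(ref)
    return total


# Toy weight matrix: rows are channels; row 2 is a deliberate outlier channel.
weight = [
    [0.10, -0.20, 0.15, 0.05],
    [0.12, 0.08, -0.18, 0.02],
    [4.00, -3.50, 2.00, 1.00],
    [-0.07, 0.19, 0.11, -0.14],
]
dual, outlier_ids = dual_scale_quantize(weight)
single = single_scale_quantize(weight)

# Treat each channel as a "frame" with equal weight to compare total error.
uniform = [1.0] * len(weight)
dual_error = frame_weighted_error(weight, dual, uniform)
single_error = frame_weighted_error(weight, single, uniform)
```

In this toy case the single shared scale is set by the outlier channel, so every value in the normal channels rounds to zero, whereas the dual-scale variant quantizes them on a much finer grid. How Q-ARVD actually detects outliers, and how it derives frame weights from final quality, would need to be taken from the paper itself.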