Speculative Decoding for Autoregressive Video Generation

Abstract

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be adapted effectively to autoregressive video generation remains an open question: video blocks are continuous spatiotemporal tensors with no token-level distribution, so exact rejection sampling is not possible. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B-parameter drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation, that is, taking the minimum per-frame reward to catch single-frame artifacts that averaging would mask. Blocks scoring above a fixed threshold τ are accepted into the 14B-parameter target's KV cache; the rest are regenerated by the target. Two additional design choices prove critical: the first block is always force-rejected to anchor scene composition, and τ serves as a single knob that traces a smooth quality-speed Pareto frontier. On 1003 MovieGenVideoBench prompts (832x480), SDVG retains 98.1% of target-only VisionReward quality (0.0773 vs. 0.0788) at a 1.59x speedup with τ = -0.7, and reaches a 2.09x speedup at 95.7% quality retention, while consistently outperforming draft-only generation by more than 17%. The framework is training-free, requires no architectural changes, and can be seamlessly integrated into existing autoregressive video generation pipelines.
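The routing rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the callables `draft`, `target`, and `score_frames` are hypothetical stand-ins for the 1.3B drafter, the 14B target, and per-frame ImageReward scoring of a VAE-decoded block, and the toy blocks below carry made-up per-frame rewards instead of real video tensors. The sketch also assumes that block 0 goes straight to the target without being drafted, and that a score exactly equal to the threshold is rejected; the abstract does not settle either detail.

```python
from typing import Callable, List, Sequence, Tuple

Block = Tuple[str, int, List[float]]  # (producer, index, toy per-frame rewards)


def worst_frame_score(frame_rewards: Sequence[float]) -> float:
    """Aggregate per-frame rewards by taking the minimum (worst frame)."""
    if not frame_rewards:
        raise ValueError("a block must contain at least one frame")
    return min(frame_rewards)


def route_blocks(
    num_blocks: int,
    draft: Callable[[int, List[Block]], Block],
    target: Callable[[int, List[Block]], Block],
    score_frames: Callable[[Block], Sequence[float]],
    tau: float,
) -> Tuple[List[Block], List[str]]:
    """Generate blocks in order, keeping a drafted block only if it clears tau."""
    history: List[Block] = []   # stands in for the target's KV cache
    decisions: List[str] = []
    for i in range(num_blocks):
        if i == 0:
            # Force-reject the first block: the target anchors scene composition.
            # (Assumption: the draft for block 0 is skipped entirely.)
            block = target(i, history)
            decisions.append("target")
        else:
            candidate = draft(i, history)
            # Assumption: strict comparison, so a score equal to tau is rejected.
            if worst_frame_score(score_frames(candidate)) > tau:
                block = candidate
                decisions.append("draft")
            else:
                block = target(i, history)
                decisions.append("target")
        history.append(block)
    return history, decisions


# Toy stand-ins with made-up rewards, for illustration only.
TOY_DRAFT_REWARDS = {
    1: [0.2, -0.1, 0.4],
    2: [0.9, -1.3, 0.8],   # one bad frame; the mean (about 0.13) would hide it
    3: [-0.5, -0.6, -0.2],
}


def toy_draft(i: int, history: List[Block]) -> Block:
    return ("draft", i, TOY_DRAFT_REWARDS[i])


def toy_target(i: int, history: List[Block]) -> Block:
    return ("target", i, [0.0, 0.0, 0.0])


def toy_score_frames(block: Block) -> Sequence[float]:
    return block[2]


blocks, decisions = route_blocks(4, toy_draft, toy_target, toy_score_frames, tau=-0.7)
print(decisions)  # ['target', 'draft', 'target', 'draft']

# Quality retention reported in the abstract: 0.0773 / 0.0788 is about 98.1%.
retention_percent = round(100 * 0.0773 / 0.0788, 1)
print(retention_percent)  # 98.1
```

In the toy run, block 2 is sent back to the target even though its mean reward is positive, which is the case worst-frame aggregation is meant to catch. Raising `tau` rejects more drafted blocks (higher quality, lower speedup), and lowering it accepts more.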
