Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

Abstract

Distillation-based acceleration has become foundational for making autoregressive streaming video diffusion models practical, with distribution matching distillation (DMD) as the de facto choice. Existing methods, however, train the student to match the teacher's output indiscriminately, treating every rollout, frame, and pixel as equally reliable supervision. We argue that this caps distilled quality because it overlooks two complementary axes of variance in DMD supervision: **Inter-Reliability**, the variation in how reliable the supervision is across student rollouts, and **Intra-Perplexity**, the uneven contribution of spatial regions and temporal frames to the remaining room for quality improvement. The uniformly weighted objective thus conflates two questions: whether to learn from each rollout, and where to concentrate optimization within it. To address this, we propose **Stream-R1**, a Reliability-Perplexity Aware Reward Distillation framework that adaptively reweights the distillation objective at both the rollout level and the spatiotemporal-element level through a single shared reward-guided mechanism. At the Inter-Reliability level, Stream-R1 rescales each rollout's loss by an exponential of a pretrained video reward score, so that rollouts with reliable supervision dominate optimization. At the Intra-Perplexity level, it back-propagates through the same reward model to extract per-pixel gradient saliency, which is factored into spatial and temporal weights that concentrate optimization pressure on the regions and frames where refinement yields the largest expected gain. An adaptive balancing mechanism prevents any single quality axis (visual quality, motion quality, or text alignment) from dominating. On standard streaming video generation benchmarks, Stream-R1 attains consistent improvements over distillation baselines on all three dimensions, without architectural modification or additional inference cost.
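The two reweighting levels described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The temperature `beta`, the mean-one normalization, and the choice to factor saliency by averaging over pixels (temporal) and over frames (spatial) are all assumptions made for illustration. The saliency map is taken as a precomputed input rather than obtained by back-propagating through a reward model, and the adaptive balancing across quality axes is not covered.

```python
import math

def rollout_weights(rewards, beta=1.0):
    """Inter-Reliability: weight each rollout by exp(beta * reward), scaled to mean 1.

    beta and the mean-1 normalization are illustrative assumptions.
    """
    # Subtracting the max reward avoids overflow in exp(); it cancels after normalization.
    top = max(rewards)
    raw = [math.exp(beta * (r - top)) for r in rewards]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def factor_saliency(saliency, eps=1e-8):
    """Intra-Perplexity: split a [T][H][W] saliency map into temporal and spatial weights.

    Factoring by marginal averages is an assumption, not the paper's stated rule.
    """
    T, H, W = len(saliency), len(saliency[0]), len(saliency[0][0])
    # Temporal weight: average absolute saliency of each frame.
    temporal = [sum(abs(v) for row in frame for v in row) / (H * W) for frame in saliency]
    # Spatial weight: average absolute saliency of each pixel location across frames.
    spatial = [[sum(abs(saliency[t][h][w]) for t in range(T)) / T for w in range(W)]
               for h in range(H)]
    # Scale both sets of weights to (approximately) mean 1.
    t_mean = sum(temporal) / T + eps
    s_mean = sum(v for row in spatial for v in row) / (H * W) + eps
    temporal = [v / t_mean for v in temporal]
    spatial = [[v / s_mean for v in row] for row in spatial]
    return temporal, spatial

def reweighted_loss(elem_loss, rollout_w, temporal, spatial):
    """Rollout weight times the mean per-element loss scaled by temporal and spatial weights."""
    T, H, W = len(elem_loss), len(elem_loss[0]), len(elem_loss[0][0])
    total = sum(elem_loss[t][h][w] * temporal[t] * spatial[h][w]
                for t in range(T) for h in range(H) for w in range(W))
    return rollout_w * total / (T * H * W)
```

With uniform saliency every weight is close to 1 and the sketch falls back to a plain mean of the per-element loss, which is the uniform weighting the abstract argues against. In an actual training loop the saliency would come from gradients of the reward with respect to the generated pixels.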

