Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Abstract

Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures: they typically require either expensive re-distillation or solver-coupled reverse-process optimization, both of which introduce considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome these bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
